AE-smnsMLC: Multi-Label Classification with Semantic Matching and Negative Label Sampling for Product Attribute Value Extraction
Abstract
Product attribute value extraction plays an important role in many real-world e-Commerce applications such as product search and recommendation. Previous methods treat it as a sequence labeling task that requires additional annotation of the positions of values in the product text. This limits their application to real-world scenarios in which only attribute values are weakly annotated for each product, without their positions. Moreover, these methods only use the product text (i.e., product title and description) and do not consider the semantic connection between the multiple attribute values of a given product and its text, which can help attribute value extraction. In this paper, we reformulate this task as a multi-label classification task that can be applied in real-world scenarios where only annotations of attribute values are available to train models (i.e., positional annotations of attribute values are not available). We propose a classification model with semantic matching and negative label sampling for attribute value extraction. Semantic matching aims to capture the semantic interactions between the attribute values of a given product and its text. Negative label sampling aims to enhance the model's ability to distinguish similar values belonging to the same attribute. Experimental results on three subsets of a large real-world e-Commerce dataset demonstrate the effectiveness and superiority of our proposed model.
Index Terms:
product attribute value extraction, multi-label classification, semantic matching
I Introduction
Product attribute value extraction is a fundamental NLP task in e-Commerce that can help improve the customer shopping experience, because it can be utilized for downstream tasks such as product search, product retrieval and recommendation. The most recent state-of-the-art models proposed for this task include [21, 18, 9, 19, 23]. OpenTag [21] is a sequence labeling model using BiLSTM-CRF with an attention mechanism. TXtract [9] makes use of the hierarchical taxonomy of categories to help attribute value extraction. AdaTag [19] introduces an adaptive decoder for each attribute to share knowledge across different attributes. However, all these models are sequence labeling models that require annotation of the positions of attribute values in the product text for training. In other words, they cannot be trained or used in cases where positional annotation of values is not available. This requirement introduces more human annotation workload than just annotating the existence of attribute values. Moreover, in some real-world scenarios, the products on a shopping platform or website are already weakly annotated with their attribute values by the merchants when they are created. Besides, the above models ignore the semantic connection between the attribute values of a product and its text (i.e., title and description). In other words, they do not consider the fact that all existing attribute values of a given product come from its title and description. This fact means that, for a given product, the semantic meaning of all its existing attribute values can represent this unique product. For example, as shown in Fig. 1, all the attribute values of the product water bottle come from its title and description (red part); each of these attribute values is a short piece of its text. Intuitively, besides being represented by the semantic meaning of its title and description, the water bottle product can also be uniquely represented by the combination of all its attribute values. In other words, the combined semantic meaning of all its attribute values should be as close as possible to the semantic meaning of its text (i.e., title and description).

Based on the above intuition, we reformulate product attribute value extraction as a multi-label text classification task, which requires less human annotation as mentioned before and makes a direct semantic connection between the attribute values of a product and its text. The multi-label classification method takes the title and description of a given product as input and predicts multiple labels (i.e., attribute values) for this product.
HTCInfoMax [3] achieves good performance on multi-label text classification; it is designed to predict multiple categories of a given piece of text such as a news article or a scientific paper abstract. However, whereas the attribute values of a given product are exact parts of its title and description, the categories of a news article or abstract summarize the given text. Thus HTCInfoMax can be applied to attribute value extraction but is not well suited for it.
To address the limitation of HTCInfoMax for attribute value extraction, we propose a multi-label classification model with semantic matching and negative label sampling, called AE-smnsMLC. Specifically, first, similar to HTCInfoMax, besides the text encoder, AE-smnsMLC introduces a label encoder to learn the semantic meaning of attribute values by taking their texts as input. Second, a semantic matching module is designed to make a direct connection between the text of a given product and its attribute values by pushing the representation of the attribute values towards the representation of the product text. Third, attribute values belonging to different attributes (e.g., “1 liter” of attribute “capacity” and “black” of attribute “color”) are easier for the model to distinguish than attribute values of the same attribute (e.g., “1 liter” and “2 liter” of the attribute “capacity”). Thus a negative label sampling method is devised to improve the model’s ability to distinguish similar attribute values belonging to the same attribute. Our code is available at: https://github.com/zhongfendeng/AE-smnsMLC.
Our work’s main contributions are as follows: 1) To the best of our knowledge, this is the first work to design a multi-label classification model (AE-smnsMLC) for product attribute value extraction that introduces a label encoder to learn the semantic meaning of attribute values from their text. 2) We propose a semantic matching method in AE-smnsMLC to enable direct interaction between the text of a given product and all its attribute values. 3) We propose a negative label sampling method to help the model distinguish similar attribute values belonging to the same attribute. 4) Extensive experiments on three subsets of a real-world e-Commerce dataset demonstrate the effectiveness of our proposed model.
II Problem Formulation
Given the textual title and description of a product denoted as $X = \{x_1, x_2, \dots, x_n\}$, where $n$ is its length, the model aims at identifying multiple labels (i.e., attribute values) denoted as $Y = \{y_1, y_2, \dots, y_m\}$ for this product, where $m$ is the number of attribute values this product has. Each of the labels in $Y$ has a pre-defined index which can uniquely represent an attribute value. Suppose there are $C$ attribute values (i.e., labels) in total in a dataset; the multi-label classification method for attribute value extraction aims at predicting $m$ labels out of the $C$ labels for each product.
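To make the formulation concrete, the following minimal sketch shows what a training instance looks like under this reformulation; the product text, attribute values and label indices are hypothetical toy examples, not taken from the dataset.

```python
# Hypothetical training instance under the multi-label formulation:
# the input is the product text, the target is a multi-hot vector over all C labels.
label_vocab = {
    "Capacity::1 Liter": 0, "Capacity::2 Liter": 1,
    "Color::Black": 2, "Material::Stainless Steel": 3,
}  # C = 4 in this toy example

product_text = "Stainless steel water bottle, 1 liter, black, with leak-proof lid ..."
gold_values = ["Capacity::1 Liter", "Color::Black", "Material::Stainless Steel"]  # m = 3

target = [0] * len(label_vocab)          # C-dimensional multi-hot target
for value in gold_values:
    target[label_vocab[value]] = 1
print(target)                            # [1, 0, 1, 1]
```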
III Methodology
III-A Overview
As both the product text and the labels (i.e., attribute values) contain textual information, our proposed model AE-smnsMLC makes use of both product text and label text information. Specifically, we design two encoders, a text encoder and a label encoder, to encode the product text and the label text respectively. Then a semantic matching approach with negative label sampling is designed to help both encoders learn better representations for the product text and the labels. The goal of semantic matching is to push the label feature of a given product towards its text feature, which contains richer and more complete information about the product. The overall architecture of our proposed method is shown in Fig. 2. The major components of our model include the text encoder, label encoder, semantic matching module, negative label sampling module, label selector and loss weight estimator. Other components such as the predictor with attention mechanism and label prior matching are kept the same as in HTCInfoMax [3].

III-B Text Encoder and Label Encoder
We use the pre-trained BERT-base model as part of the text encoder, which takes the product title and description as input. The output of BERT is fed into a convolutional layer which performs a convolution along the sequence length and outputs the final text representation for the product, denoted as $H \in \mathbb{R}^{l \times d}$, where $l$ is the length of the sequence after convolution and $d$ is the dimension of the hidden states of BERT.
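A minimal PyTorch sketch of such a text encoder follows; the BERT checkpoint name is a placeholder (the paper's BERT is an in-house Japanese model), and the convolution's channel and padding settings are our assumptions since only the kernel size and hidden dimension are reported later in Sec. IV-C.

```python
import torch.nn as nn
from transformers import BertModel

class TextEncoder(nn.Module):
    """Sketch: frozen BERT-base followed by a 1-D convolution along the sequence."""
    def __init__(self, bert_name="cl-tohoku/bert-base-japanese",  # placeholder checkpoint
                 hidden=768, kernel_size=4):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        for p in self.bert.parameters():          # BERT parameters are kept fixed (Sec. IV-B)
            p.requires_grad = False
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=kernel_size)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state   # (B, n, d)
        # Convolve along the sequence length to obtain the final text representation H.
        return self.conv(h.transpose(1, 2)).transpose(1, 2)              # (B, l, d)
```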
Each label has specific text describing its meaning, and it is intuitive to make use of this text to learn better label representations. Thus, we introduce a label encoder which takes advantage of the label text to learn the semantic representations $L \in \mathbb{R}^{C \times d_l}$ for all labels, where $d_l$ is the dimension of the label embedding. The process and structure of the label encoder are shown in Fig. 3.
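Since Fig. 3 is not reproduced here, the following is only a rough sketch of such a label encoder: an embedding table over all C labels that can be initialized from BERT encodings of each label's text. The exact layers in Fig. 3 may differ from this sketch.

```python
import torch
import torch.nn as nn

class LabelEncoder(nn.Module):
    """Sketch of a label encoder: one learnable vector per label, optionally
    initialized from BERT [CLS] embeddings of each label's text
    (attribute name plus attribute value)."""
    def __init__(self, num_labels, label_dim=768, pretrained_label_emb=None):
        super().__init__()
        self.embedding = nn.Embedding(num_labels, label_dim)
        if pretrained_label_emb is not None:           # tensor of shape (C, label_dim)
            self.embedding.weight.data.copy_(pretrained_label_emb)

    def forward(self):
        return self.embedding.weight                   # label representations L: (C, d_l)
```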

III-C Semantic Matching
Since the product text contains more complete information about the product, it represents the semantic meaning of the current product well. In contrast, the text for each label is relatively short, consisting of only a few short pieces of text, namely the attribute name and the specific attribute value. However, for each product, the representation learned from all of its attribute values (i.e., labels) should also uniquely identify this product. We therefore design a semantic matching method to push the labels' representation of a product towards its text feature learned from the complete product text. In this way, semantic matching helps both the text encoder and the label encoder learn better text and label representations during training.

The details of the semantic matching method are shown in Fig. 4. The left half shows the overall process of semantic matching, while the right half shows the details of the max pooling used in it. On the text side, the semantic matching module applies mean pooling on $H$ to obtain the pooled text feature $\bar{t}$. On the label side, it first uses a label selector to get the ground truth labels' representation $L_g \in \mathbb{R}^{m \times d_l}$ for the current product from $L$, where $m$ is the number of ground truth labels of the current product. Then max pooling is used to obtain the combined label representation $\bar{v}$ for the current product, as shown in the right half of Fig. 4. After this, the semantic matching module calculates the similarity between the combined label embedding and the text feature of the current product; the goal is to push the label embedding as close as possible to the text feature in the feature space, which helps the label encoder learn informative representations for all labels. Thus it enhances the model's ability to correctly identify multiple attribute values for each product. We use cosine similarity here for simplicity. The semantic matching loss is $\mathcal{L}_{sm} = -\frac{1}{N}\sum_{i=1}^{N}\cos(\bar{v}_i, \bar{t}_i)$, where $N$ is the total number of products (i.e., instances) in the training set.
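The following hedged PyTorch sketch illustrates this computation (mean pooling of the text feature, label selection with max pooling, and the cosine-based loss); tensor shapes and names follow our notation above rather than the released code.

```python
import torch
import torch.nn.functional as F

def semantic_matching_loss(text_feat, label_feat, gold_mask):
    """text_feat: (B, l, d) output of the text encoder
    label_feat:   (C, d)    representations of all labels from the label encoder
    gold_mask:    (B, C)    multi-hot mask of each product's ground-truth labels"""
    t = text_feat.mean(dim=1)                                   # pooled text feature (B, d)
    # Label selector: keep only ground-truth labels, then max-pool them per product.
    expanded = label_feat.unsqueeze(0).expand(gold_mask.size(0), -1, -1)
    selected = expanded.masked_fill(gold_mask.unsqueeze(-1) == 0, float("-inf"))
    v = selected.max(dim=1).values                              # combined label representation (B, d)
    # Maximizing cosine similarity between v and t is equivalent to minimizing this loss.
    return -F.cosine_similarity(v, t, dim=-1).mean()
```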


III-D Negative Label Sampling
The labels (i.e., attribute values) belonging to the same attribute may be difficult for models to distinguish because they are much closer to each other than to labels belonging to a different attribute. Fig. 5 shows such an example. The label “1 Liter” is much closer to the label “2 Liter” of the same attribute “Capacity” than to the label “Black”, which belongs to another attribute “Color”. In other words, the difference between labels within the same attribute (i.e., intra-group difference) is smaller than the difference between labels belonging to two different attributes (i.e., inter-group difference).
To enhance the model's ability to distinguish similar labels (i.e., attribute values) belonging to the same attribute, we propose a negative label sampling method for training the model. The whole process is shown in Fig. 6 and the specific sampling approach for negative labels is shown in Fig. 7. Specifically, we first construct a mapping dictionary $D$ which stores all the labels (i.e., attribute values) belonging to the same attribute. Then the same label selector as in Fig. 4 is used to get the ground truth attribute names and the corresponding labels of each of these attributes for the current product. They are taken as input to the negative label sampling algorithm together with the mapping dictionary $D$. The output of this sampling algorithm is a list of sampled negative labels for the current product, which is used to obtain the negative labels' feature matrix $L_{neg}$ from the feature matrix $L$ of all labels. Mean pooling is applied to the negative labels' features to get the final compact negative label representation $\bar{u}$ for the current product. Similar to the semantic matching module, it calculates the similarity between this compact negative label representation and the pooled text feature of the current product. The goal is to push the negative labels as far away as possible from the current product, which helps the model learn more discriminative feature representations for similar labels belonging to the same attribute. Thus, during training, the negative label sampling method aims at minimizing the similarity score between the negative labels and the text feature of each product. The loss of negative label sampling is $\mathcal{L}_{ns} = \frac{1}{N}\sum_{i=1}^{N}\cos(\bar{u}_i, \bar{t}_i)$. A sketch of the sampling step is given after the algorithm's input/output specification below.
Input: The attribute name to attribute values mapping dictionary $D$, the ground truth labels' ID list and the attribute names' ID list of the current product
Output: The negative labels' ID list for the current product
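Below is a hedged sketch of this sampling step and the accompanying loss, based on our reading of Figs. 6-7 and the input/output above; the function names, the optional per-attribute cap, and the per-product tensor handling are assumptions rather than the released implementation.

```python
import random
import torch.nn.functional as F

def sample_negative_labels(attr_to_values, gold_label_ids, gold_attr_ids, k_per_attr=None):
    """attr_to_values: dict D mapping attribute ID -> list of label IDs under that attribute
    gold_label_ids:    ground-truth label IDs of the current product
    gold_attr_ids:     IDs of the attributes those labels belong to
    k_per_attr:        optional cap on negatives drawn per attribute (assumption)"""
    gold = set(gold_label_ids)
    negatives = []
    for attr_id in gold_attr_ids:
        # Negatives come from the same attribute group but are not ground-truth values.
        candidates = [v for v in attr_to_values[attr_id] if v not in gold]
        if k_per_attr is not None and len(candidates) > k_per_attr:
            candidates = random.sample(candidates, k_per_attr)
        negatives.extend(candidates)
    return negatives

def negative_sampling_loss(text_feat, label_feat, negative_ids):
    """Push the mean-pooled negative-label representation away from the pooled text
    feature of one product. text_feat: (l, d); label_feat: (C, d); negative_ids: list."""
    t = text_feat.mean(dim=0)                       # pooled text feature (d,)
    n = label_feat[negative_ids].mean(dim=0)        # mean pooling over negative labels (d,)
    return F.cosine_similarity(n, t, dim=0)         # similarity score, minimized during training
```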
III-E Final Loss Function
III-E1 Loss of AE-smnsMLC
As stated before, the semantic matching method and the negative label sampling method introduce two losses, the semantic matching loss $\mathcal{L}_{sm}$ and the negative label sampling loss $\mathcal{L}_{ns}$. The final loss is calculated in Eq. (1), where the label prior matching loss $\mathcal{L}_{prior}$, the loss weight $\lambda$ and the binary cross entropy loss $\mathcal{L}_{bce}$ are inherited from HTCInfoMax.
$\mathcal{L} = \mathcal{L}_{bce} + \lambda \mathcal{L}_{prior} + \mathcal{L}_{sm} + \mathcal{L}_{ns}$ (1)
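As a small illustration, the combination in Eq. (1) can be written as below; the single weight on the label prior term follows our reconstruction of the equation, and its placement and default value are assumptions.

```python
def final_loss(bce_loss, prior_loss, sm_loss, ns_loss, lam=1.0):
    """Combine the four training losses as in Eq. (1); lam (the HTCInfoMax loss
    weight) and its default value here are assumptions, not reported settings."""
    return bce_loss + lam * prior_loss + sm_loss + ns_loss
```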
III-E2 Loss of the Variant AE-smMLC
We also design a variant of our proposed model, called AE-smMLC, which removes the negative label sampling module from AE-smnsMLC. Its final loss is $\mathcal{L} = \mathcal{L}_{bce} + \lambda \mathcal{L}_{prior} + \mathcal{L}_{sm}$.
IV Experiment and Analysis
IV-A Datasets and Evaluation Metrics
We use a real-world e-Commerce dataset (not public due to commercial reasons) possessed by a commercial company to conduct experiments for all models. All the labels (i.e., attribute values) are annotated by the merchants when they create their products on the company's website, so no additional human annotation is needed to train our model. The text and labels are in Japanese. Due to its large size (47 million products), we sample products from three different domains to form three datasets, whose statistics are shown in Table I. The standard evaluation metrics of multi-label classification, Precision (P), Recall (R), Micro-F1 (MiF1) and Macro-F1 (MaF1), are used to evaluate model performance.
Datasets | Domain | A (#attributes) | L (#labels) | AvgL (avg. labels per product) | Train | Val | Test
Dataset 1 | Gardening | 12 | 154 | 5.83 | 4,256 | 69 | 60 |
Dataset 2 | Plants | 15 | 244 | 4.92 | 4,160 | 219 | 196 |
Dataset 3 | Gardening Tools | 33 | 218 | 1.24 | 4,160 | 1,254 | 1,268 |
Group | Models | Dataset 1 (Gardening) | | | | Dataset 2 (Plants) | | | | Dataset 3 (Gardening Tools) | | |
 | | P | R | MiF1 | MaF1 | P | R | MiF1 | MaF1 | P | R | MiF1 | MaF1
BERT-based Baselines | BERT+Linear Layer [5] | 80.23 | 39.43 | 52.87 | 17.37 | 73.82 | 26.01 | 38.47 | 11.75 | 83.48 | 47.40 | 60.47 | 15.88
 | BERT+CNN+Linear Layer [5] | 78.50 | 68.86 | 73.36 | 25.61 | 71.23 | 58.24 | 64.08 | 21.07 | 81.64 | 72.12 | 76.58 | 30.41
HTCInfoMax-based Baselines | HTCInfoMax (TE:BERT) [3] | 66.78 | 58.57 | 62.40 | 19.20 | 53.30 | 38.45 | 44.67 | 13.78 | 75.85 | 65.08 | 70.05 | 25.28
 | HTCInfoMax (TE:BERT+CNN) [3] | 83.57 | 68.29 | 75.16 | 26.79 | 67.47 | 60.83 | 63.98 | 20.57 | 77.49 | 74.40 | 75.91 | 29.90
 | HTCInfoMax (TE:BERT+CNN)-Pretrained CLS Label Emb [3] | 86.48 | 69.43 | 77.02 | 26.61 | 78.50 | 57.51 | 66.39 | 22.79 | 85.13 | 74.40 | 79.40 | 30.91
Ours | AE-smMLC | 81.85 | 73.43 | 77.41 | 30.16 | 78.40 | 57.93 | 66.63 | 23.18 | 82.31 | 76.36 | 79.22 | 31.87
 | AE-smnsMLC | 81.82 | 74.57 | 78.03 | 29.37 | 75.91 | 62.38 | 68.49 | 23.78 | 85.42 | 76.11 | 80.50 | 30.41
IV-B Baselines
Sequence labeling models require labor-intensive and time-consuming human annotation for every token in every product's text (i.e., title and description), which is not available in the stated dataset. Thus no sequence labeling models can be trained on this dataset, and we only consider classification models as baselines. Specifically, two groups of baselines are compared with our model AE-smnsMLC, described as follows. As in our model, the parameters of the pre-trained BERT-base model used in all these baselines are fixed because fine-tuning exceeds our GPU memory. The pre-trained BERT is trained on 10 million product descriptions in Japanese.
IV-B1 BERT-based baselines
This group uses the same pre-trained BERT-base model [5] as our model to encode the product text (i.e., title and description). BERT+Linear Layer feeds the CLS embedding output by BERT into a linear layer which predicts the product's attribute values. BERT+CNN+Linear Layer adds a CNN layer on top of BERT; the output of the CNN is fed into a linear layer for prediction.
IV-B2 HTCInfoMax-based baselines
For a fair comparison with our model, all HTCInfoMax-based baselines [3] adopt the same pre-trained BERT-base model as the text encoder or part of it, and the same label embedding layer as in the label encoder of our model is used as the structure encoder of HTCInfoMax. HTCInfoMax (TE:BERT) uses the pre-trained BERT as the text encoder, with the label embedding layer of the structure encoder randomly initialized. HTCInfoMax (TE:BERT+CNN) stacks an additional CNN layer on top of BERT to form the text encoder; the label embedding layer of the structure encoder is also randomly initialized. HTCInfoMax (TE:BERT+CNN)-Pretrained CLS Label Emb has the same text encoder and label encoder as our model, and the pretrained label embeddings generated by BERT are used to initialize the label embedding layer.
IV-C Experimental Setting
All baselines and our model adopt the same settings for all hyper-parameters. Specifically, the maximum token sequence length of the product text is set to 256, the dimensions of the text and label embeddings are 768, and the kernel size of the CNN layer is 4. The HTCInfoMax-based baselines are re-implemented by us in PyTorch based on the released code [3]. We also implement our model and the BERT-based baselines in PyTorch. All experiments are conducted on a single NVIDIA Quadro M6000 GPU server with 12 GB of GPU memory.
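For reference, the reported hyper-parameters can be collected into a configuration like the one below; the key names are ours, only the values come from the text.

```python
# Hypothetical configuration mirroring the settings reported above; key names are ours.
config = {
    "max_seq_len": 256,       # maximum token length of product title + description
    "text_emb_dim": 768,      # BERT-base hidden size
    "label_emb_dim": 768,     # label embedding dimension
    "cnn_kernel_size": 4,     # kernel size of the CNN layer in the text encoder
    "freeze_bert": True,      # BERT parameters are fixed (no fine-tuning)
}
```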
IV-D Experimental Results and Analysis
The experimental results of our model and the baselines on the three datasets are shown in Table II. Our proposed model AE-smnsMLC generally outperforms all baselines on Recall, Micro-F1 and Macro-F1, including the strong HTCInfoMax-based baseline that uses pretrained CLS label embeddings to initialize its structure encoder (i.e., the third from last row in Table II). Although this baseline has a higher precision score on Gardening and Plants, our model obtains much better Recall, Micro-F1 and Macro-F1 scores, with improvements of 7.40%, 1.31% and 10.37% on Gardening and of 8.47%, 3.16% and 4.34% on Plants, respectively. This demonstrates that the semantic matching and negative label sampling methods help our model perform better on all labels, since Micro-F1 and Macro-F1 are more reliable metrics that evaluate models more comprehensively. Furthermore, our model outperforms this strongest baseline on Precision, Recall and Micro-F1 on the Gardening Tools dataset by 0.34%, 2.30% and 1.39%, respectively, which also verifies its superiority. In addition, the consistent performance of our model across the three datasets demonstrates that it can learn better text representations and better representations for all attribute values (i.e., labels) by matching the semantic representation of all ground truth attribute values of a given product towards its text representation. In other words, this consistency validates the effectiveness of our proposed semantic matching method.
Besides, from the comparison among the HTCInfoMax-based baselines, we can see that pretrained label embeddings generated from the text of attribute values (e.g., “stainless steel”) help improve performance. This indicates that label text information is helpful for product attribute value extraction when it is reformulated as a multi-label classification task. In addition, the baselines with a CNN layer perform much better than those without, which indicates the importance of the CNN layer in the text encoder.
IV-E Ablation Study
We conduct ablation studies to examine the effect of different components of our model, such as negative label sampling and the pooling method. We also design a third variant of our model, called “AE-smnsMLC w/o LabelPrior”, which removes the label prior matching module. The results of the ablation studies are shown in Table III. For each of the three models in this table, two pooling methods for the ground truth labels in the semantic matching module are evaluated: max pooling as shown in Fig. 4, and mean pooling, which replaces the max pooling.
Models | SM Pooling | P | R | MiF1 | MaF1
AE-smnsMLC w/o LabelPrior | Mean Pool | 86.10 | 72.57 | 78.76 | 28.91
 | Max Pool | 79.94 | 71.71 | 75.60 | 27.55
AE-smMLC | Mean Pool | 76.07 | 70.86 | 73.37 | 26.96
 | Max Pool | 81.85 | 73.43 | 77.41 | 30.16
AE-smnsMLC | Mean Pool | 79.17 | 76.00 | 77.55 | 27.28
 | Max Pool | 81.82 | 74.57 | 78.03 | 29.37
From Table III, one can see that max pooling generally performs better than mean pooling. Our models with (the last two rows) and without (the first two rows) label prior matching achieve comparable results, which verifies the effectiveness of our proposed model in capturing the semantic connections between the attribute values of each product and its text; it also indicates that label prior matching helps. Furthermore, the comparison between AE-smnsMLC and AE-smMLC in Table III (the last four rows) and Table II (the last two rows) shows that AE-smnsMLC generally performs better than AE-smMLC, which lacks negative label sampling. This demonstrates that negative label sampling helps the model distinguish similar labels by learning more discriminative features for them and thus improves performance.
IV-F Case Study
To further verify the usefulness of negative label sampling, we conduct a case study on the Gardening dataset. Specifically, we select the prediction results of AE-smnsMLC and its variant AE-smMLC on all labels belonging to a given attribute (e.g., “Events/Holiday”) from the Gardening dataset and calculate Micro-F1 scores for these labels. The comparison between our model and its variant without negative label sampling on all labels belonging to the attribute “Events/Holiday” is shown in Fig. 8. One can see that AE-smnsMLC performs better on almost all labels of this attribute, which demonstrates that negative label sampling helps the model learn more discriminative features for labels belonging to the same attribute and thus improves its ability to distinguish similar labels.

V Related Work
V-A Product Attribute Value Extraction and Multi-Label Text Classification
Earlier models for product attribute value extraction are rule-based extraction methods [6, 15]. Later, [12] treats this task as a sequence labeling task, and many neural sequence labeling models with different techniques, such as attention mechanisms, use of the hierarchical product taxonomy, and adaptive CRF decoders, have been designed [21, 18, 11, 9, 19]. Most multi-label text classification models can be categorized into two groups. One group is local models, which build a classifier for each label or for labels at the same level of the label taxonomy [16, 1, 7]. The other group is global models, which build one classifier for all labels [8, 10, 17, 13, 14, 22, 3]; the latter four of these make use of label structure information. In addition, attention-based models have been popular for text classification in recent years [20, 2, 4].
VI Conclusion
We propose a multi-label classification model for product attribute value extraction which can be applied to real-world scenarios in which only attribute values are annotated, without their positional information in the product text. Our proposed model introduces a label encoder, a semantic matching method and a negative label sampling method. Semantic matching aims to capture the semantic interactions between the attribute values of a given product and its text. Negative label sampling helps enhance the model's ability to distinguish similar labels. Experimental results on three subsets of a large real-world e-Commerce dataset demonstrate the superiority of our model for product attribute value extraction.
Acknowledgment
We thank the reviewers for their comments and feedback. This work is supported in part by NSF under grants III-1763325, III-1909323, III-2106758, and SaTC-1930941.
References
- [1] Siddhartha Banerjee, Cem Akkaya, Francisco Perez Sorrosal, and Kostas Tsioutsiouliklis. 2019. Hierarchical transfer learning for multi-label text classification. In Proceedings of the 57th ACL, pages 6295–6300.
- [2] Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, and Inderjit S Dhillon. 2020. Taming pre-trained transformers for extreme multi-label text classification. In Proceedings of the 26th ACM SIGKDD, pages 3163–3171.
- [3] Zhongfen Deng, Hao Peng, Dongxiao He, Jianxin Li, and Philip S. Yu. 2021. Htcinfomax: A global model for hierarchical text classification via information maximization. In Proceedings of the 2021 NAACL:HLT, pages 3259–3265.
- [4] Zhongfen Deng, Hao Peng, Congying Xia, Jianxin Li, Lifang He, and Philip Yu. 2020. Hierarchical bi-directional self-attention networks for paper review rating recommendation. In Proceedings of the 28th COLING, pages 6302–6314.
- [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL, pages 4171–4186.
- [6] Vishrawas Gopalakrishnan, Suresh Parthasarathy Iyengar, Amit Madaan, Rajeev Rastogi, and Srinivasan Sengamedu. 2012. Matching product titles using web-based enrichment. In Proceedings of the 21st ACM CIKM, pages 605–614.
- [7] Wei Huang, Enhong Chen, Qi Liu, Yuying Chen, Zai Huang, Yang Liu, Zhou Zhao, Dan Zhang, and Shijin Wang. 2019. Hierarchical multi-label text classification: An attention-based recurrent network approach. In Proceedings of the 28th ACM CIKM, pages 1051–1060.
- [8] Rie Johnson and Tong Zhang. 2015. Effective use of word order for text categorization with convolutional neural networks. In Proceedings of the 2015 NAACL:HLT, pages 103–112.
- [9] Giannis Karamanolakis, Jun Ma, and Xin Luna Dong. 2020. Txtract: Taxonomy-aware knowledge extraction for thousands of product categories. In Proceedings of the 58th ACL, pages 8489–8502.
- [10] Yuning Mao, Jingjing Tian, Jiawei Han, and Xiang Ren. 2019. Hierarchical text classification with reinforced label assignment. In Proceedings of the 2019 EMNLP-IJCNLP, pages 445–455.
- [11] Kartik Mehta, Ioana Oprea, and Nikhil Rasiwasia. 2021. Latex-numeric: Language agnostic text attribute extraction for numeric attributes. In Proceedings of the 2021 NAACL:HLT: Industry Papers, pages 272–279.
- [12] Ajinkya More. 2016. Attribute extraction from product titles in ecommerce. arXiv preprint arXiv:1608.04670.
- [13] Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song, and Qiang Yang. 2018. Large-scale hierarchical text classification with recursively regularized deep graph-cnn. In Proceedings of the 2018 World Wide Web Conference, pages 1063–1072.
- [14] Hao Peng, Jianxin Li, Senzhang Wang, Lihong Wang, Qiran Gong, Renyu Yang, Bo Li, Philip Yu, and Lifang He. 2019. Hierarchical taxonomy-aware and attentional graph capsule rcnns for large-scale multi-label text classification. IEEE TKDE, 33(6): 2505-2519.
- [15] Damir Vandic, Jan-Willem Van Dam, and Flavius Frasincar. 2012. Faceted product search powered by the semantic web. Decision Support Systems, 53(3):425–437.
- [16] Jonatas Wehrmann, Ricardo Cerri, and Rodrigo Barros. 2018. Hierarchical multi-label classification networks. In ICML, pages 5075–5084.
- [17] Jiawei Wu, Wenhan Xiong, and William Yang Wang. 2019. Learning to learn and predict: A meta-learning approach for multi-label classification. In Proceedings of the 2019 EMNLP-IJCNLP, pages 4354–4364.
- [18] Huimin Xu, Wenting Wang, Xinnian Mao, Xinyu Jiang, and Man Lan. 2019. Scaling up open tagging from tens to thousands: Comprehension empowered attribute value extraction from product title. In Proceedings of the 57th ACL, pages 5214–5223.
- [19] Jun Yan, Nasser Zalmout, Yan Liang, Christan Grant, Xiang Ren, and Xin Luna Dong. 2021. Adatag: Multi-attribute value extraction from product profiles with adaptive decoding. arXiv preprint arXiv:2106.02318.
- [20] Ronghui You, Zihan Zhang, Ziye Wang, Suyang Dai, Hiroshi Mamitsuka, and Shanfeng Zhu. 2019. Attentionxml: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. In NeurIPS, pages 5820–5830.
- [21] Guineng Zheng, Subhabrata Mukherjee, Xin Luna Dong, and Feifei Li. 2018. Opentag: Open attribute value extraction from product profiles. In Proceedings of the 24th ACM SIGKDD, pages 1049–1058.
- [22] Jie Zhou, Chunping Ma, Dingkun Long, Guangwei Xu, Ning Ding, Haoyu Zhang, Pengjun Xie, and Gong shen Liu. 2020. Hierarchy-aware global model for hierarchical text classification. In Proceedings of the 58th ACL, pages 1106–1117.
- [23] Wei-Te Chen, Yandi Xia, and Keiji Shinzato. 2022. Extreme Multi-Label Classification with Label Masking for Product Attribute Value Extraction. In Proceedings of ECNLP 5, pages 134-140.