Overview of CTC 2021: Chinese Text Correction for Native Speakers
Abstract
In this paper, we present an overview of CTC 2021, a Chinese text correction task for native speakers. We give detailed descriptions of the task definition and of the data used for training and evaluation. We also summarize the approaches investigated by the participants of this task. We hope the data sets collected and annotated for this task can facilitate and expedite future development in this research area. To this end, the pseudo training data, the gold-standard validation data, and the entire leaderboard are publicly available online at https://destwang.github.io/CTC2021-explorer/.
1 Introduction
Chinese text correction (CTC) is a challenging task in natural language processing that has attracted increasing attention in recent years. We organized the CTC competition in 2021, with a focus on text errors produced by native Chinese speakers. In particular, the task is defined as detecting the various errors in texts written by native speakers and returning the corrected texts.
Previous research on Chinese text errors has mainly studied texts written by Chinese-as-a-second-language (CSL) learners, with a focus on grammatical or spelling errors (Wu et al., 2013; Yu et al., 2014; Tseng et al., 2015; Zhao et al., 2018; Rao et al., 2020; Wang et al., 2021). However, most of the errors found in CSL learner texts seldom appear in texts written by native speakers. Therefore, CTC 2021 collects texts written by native Chinese speakers from the Internet to evaluate the performance of the trained models. These texts are more complex, the errors they contain are more diverse, including spelling errors, grammatical errors, and Chinese semantic errors, and the corresponding corrections are held to a more rigorous standard.
The goal of the task is to develop techniques to automatically detect and correct errors made by native Chinese speakers. We provide large-scale pseudo training data, and we release the validation data in which errors have been annotated by native speakers. Blind testing data is used to evaluate the outputs of the participating teams using a common scoring script and evaluation metric.
A total of 124 teams signed up for the task, 42 of them reached the final, and 20 of them submitted final systems. This overview paper provides detailed descriptions of the task and is organized as follows. Section 2 gives the task definition. Section 3 presents a detailed introduction to the data sets and our pseudo-data construction method. Section 4 describes the evaluation metrics, and Section 5 reports the results of the participants' approaches. Conclusions are drawn in Section 6.
2 Task Description
CTC 2021 evaluates Chinese text correction performance on Internet texts written by native Chinese speakers. The participants should detect and correct the errors in the given texts. Each character or punctuation mark occupies one position when counting locations, and locations are counted from 0. Each input instance is given a unique passage number "PID". If the text contains no errors, the checker should return "PID, -1". If an input text contains at least one error, the output format is "PID [, location, error type, incorrect word, correct word]+", where the symbol "+" indicates that the element "[, location, error type, incorrect word, correct word]" occurs one or more times. "Location" denotes the start location of the incorrect word. "Error type" falls into three coarse-grained categories and seven fine-grained categories, shown in Table 1. "Incorrect word" and "correct word" denote the contiguous incorrect characters and their corrected form, respectively. Table 2 presents some examples. In Ex. 1 there are two errors: the character 轮 should be 论, and the word 标识 should be 表示. In Ex. 2, the location -1 denotes that the text contains no error. In Ex. 3, the character 供 is missing at location 13, and the character 都 at location 27 is redundant.
Table 1: The three coarse-grained and seven fine-grained error categories.

| coarse-grained category | fine-grained category |
|---|---|
| spelling error | character error |
| spelling error | word error |
| grammatical error | missing error |
| grammatical error | redundant error |
| grammatical error | disordered error |
| Chinese semantic error | semantic repetition |
| Chinese semantic error | syntactic hybridity |
Table 2: Examples of input texts and the corresponding outputs.

| | input | output |
|---|---|---|
| Example 1 | PID=0011-1 关于瑞典时装公司HM拒绝使用新疆产品的言轮在华引发广泛声讨和抵制浪潮,有记者就此提问。华春莹标识: | PID=0011-1, 20, character error, 轮, 论, 46, word error, 标识, 表示, |
| Example 2 | PID=0011-2 新疆棉花是世界上最好的棉花之一,不用是相关企业的损失; | PID=0011-2, -1 |
| Example 3 | PID=0011-3 给老百姓包括少数民族群众提更多的就业机会,一般正常人都都会觉得是件好事。 | PID=0011-3, 13, missing error, , 供, 27, redundant error, 都, , |
| Example 4 | PID=0011-4 因为他们自己上历史真的就这么干了上百年,所以现在以己度人; | PID=0011-4, 6, disordered error, 上历史, 历史上, |
| Example 5 | PID=0023-1 对学校的未来发展,专家们提出了许多真知灼见的意见。 | PID=0023-1, 21, semantic repetition, 的意见, , |
| Example 6 | PID=0069-1 高速公路上交通事故的主要原因是司机违反交通规则或操作不当造成的。 | PID=0069-1, 29, syntactic hybridity, 造成的, , |
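To make the output format concrete, the following is a minimal Python sketch that parses one result line into structured edits. The `Edit` type and the function name are our own, chosen for illustration; the official scorer's internals are not specified here.

```python
from typing import List, NamedTuple, Tuple

class Edit(NamedTuple):
    location: int    # 0-based start position of the incorrect span
    error_type: str  # one of the seven fine-grained categories
    incorrect: str   # empty for a missing error
    correct: str     # empty for a redundant error

def parse_line(line: str) -> Tuple[str, List[Edit]]:
    """Parse one result line, e.g.
    'PID=0011-3, 13, missing error, , 供, 27, redundant error, 都, ,'."""
    fields = [f.strip() for f in line.split(",")]
    pid, rest = fields[0], fields[1:]
    if rest and rest[0] == "-1":  # the text contains no error
        return pid, []
    edits = []
    # each edit occupies four consecutive fields:
    # location, error type, incorrect word, correct word
    for i in range(0, len(rest) - len(rest) % 4, 4):
        loc, etype, wrong, right = rest[i:i + 4]
        edits.append(Edit(int(loc), etype, wrong, right))
    return pid, edits

pid, edits = parse_line("PID=0011-3, 13, missing error, , 供, 27, redundant error, 都, ,")
# -> ('PID=0011-3', [Edit(13, 'missing error', '', '供'),
#                    Edit(27, 'redundant error', '都', '')])
```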
3 Data Preparation
This section presents the pseudo training data, validation data, and testing data for our task. The texts used in the task were collected from the Internet and cover education, science, technology, and other domains; all of the collected texts were written by native speakers. Table 3 shows statistics of the data sets.
Table 3: Statistics of the data sets (# ErrText: number of texts containing errors; AvgLen: average text length).

| | # Texts | # ErrText | AvgLen |
|---|---|---|---|
| Train | 217,634 | 217,630 | 53.51 |
| Valid | 969 | 480 | 48.93 |
| Test | 967 | 466 | 50.63 |
3.1 Training data
We randomly select 217,634 texts from the collected Internet texts to create the pseudo training data. For each text, we randomly choose one or two positions at which to introduce an error. For the word at a chosen position, we either (1) replace it with a pinyin-similar word, a shape-similar word, or a random word; (2) delete it; (3) insert a word; or (4) swap it with an adjacent word. This yields texts with errors and their corresponding corrections; a simplified sketch of the procedure is shown below.
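The following Python sketch illustrates this corruption procedure under stated assumptions: the confusion sets, the vocabulary, and the sampling ratios are illustrative placeholders of our own, not the organizers' exact recipe.

```python
import random

def corrupt(words, pinyin_conf, shape_conf, vocab, rng=random):
    """Introduce one or two synthetic errors into a segmented text.

    pinyin_conf / shape_conf map a word to pinyin-similar / shape-similar
    words; these maps, the vocabulary list, and the sampling ratios below
    are assumptions for illustration only.
    """
    out = list(words)
    for _ in range(rng.choice([1, 2])):        # one or two error positions
        i = rng.randrange(len(out))
        r = rng.random()
        if r < 0.5:
            # (1) replace with a pinyin-similar, shape-similar, or random word
            cands = pinyin_conf.get(out[i]) or shape_conf.get(out[i]) or vocab
            out[i] = rng.choice(cands)
        elif r < 0.7 and len(out) > 1:
            del out[i]                          # (2) delete -> missing error
        elif r < 0.9:
            out.insert(i, rng.choice(vocab))    # (3) insert -> redundant error
        elif i + 1 < len(out):
            out[i], out[i + 1] = out[i + 1], out[i]  # (4) swap -> disordered error
    return out

# e.g. corrupt(["给", "老百姓", "提供", "更多", "的", "就业", "机会"],
#              {"提供": ["提拱"]}, {}, ["都", "件"])
```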
3.2 Validation and testing data
The validation data and testing data are constructed with the same pseudo-data method as the training data. In the end, we select 969 texts as validation data and 967 texts as testing data. To obtain gold edits for the errors, two annotators annotated these texts. The error type distribution of the validation data is shown in Table 4; the testing data has a similar distribution.
Table 4: Error type distribution of the validation data.

| error type | # errors |
|---|---|
| spelling error | 280 |
| grammatical error | 158 |
| Chinese semantic error | 100 |
4 Performance Metrics
Table 5 shows the confusion matrix used for performance evaluation. In the matrix, TP (True Positive) is the number of errors that are correctly identified by the checker; FP (False Positive) is the number of reported errors that do not actually exist; TN (True Negative) is the number of sentences without any errors that are correctly identified as such; FN (False Negative) is the number of errors that are not detected.
Table 5: Confusion matrix for evaluation.

| | System Positive | System Negative |
|---|---|---|
| Gold Positive | TP | FN |
| Gold Negative | FP | TN |
Correctness is determined at two levels; the error type does not affect the evaluation results.
(1) Detection level: all locations and incorrect words in a given text should be completely identical with the gold standard. Note that an error may be reported with a different location and incorrect word yet yield the same corrected text as the gold standard; such a detection is also counted as correct. For example, "PID=0011-3, 26, redundant error, 都, ," is equivalent to "PID=0011-3, 27, redundant error, 都, ,", since deleting either of the two adjacent 都 characters produces the same corrected text.
(2) Correction level: all locations, incorrect words, and corresponding corrections should be completely identical with the gold standard, or the corrected text should be completely identical with the gold-standard corrected text.
The following metrics are measured at both levels with the help of the confusion matrix:

$$\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad \mathrm{Recall} = \frac{TP}{TP+FN}, \qquad F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Finally, we calculate the overall F1 as a weighted combination of the detection-level F1 and the correction-level F1, defined as:

$$F_1^{\mathrm{overall}} = 0.8 \times F_1^{\mathrm{detection}} + 0.2 \times F_1^{\mathrm{correction}}$$
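For reference, here is a minimal Python sketch of these formulas; the function names are our own, and the official scoring script may be structured differently.

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def overall_f1(f1_detection: float, f1_correction: float) -> float:
    """Weighted combination of detection-level and correction-level F1."""
    return 0.8 * f1_detection + 0.2 * f1_correction
```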
Take, for example, the testing input with gold standards and the possible system output shown in Table 6; the evaluation tool will yield the following performance.
Table 6: Gold standard and system output for the example texts.

| Gold standard | System output |
|---|---|
| PID=0011-1, 20, character error, 轮, 论, 46, word error, 标识, 表示, | PID=0011-1, 20, character error, 轮, 语, |
| PID=0011-2, -1 | PID=0011-2, -1 |
| PID=0011-3, 13, missing error, , 供, 27, redundant error, 都, , | PID=0011-3, 26, redundant error, 都, , 32, character error, 件, 个, |
| PID=0011-4, 6, disordered error, 上历史, 历史上, | PID=0011-4, 6, redundant error, 上, , |
| PID=0023-1, 21, semantic repetition, 的意见, , | PID=0023-1, -1 |
| PID=0069-1, 29, syntactic hybridity, 造成的, , | PID=0069-1, 29, syntactic hybridity, 造成的, , |
Detection-level: TP = 3, FP = 2, FN = 4, so Precision = 3/5 = 0.6000, Recall = 3/7 ≈ 0.4286, and F1 = 0.5000.
Correction-level: TP = 2, FP = 3, FN = 5, so Precision = 2/5 = 0.4000, Recall = 2/7 ≈ 0.2857, and F1 ≈ 0.3333.
Overall: F1 = 0.8 × 0.5000 + 0.2 × 0.3333 ≈ 0.4667.
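These counts reflect our reading of Table 6: the system makes five detections, three of which match the gold standard (轮 in PID=0011-1, the redundant 都 in PID=0011-3, and 造成的 in PID=0069-1), and two of those three are also corrected exactly (轮 is mis-corrected to 语). Using the `prf` and `overall_f1` helpers sketched above:

```python
_, _, f1_det = prf(tp=3, fp=2, fn=4)   # detection level
_, _, f1_cor = prf(tp=2, fp=3, fn=5)   # correction level
print(round(f1_det, 4), round(f1_cor, 4),
      round(overall_f1(f1_det, f1_cor), 4))
# -> 0.5 0.3333 0.4667
```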
5 Evaluation Results
A total of 124 teams signed up for the task, 42 of them reached the final, and 20 of them submitted final systems. The 20 submitting teams are all from universities and industry in China. In the official testing phase, each participating team was allowed to submit at most three runs that adopt different models or parameter settings; the highest score among a team's runs was taken as its final score.
In total, we received 36 runs. Of the 20 submitting teams, 16 produced systems that ran successfully. Table 7 shows the submission statistics and final scores of these 16 teams. The results show that Chinese text correction for native speakers is a challenging task: there remain large gaps between the submitted systems and the gold standards. In detail, S&A achieves the best detection-level F1 score of 0.6800 and the best correction-level F1 score of 0.6460.
Table 7: Submission statistics and final scores of the 16 teams whose systems ran successfully.

| Team | # Runs | Detection Pre | Detection Rec | Detection F1 | Correction Pre | Correction Rec | Correction F1 | Overall F1 |
|---|---|---|---|---|---|---|---|---|
| S&A | 3 | 0.6869 | 0.6733 | 0.6800 | 0.6525 | 0.6396 | 0.6460 | 0.6732 |
| 改的都队 | 3 | 0.6890 | 0.5703 | 0.6241 | 0.6316 | 0.5228 | 0.5720 | 0.6137 |
| znv_sentosa | 1 | 0.4900 | 0.6277 | 0.5503 | 0.3833 | 0.4911 | 0.4306 | 0.5264 |
| C&L | 3 | 0.5927 | 0.4495 | 0.5113 | 0.5640 | 0.4277 | 0.4865 | 0.5063 |
| MDatai | 3 | 0.5584 | 0.4733 | 0.5123 | 0.5164 | 0.4376 | 0.4737 | 0.5046 |
| YCC | 3 | 0.4932 | 0.5030 | 0.4980 | 0.4233 | 0.4317 | 0.4275 | 0.4839 |
| NJU-NLP | 3 | 0.5448 | 0.4455 | 0.4902 | 0.4407 | 0.3604 | 0.3965 | 0.4715 |
| 四条人 | 3 | 0.5361 | 0.3386 | 0.4150 | 0.4608 | 0.2911 | 0.3568 | 0.4034 |
| ai编程的小拓 | 2 | 0.4648 | 0.3267 | 0.3837 | 0.3831 | 0.2693 | 0.3163 | 0.3702 |
| zybank | 1 | 0.4579 | 0.3228 | 0.3786 | 0.4017 | 0.2832 | 0.3322 | 0.3693 |
| 华夏-龙盈战队 | 2 | 0.2550 | 0.3267 | 0.2865 | 0.1947 | 0.2495 | 0.2188 | 0.2730 |
| yl_test | 1 | 0.4608 | 0.1861 | 0.2652 | 0.2941 | 0.1188 | 0.1693 | 0.2460 |
| 晓梦 | 1 | 0.3113 | 0.1584 | 0.2100 | 0.2101 | 0.1069 | 0.1417 | 0.1963 |
| only-one | 1 | 0.3650 | 0.1446 | 0.2071 | 0.2550 | 0.1010 | 0.1447 | 0.1946 |
| zndx纠错好难 | 1 | 0.3179 | 0.1228 | 0.1771 | 0.1744 | 0.0673 | 0.0971 | 0.1611 |
| CTC 2021 baseline | 1 | 0.3361 | 0.0792 | 0.1282 | 0.1849 | 0.0436 | 0.0705 | 0.1167 |
| DAWN | 1 | 0.0384 | 0.1802 | 0.0633 | 0.0190 | 0.0891 | 0.0313 | 0.0569 |
Table 8: Approaches and linguistic resources of the top 6 teams.

| Team | Approach | Linguistic Resources |
|---|---|---|
| S&A | | |
| 改的都队 | | |
| znv_sentosa | | |
| C&L | | |
| MDatai | | |
| YCC | | |
We collected the top 6 participants' reports of their systems. Table 8 summarizes their approaches and their usage of linguistic resources for this task. We can observe that most of the participants adopt pre-trained language models (e.g., Devlin et al., 2018; Liu et al., 2019; Clark et al., 2020), sequence tagging (Omelianchuk et al., 2020), multimodal information of the Chinese characters (Xu et al., 2021), sequence-to-sequence models (Vaswani et al., 2017), and rule-based models. In addition to the CTC 2021 pseudo training data, some public linguistic resources are used by participants, such as the Weixin corpus (https://github.com/nonamestreet/weixin_public_corpus), Wikimedia, SIGHAN, NLPCC 2018, Lang-8, and WuDaoCorpora 2.0 (https://resource.wudaoai.cn/home).
6 Conclusion
This paper provides an overview of CTC 2021, including its task design, data preparation, evaluation metrics, and evaluation results. The final results show that Chinese text correction for native speakers is still a challenging task that deserves more attention. To provide a good communication platform for researchers, industrial practitioners, and NLP enthusiasts, we have created the CTC 2021 leaderboard and released the pseudo training data and the gold-standard validation data. We hope this task can facilitate and expedite future development in this research area.
Acknowledgments
CTC 2021 was hosted by the Chinese Association for Artificial Intelligence, sponsored by iFLYTEK Co., Ltd., and organized by the Joint Laboratory of HIT and iFLYTEK Research (HFL).
References
- Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. arXiv preprint arXiv:2005.12592.
- Gaoqi Rao, Erhong Yang, and Baolin Zhang. 2020. Overview of NLPTEA-2020 shared task for Chinese grammatical error diagnosis. In Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pages 25–35.
- Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 32–37.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Baoxin Wang, Wanxiang Che, Dayong Wu, Shijin Wang, Guoping Hu, and Ting Liu. 2021. Dynamic connected networks for Chinese spelling check. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2437–2446.
- Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2517–2527.
- Shih-Hung Wu, Chao-Lin Liu, and Lung-Hao Lee. 2013. Chinese spelling check evaluation at SIGHAN bake-off 2013. In SIGHAN@IJCNLP, pages 35–42.
- Heng-Da Xu, Zhongli Li, Qingyu Zhou, Chao Li, Zizhen Wang, Yunbo Cao, Heyan Huang, and Xian-Ling Mao. 2021. Read, listen, and see: Leveraging multimodal information helps Chinese spell checking. arXiv preprint arXiv:2105.12306.
- Liang-Chih Yu, Lung-Hao Lee, Yuen-Hsien Tseng, and Hsin-Hsi Chen. 2014. Overview of SIGHAN 2014 bake-off for Chinese spelling check. In Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 126–132.
- Yuanyuan Zhao, Nan Jiang, Weiwei Sun, and Xiaojun Wan. 2018. Overview of the NLPCC 2018 shared task: Grammatical error correction. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 439–445. Springer.