These authors contributed equally to this work.
Jungyeul Park [3]
[1] Department of Asian Studies, The Pennsylvania State University, 201 Old Main, University Park, PA 16802, USA
[2] Graduate School of Culture Technology, Korea Advanced Institute of Science & Technology, 291 Daehak-ro, Daejeon 34141, South Korea
[3] Department of Linguistics, The University of British Columbia, 2613 West Mall, Vancouver, BC V6T 1Z4, Canada
Enriching the Korean Learner Corpus with Multi-reference Annotations and Rubric-Based Scoring
Abstract
Despite growing global interest in Korean language education, learner corpora tailored to Korean L2 writing remain scarce. To address this gap, we enhance the KoLLA Korean learner corpus by adding multiple grammatical error correction (GEC) references, thereby enabling more nuanced and flexible evaluation of GEC systems and reflecting the variability of human language. Additionally, we enrich the corpus with rubric-based scores aligned with guidelines from the Korean National Language Institute, capturing grammatical accuracy, coherence, and lexical diversity. These enhancements make KoLLA a robust and standardized resource for research in Korean L2 education, supporting advances in language learning, assessment, and automated error correction.
keywords:
Korean learner corpus, Grammatical error correction, Multi-reference GEC, Rubric-based scoring, KoLLA corpus

1 Introduction
Written and spoken data produced by second language (L2) learners have long served as important resources for studying L2 acquisition. With the development of electronic collections of learner data—known as learner corpora—researchers now benefit from more systematic and scalable analysis methods. These corpora, often large and representative, offer a stronger empirical foundation than smaller, manually compiled datasets. They improve the reliability of research by enabling large-scale analysis of learner errors, interlanguage patterns, and developmental stages of acquisition [25]. In addition, advanced computational tools make it possible to perform quick and detailed analyses, facilitating broader investigations into L2 learning [14].
The increasing availability of learner corpora and reliable automated analytical tools has facilitated a broad range of empirical studies. These studies leverage corpus linguistic methods to investigate diverse aspects of L2 acquisition, such as error patterns, lexical diversity, and syntactic complexity. For instance, error editors and annotation tools allow researchers to insert detailed error information into text files, providing insights into the specific challenges learners face at various proficiency levels. Tools like the UCLouvain Error Editor and automatic annotation frameworks such as ERRANT have been instrumental in standardizing and automating the error annotation process across different languages [1, 2].
While learner corpus research has significantly advanced for widely taught languages like English and Chinese—particularly in the area of grammatical error correction (GEC)—Korean remains underrepresented in this domain. This gap is notable given the rapid global expansion of Korean language education, driven by the international popularity of Korean culture, including K-pop, dramas, and films. The growing number of Korean L2 learners highlights the urgent need for robust linguistic resources to support their language development. According to the Modern Language Association, Korean ranked among the top ten most-taught languages in U.S. higher education institutions between 2013 and 2016, underscoring its increasing global significance [16] (https://www.mla.org/content/download/191324/file/Enrollments-in-Languages-Other-Than-English-in-US-Institutions-of-Higher-Education-Fall-2021.pdf).
Existing Korean learner corpora, such as KoLLA (Korean Learner Language Annotations) and the KyungHee Korean Learner Corpus, have made notable contributions to this field. KoLLA, developed through open writing assignments by Korean L2 learners, has been used to study grammatical errors and learner writing performance [4, 12]. Despite these efforts, the scope and utility of these corpora remain limited due to the absence of multi-reference GEC data and rubric-based scoring aligned with Korean language education standards.
In this paper, we present an enhanced version of the existing KoLLA corpus, originally developed by Indiana University and Wellesley College. Our work systematically augments KoLLA in two key ways. First, we introduce a layer of multiple GEC references, transforming the dataset into a multi-reference GEC corpus. This design aligns with best practices in GEC research, where acknowledging linguistic variability across corrections is critical for robust system evaluation. Second, we add rubric-based scores using standardized criteria issued by the Korean National Language Institute. These rubric scores offer a more nuanced and structured approach to writing assessment, supporting comparative evaluation and instructional design.
The enriched KoLLA corpus consists of 100 essays written by Korean L2 learners, including both heritage and non-heritage speakers, at beginner and intermediate levels. These essays yield 1,419 segmented sentences, with an average of approximately 14 sentences per learner and one essay per participant. Each sentence is accompanied by two reference corrections, constituting a total of 2,838 GEC references, making KoLLA the first publicly available Korean learner corpus with multi-reference GEC and standardized rubric scores.
This paper introduces these enhancements, clarifies the rationale behind multi-reference annotations and rubric standardization, and outlines the procedures followed to ensure annotation quality and inter-annotator reliability. We further clarify that this work constitutes a significantly enriched version of the original KoLLA dataset, and should be considered a new release that builds upon and extends the earlier resource.
2 Previous Work
GEC and Automated Essay Scoring (AES) have been extensively researched for widely taught languages such as English and Chinese, leading to notable advancements in learner corpus development, grammatical error annotation [1, 17, 6], and the incorporation of holistic and rubric-based scoring frameworks [20, 9]. These studies have contributed resources for understanding the linguistic challenges faced by L2 learners and enhancing tools for language education. However, research and resources for Korean GEC and AES remain scarce, with only a limited number of learner corpora and studies dedicated specifically to the Korean language.
In the context of GEC, the development of annotated learner corpora has played a critical role in enabling detailed analyses of learner errors and supporting the development of correction systems. For instance, large-scale learner corpora such as the First Certificate in English (FCE) corpus, a subset of the Cambridge Learner Corpus (CLC) for English [23], and the Multi-Reference Multi-Source Evaluation Dataset (MuCGEC) for Chinese [26] have provided the foundation for building GEC systems that can address a wide range of grammatical issues. These corpora typically include corrections, with or without error annotations, allowing researchers to analyze error patterns across proficiency levels and develop targeted instructional strategies.
In the context of AES, the development of annotated datasets has been instrumental in facilitating the evaluation of writing proficiency and supporting the design of automated scoring systems. For example, the Automated Student Assessment Prize (ASAP) dataset has been a foundational resource in AES research. ASAP provides a collection of student essays scored according to rubrics across several prompts. This dataset has enabled researchers to train and evaluate models that assess the overall quality of essays, taking into account factors such as coherence, fluency, and vocabulary. To further enrich this resource, the ASAP++ dataset was introduced, enhancing the original ASAP dataset with additional essay attribute scores, such as sentence structure, word choice, and organization [18]. These fine-grained rubric score annotations allow for a more detailed analysis of writing attributes and enable systems to provide more nuanced feedback to learners. Like learner corpora in GEC, datasets such as ASAP and ASAP++ provide the foundation for empirical studies and practical applications in AES, supporting the development of automated scoring systems capable of assessing diverse linguistic features and addressing the needs of both learners and educators.
For Korean, however, relatively few studies have focused on developing learner corpora for GEC [24] and AES [16]. Among the notable exceptions are the KoLLA and Kyunghee Korean learner corpora, both of which have contributed to the field but remain limited in scope and scale. The KoLLA corpus, developed by Indiana University and Wellesley College, represents one of the earliest efforts to construct a Korean learner corpus with detailed error annotations. It comprises approximately 12,000 words and 1,419 sentences drawn from 100 essays, categorized by learner proficiency (beginner and intermediate) and learner type (heritage and non-heritage) [15]. KoLLA places particular emphasis on annotating errors in functional morphemes, such as postpositions and verbal endings, which are widely recognized as especially challenging for learners of Korean. The Kyunghee Korean learner corpus [19] is a significant large-scale resource containing 3,700 writing samples from Korean language learners across various proficiency levels. It offers insights into the linguistic features that characterize L2 learners' proficiency levels and has been employed in tasks such as POS tagging, language modeling, and syntactic parsing, contributing to a deeper understanding of learner language. However, the corpus lacks error annotation and multi-reference GEC data, both of which are crucial for advancing GEC and AES systems. Recent studies, such as Lim et al. [16], have explored its potential for scoring systems, but the corpus offers only holistic scores rather than detailed rubric-based scores. This limitation restricts its utility in capturing broader linguistic variability and developing more robust evaluation frameworks.
This study aims to address these limitations by enhancing the KoLLA corpus to create a more comprehensive resource for Korean GEC and AES. Specifically, we augment the corpus with multi-reference GEC and rubric scores aligned with the guidelines of the Korean National Language Institute. These enhancements position the KoLLA corpus as a robust and standardized evaluation dataset for Korean L2, bridging the gap between current resources and the increasing demand for advanced Korean learner corpus research.
3 Development and Annotation
This section introduces two key datasets that enhance the KoLLA learner corpus: a multi-reference GEC dataset and a rubric-based AES dataset. These additions provide insights into Korean L2 learning and assessment, addressing critical gaps in existing resources. The multi-reference GEC dataset incorporates multiple corrected versions for each sentence, capturing linguistic variability and enabling more comprehensive grammatical error analysis. The rubric-based AES dataset, aligned with the Korean National Language Institute guidelines, transitions from holistic scoring to detailed rubric-based evaluations, improving the reliability of L2 writing assessments. Together, these datasets establish the KoLLA corpus as a robust resource for advancing Korean L2 acquisition research, supporting innovations in language learning, teaching, and automated assessment.
3.1 Introduction to KoLLA
The KoLLA learner corpus addresses the challenge of annotating and analyzing Korean post-positional particle errors, a major difficulty for learners of Korean as an L2 [15]. These particles, akin to English prepositions, serve multiple functions, including marking grammatical roles (e.g., subject and object), semantic roles (e.g., agent and location), and discourse functions. The agglutinative nature of Korean and the tendency for particles to combine with nominals without whitespace complicate their correct use for learners. The KoLLA corpus annotates essays written by college-level Korean learners, grouped by proficiency (beginner and intermediate) and learner type (heritage and non-heritage). The annotation scheme is multi-layered, addressing spacing and spelling errors to ensure proper segmentation of words before particle errors are classified. Errors are categorized into four types: omission, substitution, addition, and ordering. To determine the correct particles for annotation, the corpus adopts the principle of “minimal interaction,” focusing on grammatical forms expected in formal Korean writing. Additionally, it accounts for cases where multiple particles may be equally valid, providing a set of possible correct forms for robust system evaluation. Developed using 100 learner essays and tools like EXMARaLDA [21], the corpus achieves high inter-annotator agreement rates, validating its guidelines and methodology. By supporting automatic particle error detection, KoLLA aims to advance the analysis of morphologically rich learner languages. The resource also contributes to broader computational linguistic efforts by offering insights into annotation best practices for error detection and language learning applications.
The dataset used in this study is based on the original KoLLA corpus, which consists of 100 essays written by Korean L2 learners. These essays were segmented into 1,419 individual sentences to align with the differing evaluation levels required by AES and GEC systems. While AES operates at the essay level, GEC is typically conducted at the sentence level, thus necessitating this segmentation.
To support multi-reference evaluation for GEC, each of the 1,419 sentences is accompanied by two human-generated corrections. The first correction is sourced from the original KoLLA corpus, while the second was newly created by a trained linguist following detailed annotation guidelines. This dual-reference design enhances the resource’s ability to capture the range of valid corrections—especially important in a morphologically rich language like Korean. For AES, two annotators independently assigned rubric-based scores to all 100 essays, evaluating dimensions such as content, coherence, grammatical accuracy, and vocabulary use.
The present version constitutes a substantial enrichment of the original KoLLA corpus, rather than an entirely new resource. While the learner-produced texts are drawn from the original KoLLA dataset, this version introduces additional multi-reference GEC annotations and double-coded AES scoring, following best practices in learner corpus construction and evaluation. The enriched KoLLA resource thus supports more comprehensive investigations into both grammatical error correction and essay scoring for Korean as a foreign language.
3.2 Multi-reference GEC corpus
The KoLLA corpus, consisting of open writing assignments from Korean L2 learners, serves as the foundation for our enhanced multi-reference GEC dataset. The original corpus includes 1,419 sentences written by beginner and intermediate-level learners, segmented from essays to facilitate sentence-level GEC annotation, given the difficulty of automatic boundary detection in the presence of learner errors.
To create a multi-reference GEC resource, we augmented the original KoLLA corrections with a second human-generated correction for each sentence. This design follows precedents in English and Chinese GEC corpora (e.g., FCE, MuCGEC), where multiple references are employed to capture the inherent ambiguity and variability of learner errors. For Korean, such variability is even more pronounced due to its morphosyntactic flexibility, honorific system, and case-marking ambiguity. In agglutinative languages, a single correction may fail to represent the full range of valid alternatives, potentially penalizing correct but divergent model outputs.
The second layer of annotation was produced by a trained linguist holding an M.A. in Korean Language Education. This annotator followed detailed guidelines to ensure fluency, preserve the learner’s intended meaning, maintain structural fidelity, and avoid unnecessary rewriting. Although the second set of corrections was provided by a single annotator, quality assurance focused on internal consistency through strict adherence to annotation guidelines.
The resulting dataset thus provides two reference corrections per sentence, offering a more realistic and flexible evaluation framework for Korean GEC. Multi-reference annotation improves the corpus’s ability to capture correction variability and supports more realistic evaluation settings for Korean GEC.
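The evaluation rationale behind multiple references can be sketched as a max-over-references score: a system output is compared against each reference and credited with its best match. The token-overlap F0.5 below is a simplified stand-in for edit-based GEC metrics (e.g., the M2 scorer), used here only for illustration; it is not the evaluation protocol of this corpus.

```python
# Sketch: max-over-references scoring, assuming a simplified
# token-overlap F0.5 as a proxy for an edit-based GEC metric.
from collections import Counter

def f05(hyp_tokens, ref_tokens):
    """Precision-weighted token-overlap F0.5 (illustrative proxy only)."""
    hyp, ref = Counter(hyp_tokens), Counter(ref_tokens)
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(hyp.values())
    r = overlap / sum(ref.values())
    return (1 + 0.5**2) * p * r / (0.5**2 * p + r)

def multi_ref_score(hypothesis, references):
    """Score against each reference and keep the best match, so a valid
    correction that diverges from one reference is not penalized."""
    return max(f05(hypothesis.split(), ref.split()) for ref in references)

# Two valid corrections of the same learner sentence (see Section 3.2).
refs = ["비행기 음식을 안 먹었습니다 .", "비행기 음식이 안 맞았습니다 ."]
hyp = "비행기 음식을 안 먹었습니다 ."
assert multi_ref_score(hyp, refs) == 1.0  # matches one reference exactly
```

With a single reference, the second (equally valid) correction would have been penalized; taking the maximum over both references removes that penalty.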
Grammatical error typology provides a principled framework for systematically identifying, classifying, and analyzing learner errors, serving as a foundation for building and evaluating GEC systems. Building on the original MRU framework (Missing, Replacement, and Unnecessary) introduced by ERRANT [1], we extend it for Korean by incorporating language-specific error types and annotation strategies, formalized in Algorithm 1.
A key extension involves the treatment of word boundary errors, which are prevalent in agglutinative languages such as Korean. We introduce two fine-grained categories: M:WB (missing word boundaries) and U:WB (unnecessary word boundaries). M:WB is applied when multiple lexical items are erroneously fused into a single token (typically due to omitted spacing), while U:WB is used when excessive spacing disrupts a morphosyntactically valid token.
Spelling errors are labeled as R:SPELL when a content word is incorrectly realized orthographically or phonologically. We further expand the framework to explicitly address errors involving functional morphemes, especially postpositions (ADP) and verbal endings (PART). If the associated content word is correct, such errors are marked according to their context: a missing postposition is annotated as R:POSi → POSj+ADP, an unnecessary postposition as R:POSi+ADP → POSj, and substitution of one postposition for another as R:POS+ADPi → POS+ADPj. When the associated content word is itself misspelled or incorrect, the error is subsumed under R:SPELL, and the morpheme is not annotated independently.
To mitigate segmentation mismatches and annotation inconsistencies, preprocessing was applied to normalize common variants in particle use and compound verb boundaries. This step reduced word boundary misalignment errors to under 5%.
Finally, word order errors are annotated as R:ORDER when the source and target sentences contain the same lexical items but differ in sequence, indicating syntactic reordering without lexical substitution.
Algorithm 1 formalizes this annotation logic and accounts for the structural and orthographic complexities characteristic of Korean grammatical errors.
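As a minimal sketch of the decision order just described, the following operates on surface tokens only; the POS- and morpheme-level conditions of the full scheme (e.g., the R:POS+ADP labels) are deliberately omitted, and the function name is illustrative, not part of the released annotation tooling.

```python
# Hedged sketch of the word-boundary / spelling / ordering decisions
# described above. Token-level heuristics only; the actual procedure
# additionally consults POS tags and morpheme analyses.

def classify_edit(src_tokens, tgt_tokens):
    """Assign a coarse error label to an aligned (source, target) span."""
    src, tgt = list(src_tokens), list(tgt_tokens)
    if sorted(src) == sorted(tgt) and src != tgt:
        return "R:ORDER"        # same lexical items, different sequence
    if len(src) == 1 and len(tgt) > 1 and "".join(tgt) == src[0]:
        return "M:WB"           # one fused token split into several
    if len(tgt) == 1 and len(src) > 1 and "".join(src) == tgt[0]:
        return "U:WB"           # excessive spacing collapsed into one token
    if len(src) == len(tgt) == 1:
        return "R:SPELL"        # content-word respelling (simplified)
    return "R:OTHER"            # fallback; the real scheme is finer-grained

assert classify_edit(["비행기음식이"], ["비행기", "음식이"]) == "M:WB"
assert classify_edit(["안", "먹었습니다"], ["먹었습니다", "안"]) == "R:ORDER"
```

The ordering of the checks matters: reordering is tested before word-boundary cases so that a pure permutation is never misread as a fusion or split.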
Figure 1 presents annotations from two annotators for the sentence “비행기 음식이 안 막였습니다” (‘The airplane food didn’t …’). The noun phrase 음식이 eumsig-i (‘food.nom’) is replaced with 음식을 eumsig-eul (‘food.acc’), annotated as R:NOUN+ADP → NOUN+ADP, reflecting a case marker substitution from nominative to accusative. The verb 막였습니다 magyeossseubnida, a misspelling based on phonetic and orthographic similarity, is corrected by Annotator 0 as 먹었습니다 meogeossseubnida (‘ate’) and annotated as R:SPELL. Annotator 1, however, interprets the intended meaning differently and replaces the verb with 맞았습니다 maj-assseubnida (‘agreed’), also marked as R:SPELL.
This example highlights a central challenge in grammatical error annotation: distinguishing true grammatical errors from alternative, semantically divergent but grammatically valid expressions. While both corrections are linguistically plausible, they lead to different interpretations of the sentence—either I didn’t eat the airplane food or The airplane food didn’t agree with me—underscoring the importance of consistency and clarity in annotation guidelines.
S 비행기 음식이 안 막였습니다 .
A 1 2|||R:NOUN+ADP→NOUN+ADP|||음식을|||REQUIRED|||-NONE-|||0
A 3 4|||R:SPELL|||먹었습니다|||REQUIRED|||-NONE-|||0
A 3 4|||R:SPELL|||맞았습니다|||REQUIRED|||-NONE-|||1
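A consumer of these M2-style annotations can recover each annotator's reference independently by grouping edits on the final (annotator id) field. The sketch below, with a hypothetical `parse_m2` helper mirroring the example above, shows one way; it is an illustration, not released tooling.

```python
# Sketch: parsing M2-style annotation lines and reconstructing each
# annotator's corrected sentence from the shared source.
from collections import defaultdict

m2_block = """S 비행기 음식이 안 막였습니다 .
A 1 2|||R:NOUN+ADP→NOUN+ADP|||음식을|||REQUIRED|||-NONE-|||0
A 3 4|||R:SPELL|||먹었습니다|||REQUIRED|||-NONE-|||0
A 3 4|||R:SPELL|||맞았습니다|||REQUIRED|||-NONE-|||1"""

def parse_m2(block):
    """Return the source tokens and a dict of edits keyed by annotator id."""
    lines = block.strip().split("\n")
    source = lines[0][2:].split()
    by_annotator = defaultdict(list)
    for line in lines[1:]:
        span, etype, repl, _req, _none, coder = line[2:].split("|||")
        start, end = map(int, span.split())
        by_annotator[int(coder)].append((start, end, etype, repl))
    return source, dict(by_annotator)

def apply_edits(source, edits):
    """Apply (start, end, type, replacement) edits right-to-left so that
    earlier spans keep their original offsets."""
    tokens = list(source)
    for start, end, _etype, repl in sorted(edits, reverse=True):
        tokens[start:end] = repl.split()
    return " ".join(tokens)

source, refs = parse_m2(m2_block)
assert apply_edits(source, refs[0]) == "비행기 음식을 안 먹었습니다 ."
assert apply_edits(source, refs[1]) == "비행기 음식이 안 맞았습니다 ."
```

Applying edits right-to-left is the standard trick for span-indexed edit lists: replacing a later span never shifts the indices of an earlier one.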
These annotation strategies and enhancements make the KoLLA corpus a robust and linguistically informed resource for Korean GEC. By addressing the structural nuances of the language and capturing correction variability, this enriched dataset offers a valuable foundation for developing and evaluating GEC systems tailored to Korean L2 writing.
3.3 Rubric-based scoring dataset
The relationship between first language (L1) and L2 writing has been extensively explored by L2 writing researchers [11, 13, 22]. Recent studies have found that although L1 and L2 writing differ in organizational patterns, overall quality, and grammatical errors, there are also significant similarities between the two [10]. For example, Fathman and Kobayashi [5] found that many Japanese learners of English used a specific-to-general (inductive) organizational pattern, which contrasts with American L1 students’ general-to-specific (deductive) writing style. However, advanced Japanese ESL students tended to favor the deductive pattern, indicating that L2 writing experience or a certain proficiency level leads to the adoption of organizational patterns preferred in L1 English writing. Similarly, Hirose (2003) compared L1 (Japanese) and L2 (English) organizational patterns in the argumentative writing of Japanese EFL student-writers. The study revealed that a majority of students employed deductive organizational patterns in both L1 and L2, challenging the stereotype that Japanese writers naturally favor inductive structures. This finding suggests that with increased proficiency and exposure to L2 writing conventions, learners may adopt organizational patterns aligned with those of native speakers. Other studies have maintained that not only the quality of writing but also the process of writing may be similar between L1 and L2 writing [3], thus justifying the extensive use of L1 process-oriented models of writing in investigating L2 writing [7, 8].
Following these findings, this study employs a rubric traditionally used to assess Korean L1 speakers’ writing samples for evaluating Korean L2 learners’ writing. The rubric includes three overarching categories—expression, structure, and content—each with multiple subcategories that provide clear, operationalized criteria for assessment. This rubric is widely used in Korean language education settings and was originally developed and refined by experts in L1 writing assessment to ensure clarity, objectivity, and comprehensiveness in evaluating student performance.
The first category, expression, includes three subcategories: grammatical accuracy, appropriateness of vocabulary use, and appropriateness of expressions. Grammatical accuracy evaluates the correctness of sentence structures, ensuring that sentences are grammatically correct. Appropriateness of vocabulary use measures whether the vocabulary used is suitable for the given context and whether there is a variety of vocabulary used throughout the text. Appropriateness of expressions assesses the variety and suitability of expressions and whether the length of sentences is appropriate.
The second category, structure, consists of four subcategories: appropriateness of internal structure, organization of essay structure, coherency of paragraphs, and appropriateness of length. Appropriateness of internal structure measures how cohesively the thesis and supporting sentences are connected. Organization of essay structure assesses whether the essay includes introduction, body, and conclusion paragraphs, and whether the length of each paragraph is appropriate. Coherency of paragraphs evaluates the use of conjunctions and transition sentences to ensure smooth flow between ideas. Appropriateness of length checks whether the essay meets the required word limit.
The third category, content, includes clarity of topic, detailed explanation, and creativity of thought. Clarity of topic measures whether the topic is clearly presented in the essay and whether each sentence relates to the topic. Detailed explanation assesses whether the explanation is detailed and varied. Creativity of thought examines the uniqueness, logic, and presentation of new and different viewpoints.
Table 1 shows the full rubric used for scoring, including the three main categories—expression, structure, and content—and their respective subcategories and descriptions.
Category | Subcategories | Descriptions
Expression | Grammatical Accuracy | Correct use of sentence structures.
 | Appropriateness of Vocabulary Use | Vocabulary suitability and variety.
 | Appropriateness of Expressions | Variety and naturalness of expressions; sentence length appropriateness.
Structure | Appropriateness of Internal Structure | Coherence between thesis and supporting sentences.
 | Organization of Essay Structure | Presence and proportionality of introduction, body, and conclusion.
 | Coherency of Paragraphs | Logical flow and use of discourse markers.
 | Appropriateness of Length | Adherence to required word count.
Content | Clarity of Topic | Clear presentation of the topic and relevance of content.
 | Detailed Explanation | Depth and variety in explanation.
 | Creativity of Thought | Originality, logical structure, and new perspectives.
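To make the rubric's three-level structure concrete, the sketch below encodes the categories and subcategories of Table 1 as a data structure and aggregates per-subcategory scores. The 1-5 scale, the subcategory key names, and the equal weighting are assumptions for illustration only; they are not prescribed by the scoring guidelines.

```python
# Sketch of the Table 1 rubric as a data structure, with hypothetical
# subcategory scores (assumed 1-5 scale) rolled up into category totals.
RUBRIC = {
    "expression": ["grammatical_accuracy", "vocabulary_use", "expressions"],
    "structure": ["internal_structure", "essay_organization",
                  "paragraph_coherency", "length"],
    "content": ["topic_clarity", "detailed_explanation", "creativity"],
}

def aggregate(scores):
    """Sum subcategory scores into per-category scores and a total."""
    by_category = {
        cat: sum(scores[sub] for sub in subs) for cat, subs in RUBRIC.items()
    }
    return by_category, sum(by_category.values())

# Hypothetical essay scored 3 on every subcategory.
sample = {sub: 3 for subs in RUBRIC.values() for sub in subs}
by_cat, total = aggregate(sample)
assert by_cat["structure"] == 12 and total == 30
```

Keeping subcategory scores separate rather than storing only a holistic total is precisely what enables the fine-grained analyses described in this section.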
To ensure that the application of this rubric constituted a standardized assessment procedure, we implemented a multi-step annotator training and calibration process. Two annotators with M.A. degrees in Korean Language Education and extensive teaching and rating experience independently assessed student essays using the rubric. Prior to rating, they underwent two training sessions led by the researcher. In the first session, the rubric categories were thoroughly explained and five sample essays from each proficiency level were collaboratively scored. Discrepancies were discussed in depth until consensus was reached, thereby aligning interpretations of the rubric criteria. In the second session, annotators independently scored essays and reconvened with the researcher to resolve any inconsistencies in interpretation through discussion and clarification.
This process mirrors best practices in standardized assessment, particularly in language testing, where inter-annotator reliability, operationalized rubrics, and calibration sessions are critical for maintaining scoring consistency across annotators and contexts. After independent grading, inter-annotator agreement was measured using Cohen’s kappa.
Table 2 presents the results of Cohen’s kappa, a statistical measure of inter-annotator agreement. This measure accounts for agreement occurring by chance and is commonly used in annotation tasks to evaluate the consistency and reliability of multiple annotators. In our study, we compute Cohen’s kappa across four learner groups: foreign beginner (FB), foreign intermediate (FI), heritage beginner (HB), and heritage intermediate (HI), along with the overall agreement across all groups. The results show strong agreement across all groups, with the highest agreement observed for foreign beginners (0.8253) and the lowest for heritage intermediate learners (0.7245), likely reflecting the relative complexity and fluency of sentences in this group. Overall, the aggregated kappa score of 0.8241 indicates high inter-annotator consistency in our error annotation framework.
Taken together, the use of a detailed, pre-established rubric, combined with structured annotator training and verification of inter-annotator reliability, meets key criteria for a standardized assessment procedure. This provides a strong foundation for using the resulting dataset in further research on L2 Korean writing development.
FB | FI | HB | HI | All
0.8253 | 0.7921 | 0.7971 | 0.7245 | 0.8241
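For reference, Cohen's kappa corrects raw agreement for the agreement expected by chance. The sketch below computes it for two raters' categorical scores; the rater values shown are toy data, not the actual KoLLA annotations behind Table 2.

```python
# Sketch: Cohen's kappa for two raters assigning categorical scores.
from collections import Counter

def cohens_kappa(a, b):
    """(observed - expected) / (1 - expected), where expected agreement
    is derived from each rater's marginal label distribution."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / n**2
    return (observed - expected) / (1 - expected)

# Toy rubric scores from two hypothetical raters (7 of 8 agree).
rater1 = [3, 4, 4, 2, 5, 3, 4, 2]
rater2 = [3, 4, 3, 2, 5, 3, 4, 2]
kappa = cohens_kappa(rater1, rater2)
assert 0.8 < kappa < 0.9  # high agreement, comparable to Table 2's range
```

Note that kappa is undefined when expected agreement is 1 (both raters always assign the same single label); production code should guard that case.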
In summary, the adoption of a detailed rubric and a carefully implemented annotation protocol ensures a high level of scoring reliability and transparency. These procedures establish the rubric-scored KoLLA corpus as a valuable benchmark for evaluating Korean L2 writing proficiency and developing automated essay scoring systems grounded in pedagogically meaningful criteria.
4 Findings and Discussion
The augmented KoLLA corpus reveals valuable insights into both learner error patterns and the pedagogical potential of rubric-based writing evaluation. This section discusses these findings in two parts: (i) the implications of multi-reference GEC in capturing linguistic variability in Korean, and (ii) the significance of rubric-based scoring for evaluating overall writing quality and learner proficiency.
4.1 Multi-reference grammatical error annotation
The enhancements to the KoLLA corpus provide insights into grammatical error patterns in Korean L2 writing, with particular emphasis on the benefits of multi-reference GEC. To the best of our knowledge, the multi-reference KoLLA corpus represents the first resource of this kind specifically tailored for Korean L2. It addresses inherent linguistic variability in learner writing, thereby enabling more comprehensive and contextually appropriate evaluations.
Single-reference GEC datasets may overlook natural variability in learner language, potentially penalizing alternative yet valid corrections. In contrast, multi-reference annotations capture the diversity of valid grammatical structures, which is particularly critical given Korean’s syntactic and morphological flexibility. Korean’s syntactic flexibility frequently allows multiple grammatical structures or synonyms to express equivalent meanings without compromising grammatical correctness. Consequently, corrections for case markers, verb endings, and compound verbs often exhibit considerable variability, each reflecting a valid linguistic choice dependent on discourse context.
Lexical flexibility also contributes notably to correction diversity, with many corrections allowing synonymous substitutions without altering intended meaning. This characteristic underscores the importance of flexible evaluation frameworks in GEC systems, prioritizing meaning preservation rather than strict grammatical conformity.
In subsequent analyses, we plan to further investigate specific instances of word-boundary inconsistencies and lexical alternatives by conducting a quantitative error analysis, examining correlations with proficiency levels and evaluating how these variations affect automated correction model performance.
4.2 Rubric-based scoring for writing assessment
Beyond grammatical error correction, the introduction of rubric-based scoring provides a broader assessment framework for evaluating Korean L2 writing proficiency. Unlike GEC, which focuses primarily on grammatical accuracy at the sentence level, rubric-based scoring assesses higher-level attributes of writing quality, including coherence, structure, and content development. By incorporating rubric scores aligned with the Korean National Language Institute guidelines, the KoLLA corpus establishes a standardized framework for evaluating L2 writing performance.
Analysis of the rubric scores indicates a strong correlation between grammatical accuracy and overall writing proficiency. Essays with fewer grammatical errors consistently received higher rubric scores, suggesting that error-free writing contributes to improved coherence and clarity. This trend underscores the interdependence of grammatical competence and discourse-level writing skills. Additionally, lexical diversity emerges as a distinguishing factor among proficiency levels. Essays that exhibit a wider range of vocabulary usage tend to receive higher scores, reinforcing the role of lexical variation in language proficiency assessment.
Structural complexity is another key indicator of writing ability, as higher-scoring essays demonstrate stronger paragraph organization, better use of discourse markers, and smoother transitions between ideas. However, despite improvements in grammar and structure at higher proficiency levels, rubric scores reveal that certain discourse-level errors persist even among intermediate learners. Challenges in logical sequencing, underdeveloped arguments, and inappropriate rhetorical structures suggest that targeted instructional interventions are needed to address discourse-level deficiencies in L2 writing.
The rubric-based scoring also highlights differences in proficiency levels across various essay types. Argumentative essays tend to exhibit more discourse errors, particularly in structuring logical arguments and supporting claims with appropriate evidence. In contrast, personal essays often score higher in fluency and coherence but may suffer from lexical repetition. These distinctions indicate that different types of writing require tailored instructional approaches to address specific linguistic challenges.
4.3 Implications and future directions
The integration of multi-reference GEC and rubric-based scoring in the KoLLA corpus has significant implications for both Korean L2 learning and the development of automated NLP systems. The dataset provides detailed insights into common learner errors, allowing language educators to design targeted curricula that address frequent grammatical challenges such as postposition omission, honorific mismatches, and discourse structure issues. The refined annotation framework offers a valuable resource for the development of instructional materials aimed at improving grammatical competence and writing fluency.
For GEC model development, the inclusion of multi-reference annotations reduces penalization for alternative but grammatically acceptable corrections, making the corpus more suitable for training flexible and context-aware correction models for Korean. By incorporating diverse correction strategies, these models can better accommodate the inherent linguistic variability present in learner writing. Moreover, the multi-reference dataset provides a benchmark for evaluating future GEC systems, ensuring that corrections account for multiple valid grammatical structures rather than rigidly enforcing a single reference.
In the context of automated essay scoring (AES), rubric-based scores serve as a foundational dataset for training AI-driven assessment models that align closely with human grading standards. By integrating grammatical, lexical, and discourse-level features, these models can provide more nuanced feedback to learners, assisting in both formative and summative assessments of writing proficiency. The incorporation of rubric-based assessments into computational models presents an opportunity to develop more sophisticated evaluation systems capable of providing personalized and structured feedback.
Future research should explore the scalability of multi-reference GEC datasets by incorporating additional annotators to capture a wider range of linguistic variations. Expanding the corpus with discourse-level error annotations could further enhance its utility for both GEC and AES research. Additionally, the integration of rubric scoring into neural network-based assessment models offers promising directions for improving automated feedback systems in Korean L2 writing instruction.
Overall, the findings from this study emphasize the necessity of incorporating both multi-reference grammatical error detection and rubric-based scoring in Korean learner corpora. These enhancements provide a more comprehensive framework for analyzing L2 writing, bridging the gap between error correction, writing assessment, and instructional design.
4.4 Summary of findings
The enhanced KoLLA corpus demonstrates the value of integrating both multi-reference grammatical error annotations and rubric-based scoring for a more holistic evaluation of Korean L2 writing. Multi-reference GEC captures the linguistic variability and ambiguity intrinsic to Korean, while rubric-based scoring offers a structured and pedagogically grounded method for assessing broader aspects of writing proficiency. Together, these resources support both computational development and instructional practice, paving the way for more accurate, fair, and informative evaluation of learner language.
5 Conclusion
This study enhances the KoLLA Korean learner corpus by incorporating multi-reference GEC annotations and rubric-based scoring, addressing critical gaps in Korean L2 writing research. By transforming KoLLA into a multi-reference GEC corpus, we account for the inherent variability in grammatical correction, allowing for more robust evaluations of error patterns and correction strategies. The refined annotation scheme introduces a more structured and linguistically informed approach to error categorization, improving the consistency and accuracy of error analysis. Additionally, the integration of rubric-based scoring provides a standardized framework for assessing writing proficiency, bridging the gap between holistic and detailed linguistic evaluations.
Our findings highlight key grammatical challenges faced by Korean L2 learners, with frequent errors in postpositions, verb conjugations, and word-boundary segmentation. The rubric-based assessments reveal strong correlations between grammatical accuracy, lexical diversity, and overall writing quality, reinforcing the importance of GEC in L2 proficiency development. These enhancements establish KoLLA as a foundational resource for advancing research in language learning, pedagogy, and computational linguistics. The dataset provides insights for educators seeking to design targeted instructional interventions and for researchers developing more effective GEC and automated writing assessment systems.
Several promising directions emerge from this work. Further research into AI-driven rubric scoring systems may refine automated assessment methodologies, enhancing the capacity to provide personalized feedback to learners. Expanding the multi-reference GEC framework to other languages with complex morphological structures could yield broader cross-linguistic insights into second-language acquisition. Moreover, integrating this enriched dataset into AI-powered learning platforms has the potential to facilitate adaptive language instruction, making automated feedback more accessible and effective for Korean L2 learners.
By addressing both linguistic and computational challenges, this study contributes to the ongoing development of Korean learner corpora, reinforcing the importance of high-quality, systematically annotated data in language learning and assessment. The enriched KoLLA corpus serves as a benchmark for future research, paving the way for more sophisticated tools that support Korean L2 learners in achieving greater writing proficiency.
Data Availability Statement
The dataset is available via DOI at https://doi.org/10.5281/zenodo.15287129 and is released under the GNU General Public License Version 3, consistent with the licensing of the original dataset.
Acknowledgments
This research was supported by the 2024 Korean Studies Grant Program of the Academy of Korean Studies (AKS-2024-R-094), awarded to Jungyeul Park.
References
- Bryant, C., Felice, M., & Briscoe, T. (2017). Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction. In R. Barzilay & M.-Y. Kan (Eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 793–805). Vancouver, Canada: Association for Computational Linguistics. https://aclanthology.org/P17-1074
- Bryant, C. J. (2019). Automatic annotation of error types for grammatical error correction (Doctoral thesis). University of Cambridge, Churchill College, Cambridge, UK. https://doi.org/10.17863/CAM.40832
- Davoodifard, M. (2022). An Overview of Writing Process Research: Using Innovative Tasks and Techniques for a Better Understanding of L2 Writing Processes in Assessment Contexts. Studies in Applied Linguistics & TESOL at Teachers College, Columbia University, 21(2), 1–20. https://doi.org/10.52214/salt.v21i2.8759
- Dickinson, M., Israel, R., & Lee, S.-H. (2010). Building a Korean Web Corpus for Analyzing Learner Language. In Proceedings of the 6th Workshop on the Web as Corpus (WAC-6) (pp. 8–16). Los Angeles. http://jones.ling.indiana.edu/~mdickinson/papers/dickinson-israel-lee10.html
- Fathman, A., & Kobayashi, H. (1984). Rhetorical Patterns in English and Japanese. TESOL Quarterly, 18(4), 737–738. https://doi.org/10.2307/3586587
- Gu, Y., Huang, Z., Zeng, M., Qiu, M., & Park, J. (2025). Improving Automatic Grammatical Error Annotation for Chinese Through Linguistically-Informed Error Typology. In O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, & S. Schockaert (Eds.), Proceedings of the 31st International Conference on Computational Linguistics (pp. 2781–2798). Abu Dhabi, UAE: Association for Computational Linguistics. https://aclanthology.org/2025.coling-main.189/
- Hayes, J. R. (2000). A new framework for understanding cognition and affect in writing. Newark, DE: International Reading Association.
- Hayes, J. R. (2012). Modeling and Remodeling Writing. Written Communication, 29(3), 369–388. https://doi.org/10.1177/0741088312451260
- He, Y., Jiang, F., Chu, X., & Li, P. (2022). Automated Chinese Essay Scoring from Multiple Traits. In Proceedings of the 29th International Conference on Computational Linguistics (pp. 3007–3016). Gyeongju, Republic of Korea: International Committee on Computational Linguistics. https://aclanthology.org/2022.coling-1.266
- Hirose, K. (2001). Persuasive Writing in L1 and L2: A Look at Japanese EFL Students’ Rhetorical Organization Strategies. JACET Bulletin, 33, 43–56. https://ndlsearch.ndl.go.jp/books/R000000004-I5707626
- Hirose, K. (2003). Comparing L1 and L2 organizational patterns in the argumentative writing of Japanese EFL students. Journal of Second Language Writing, 12(2), 181–209. https://doi.org/10.1016/S1060-3743(03)00015-8
- Israel, R., Dickinson, M., & Lee, S.-H. (2013). Detecting and Correcting Learner Korean Particle Omission Errors. In Proceedings of the Sixth International Joint Conference on Natural Language Processing (pp. 1419–1427). Nagoya, Japan: Asian Federation of Natural Language Processing. https://aclanthology.org/I13-1199
- Keck, C. (2006). The use of paraphrase in summary writing: A comparison of L1 and L2 writers. Journal of Second Language Writing, 15(4), 261–278. https://doi.org/10.1016/j.jslw.2006.09.006
- Le Bruyn, B., & Paquot, M. (2021). Learner Corpus Research Meets Second Language Acquisition. Cambridge: Cambridge University Press.
- Lee, S.-H., Dickinson, M., & Israel, R. (2012). Developing Learner Corpus Annotation for Korean Particle Errors. In Proceedings of the Sixth Linguistic Annotation Workshop (pp. 129–133). Jeju, Republic of Korea: Association for Computational Linguistics. http://www.aclweb.org/anthology/W12-3617
- Lim, K., Song, J., & Park, J. (2023). Neural automated writing evaluation for Korean L2 writing. Natural Language Engineering, 29(5), 1341–1363. https://doi.org/10.1017/S1351324922000298
- Lin, S. (2024). Evaluating LLMs’ grammatical error correction performance in learner Chinese. PLOS ONE, 19(10), 1–18. https://doi.org/10.1371/journal.pone.0312881
- Mathias, S., & Bhattacharyya, P. (2018). ASAP++: Enriching the ASAP Automated Essay Grading Dataset with Essay Attribute Scores. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 1169–1173). Miyazaki, Japan: European Language Resources Association (ELRA). https://aclanthology.org/L18-1187
- Park, J., & Lee, J. (2016). A Korean Learner Corpus and its Features. Journal of the Linguistic Society of Korea, 75, 69–85. https://doi.org/10.17290/jlsk.2016..75.69
- Ridley, R., He, L., Dai, X.-y., Huang, S., & Chen, J. (2021). Automated Cross-prompt Scoring of Essay Traits. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15), 13745–13753. https://ojs.aaai.org/index.php/AAAI/article/view/17620
- Schmidt, T. (2010). Linguistic tool development between community practices and technology standards. In G. Budin, P. Wittenburg, L. Romary, & T. Declerck (Eds.), Proceedings of the Workshop on Language Resource and Language Technology Standards (pp. 69–72). Malta: ELRA.
- van Weijen, D., Rijlaarsdam, G., & van den Bergh, H. (2019). Source use and argumentation behavior in L1 and L2 writing: a within-writer comparison. Reading and Writing, 32(6), 1635–1655. https://doi.org/10.1007/s11145-018-9842-9
- Yannakoudakis, H., Briscoe, T., & Medlock, B. (2011). A New Dataset and Method for Automatically Grading ESOL Texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 180–189). Portland, Oregon, USA: Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1019
- Yoon, S., Park, S., Kim, G., Cho, J., Park, K., Kim, G. T., … Oh, A. (2023). Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 6713–6742). Toronto, Canada: Association for Computational Linguistics. https://aclanthology.org/2023.acl-long.371
- Zhang, Y., & Fu, X. (2024). Review of Collocations, Corpora and Language Learning. Corpus Pragmatics. https://doi.org/10.1007/s41701-024-00168-z
- Zhang, Y., Li, Z., Bao, Z., Li, J., Zhang, B., Li, C., … Zhang, M. (2022). MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 3118–3130). Seattle, United States: Association for Computational Linguistics. https://aclanthology.org/2022.naacl-main.227
Appendix A Dataset Description
The enhanced KoLLA dataset comprises two primary components tailored to the distinct goals of grammatical error correction (GEC) and automated essay scoring (AES):
1. GEC M2 File: This file follows the standardized M2 format commonly used in grammatical error annotation tasks. It contains pairwise alignments between original learner sentences and their corrected forms, annotated with detailed error tags based on the extended MRU typology adapted for Korean. Each sentence is associated with two reference corrections, allowing for multi-reference evaluation. Annotations encode the location, type, and nature of each error, capturing both surface-level and structural changes.
2. AES .csv Files for FB, FI, HB, and HI: These files contain rubric-based scoring data for four learner groups: foreign beginners (FB), foreign intermediates (FI), heritage beginners (HB), and heritage intermediates (HI). Each CSV file includes 250 columns representing 10 rubric dimensions (Table 1) across 25 essays. The scoring framework is based on the Korean National Language Institute’s writing evaluation criteria, encompassing expression, structure, and content categories. Each file contains two rows corresponding to independent ratings by two trained annotators. This format allows for inter-rater reliability analysis and enables fine-grained evaluation of writing performance across demographic and proficiency subgroups.