Generate, Annotate, and Learn:
NLP with Synthetic Text
Abstract
Semi-Supervised Learning (SSL) has seen success in many application domains, but this success often hinges on the availability of task-specific unlabeled data. Knowledge distillation (KD) enables effective optimization of compact neural nets, achieving the best results when the knowledge of a large, expensive teacher network is distilled via fresh task-specific unlabeled data. However, task-specific unlabeled data can be challenging to find, especially for NLP tasks. We investigate the use of generative models as a source of unlabeled data and present a simple and general framework called “Generate, Annotate, and Learn (GAL)”. A language model (LM) is used to synthesize in-domain unlabeled data. Then, a classifier annotates this data with pseudo-labels. Finally, the synthetically generated data is used to improve SSL, KD, and few-shot learning in NLP. To obtain a strong task-specific LM, we either fine-tune a large LM on inputs from a given task, or prompt a large LM with a few input examples and conditionally generate more unlabeled examples. GAL yields a new state of the art for 6-layer transformers on the GLUE leaderboard and shows promise on tabular and vision tasks.
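The three-step pipeline described above (generate, annotate, learn) can be illustrated with a minimal sketch. The code below assumes the HuggingFace `transformers` library; the model names, prompt, and generation settings are illustrative choices, not the configurations used in the paper, and the final training step is only indicated in a comment.

```python
# Minimal sketch of the GAL loop: generate synthetic in-domain text, annotate it
# with pseudo-labels, then use it for SSL or KD. Models/prompts are assumptions.
from transformers import pipeline

# Step 1: Generate. An LM (here, off-the-shelf GPT-2 prompted with a task-style
# input) synthesizes in-domain unlabeled text.
generator = pipeline("text-generation", model="gpt2")
prompt = "The movie was"
synthetic = [
    out["generated_text"]
    for out in generator(prompt, max_length=40, num_return_sequences=8, do_sample=True)
]

# Step 2: Annotate. A classifier trained on the labeled task data assigns
# pseudo-labels to the synthetic examples.
annotator = pipeline(
    "text-classification", model="distilbert-base-uncased-finetuned-sst-2-english"
)
pseudo_labeled = [(text, annotator(text)[0]["label"]) for text in synthetic]

# Step 3: Learn. The pseudo-labeled synthetic data augments the labeled training
# set for SSL, or serves as transfer data when distilling into a compact student
# (training loop omitted in this sketch).
for text, label in pseudo_labeled:
    print(label, "|", text[:60])
```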