Generate, Annotate, and Learn:
NLP with Synthetic Text
Abstract
Semi-Supervised Learning (SSL) has seen success in many application domains, but this success often hinges on the availability of task-specific unlabeled data. Knowledge distillation (KD) enables effective optimization of compact neural nets, achieving the best results when the knowledge of a large, expensive teacher network is distilled via fresh task-specific unlabeled data. However, task-specific unlabeled data can be challenging to find, especially for NLP tasks. We investigate the use of generative models as a source of unlabeled data and present a simple and general framework called “Generate, Annotate, and Learn (GAL)”. A language model (LM) is used to synthesize in-domain unlabeled data. Then, a classifier annotates this data with pseudo-labels. Finally, the synthetically generated data is used to improve SSL, KD, and few-shot learning in NLP. To obtain a strong task-specific LM, we either fine-tune a large LM on inputs from a given task, or prompt a large LM with a few input examples and conditionally generate more unlabeled examples. GAL yields a new state of the art for 6-layer transformers on the GLUE leaderboard and shows promise on tabular and vision tasks.
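The three-step pipeline described above (generate, annotate, learn) can be illustrated with a minimal sketch. The code below assumes the HuggingFace `transformers` library; the model names, prompt, and generation settings are illustrative choices, not the configurations used in the paper, and the final training step is only indicated in a comment.

```python
# Minimal sketch of the GAL loop: generate synthetic in-domain text, annotate it
# with pseudo-labels, then use it for SSL or KD. Models/prompts are assumptions.
from transformers import pipeline

# Step 1: Generate. An LM (here, off-the-shelf GPT-2 prompted with a task-style
# input) synthesizes in-domain unlabeled text.
generator = pipeline("text-generation", model="gpt2")
prompt = "The movie was"
synthetic = [
    out["generated_text"]
    for out in generator(prompt, max_length=40, num_return_sequences=8, do_sample=True)
]

# Step 2: Annotate. A classifier trained on the labeled task data assigns
# pseudo-labels to the synthetic examples.
annotator = pipeline(
    "text-classification", model="distilbert-base-uncased-finetuned-sst-2-english"
)
pseudo_labeled = [(text, annotator(text)[0]["label"]) for text in synthetic]

# Step 3: Learn. The pseudo-labeled synthetic data augments the labeled training
# set for SSL, or serves as transfer data when distilling into a compact student
# (training loop omitted in this sketch).
for text, label in pseudo_labeled:
    print(label, "|", text[:60])
```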