
Generate, Annotate, and Learn:
NLP with Synthetic Text

Xuanli He1 Islam Nassar1 Jamie Kiros2 Gholamreza Haffari1 Mohammad Norouzi2
1Monash University, Australia 2Google Research, Brain Team
{xuanli.he1, gholamreza.haffari}@monash.edu, mnorouzi@google.com
Abstract

Semi-Supervised Learning (SSL) has seen success in many application domains, but this success often hinges on the availability of task-specific unlabeled data. Knowledge distillation (KD) enables effective optimization of compact neural nets, achieving the best results when the knowledge of an expensive network is distilled via fresh task-specific unlabeled data. However, task-specific unlabeled data can be challenging to find, especially for NLP tasks. We investigate the use of generative models as a source of unlabeled data and present a simple and general framework called “Generate, Annotate, and Learn (GAL)”. A language model (LM) is used to synthesize in-domain unlabeled data. Then, a classifier is used to annotate such data with pseudo-labels. Finally, the synthetically generated, pseudo-labeled data is used for SSL, KD, and few-shot learning in NLP. To obtain a strong task-specific LM, we either fine-tune a large LM on inputs from a given task, or prompt a large LM with a few input examples and conditionally generate more unlabeled examples. GAL yields a new state of the art for 6-layer transformers on the GLUE leaderboard and shows promise on tabular and vision tasks.
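
To make the three-step pipeline concrete, here is a minimal, hedged sketch of a GAL-style loop (generate, annotate, learn) using the Hugging Face transformers pipeline API. The model names, prompt, and sampling settings are illustrative placeholders, not the paper's exact setup; the final student-training step is only indicated in a comment.

```python
# Illustrative GAL-style loop: generate synthetic inputs, pseudo-label them,
# then (not shown) train a compact student on the augmented data.
from transformers import pipeline

# 1. Generate: a task-adapted LM synthesizes in-domain unlabeled inputs.
#    "gpt2" and the prompt below are placeholders for a task-specific LM.
generator = pipeline("text-generation", model="gpt2")
synthetic_inputs = [
    out["generated_text"]
    for out in generator(
        "The movie was", max_new_tokens=30, num_return_sequences=8, do_sample=True
    )
]

# 2. Annotate: a trained task classifier assigns pseudo-labels.
teacher = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder teacher
)
pseudo_labeled = [(text, teacher(text)[0]["label"]) for text in synthetic_inputs]

# 3. Learn: mix pseudo_labeled with the real labeled set and train a compact
#    student model (the SSL / knowledge-distillation step, omitted here).
for text, label in pseudo_labeled:
    print(label, text[:60])
```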