
Inductive Program Synthesis over Noisy Data

Shivam Handa, Electrical Engineering and Computer Science, Massachusetts Institute of Technology, U.S.A. (shivam@mit.edu) and Martin C. Rinard, Electrical Engineering and Computer Science, Massachusetts Institute of Technology, U.S.A. (rinard@csail.mit.edu)
(2020)
Abstract.

We present a new framework and associated synthesis algorithms for program synthesis over noisy data, i.e., data that may contain incorrect/corrupted input-output examples. This framework is based on an extension of finite tree automata called state-weighted finite tree automata. We show how to apply this framework to formulate and solve a variety of program synthesis problems over noisy data. Results from our implemented system running on problems from the SyGuS 2018 benchmark suite highlight its ability to successfully synthesize programs in the face of noisy data sets, including the ability to synthesize a correct program even when every input-output example in the data set is corrupted.

Program Synthesis, Noisy Data, Corrupted Data, Machine Learning
copyright: rights retained. doi: 10.1145/3368089.3409732. journal year: 2020. submission id: fse20main-p447-p. isbn: 978-1-4503-7043-1/20/11. conference: Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '20), November 8–13, 2020, Virtual Event, USA. ccs: Theory of computation / Formal languages and automata theory; Software and its engineering / Programming by example; Computing methodologies / Machine learning.

1. Introduction

In recent years there has been significant interest in learning programs from input-output examples. These techniques have been successfully used to synthesize programs for domains such as string and format transformations (Gulwani, 2011; Singh and Gulwani, 2016), data wrangling (Feng et al., 2017), data completion (Wang et al., 2017b), and data structure manipulation (Feser et al., 2015; Osera and Zdancewic, 2015; Yaghmazadeh et al., 2016). Even though these efforts have been largely successful, they do not aspire to work with noisy data sets that may contain corrupted input-output examples.

We present a new program synthesis technique that is designed to work with noisy/corrupted data sets. Given:

  • Programs: A collection of programs $p$ defined by a grammar $G$ and a bounded scope threshold $d$,

  • Data Set: A data set $\mathcal{D}=\{(\sigma_1,o_1),\ldots,(\sigma_n,o_n)\}$ of input-output examples,

  • Loss Function: A loss function $\mathcal{L}(p,\mathcal{D})$ that measures the cost of the input-output examples on which $p$ produces a different output than the output in the data set $\mathcal{D}$,

  • Complexity Measure: A complexity measure $C(p)$ that measures the complexity of a given program $p$,

  • Objective Function: An arbitrary objective function $U(l,c)$, which maps loss $l$ and complexity $c$ to a totally ordered set, such that for all values of $l$, $U(l,c)$ is monotonically nondecreasing with respect to $c$,

our program synthesis technique produces a program $p$ that minimizes $U(\mathcal{L}(p,\mathcal{D}),C(p))$. Example problems that our program synthesis technique can solve include:

  • Best Fit Program Synthesis: Given a potentially noisy data set $\mathcal{D}$, find a best fit program $p$ for $\mathcal{D}$, i.e., a $p$ that minimizes $\mathcal{L}(p,\mathcal{D})$.

  • Accuracy vs. Complexity Tradeoff: Given a data set $\mathcal{D}$, find $p$ that minimizes the weighted sum $\mathcal{L}(p,\mathcal{D})+\lambda\cdot C(p)$. This problem enables the programmer to define and minimize a weighted tradeoff between the complexity of the program and the loss.

  • Data Cleaning and Correction: Given a data set $\mathcal{D}$, find $p$ that minimizes the loss $\mathcal{L}(p,\mathcal{D})$. Input-output examples with nonzero loss are identified as corrupted and either 1) filtered out or 2) replaced with the output from the synthesized program.

  • Bayesian Program Synthesis: Given a data set $\mathcal{D}$ and a probability distribution $\pi(p)$ over programs $p$, find the most probable program $p$ given $\mathcal{D}$.

  • Best Program for Given Accuracy: Given a data set $\mathcal{D}$ and a bound $b$, find $p$ that minimizes $C(p)$ subject to $\mathcal{L}(p,\mathcal{D})\leq b$. One use case finds the simplest program that agrees with the data set $\mathcal{D}$ on at least $n-b$ input-output examples.

  • Forced Accuracy: Given data sets $\mathcal{D}'$, $\mathcal{D}$, where $\mathcal{D}'\subseteq\mathcal{D}$, find $p$ that minimizes the weighted sum $\mathcal{L}(p,\mathcal{D})+\lambda\cdot C(p)$ subject to $\mathcal{L}(p,\mathcal{D}')\leq b$. One use case finds a program $p$ that minimizes the loss over the data set $\mathcal{D}$ but is always correct on $\mathcal{D}'$ (taking $b=0$).

  • Approximate Program Synthesis: Given a clean (noise-free) data set $\mathcal{D}$, find the least complex program $p$ that minimizes the loss $\mathcal{L}(p,\mathcal{D})$. Here the goal is not to work with a noisy data set, but instead to find the best approximate solution to a synthesis problem when an exact solution does not exist within the collection of considered programs $p$.

1.1. Noise Models and Discrete Noise Sources

We work with noise models that assume a (hidden) clean data set combined with a noise source that delivers the noisy data set presented to the program synthesis system. Like many inductive program synthesis systems (Gulwani, 2011; Singh and Gulwani, 2016), we target discrete problems that involve discrete data such as strings, data structures, or tabular data. In contrast to traditional machine learning problems, which often involve continuous noise sources (Bishop, 2006), the noise sources for discrete problems often involve discrete noise: noise that involves a discrete change that affects only part of each output, leaving the remaining parts intact and uncorrupted.

1.2. Loss Functions and Use Cases

Different loss functions can be appropriate for different noise sources and use cases. The $0/1$ loss function, which counts the number of input-output examples on which the data set $\mathcal{D}$ and synthesized program $p$ do not agree, is a general loss function that can be appropriate when the focus is to maximize the number of inputs for which the synthesized program $p$ produces the correct output. The Damerau-Levenshtein (DL) loss function (Damerau, 1964), which measures the edit distance under character insertions, deletions, substitutions, and/or transpositions, extracts information present in partially corrupted outputs and can be appropriate for measuring discrete noise in input-output examples involving text strings. The $0/\infty$ loss function, which is $\infty$ unless $p$ agrees with all of the input-output examples in the data set $\mathcal{D}$, specializes our technique to the standard program synthesis scenario that requires the synthesized program to agree with all input-output examples.

Because discrete noise sources often leave parts of corrupted outputs intact, exact program synthesis (i.e., synthesizing a program that agrees with all outputs in the hidden clean data set) is often possible even when all outputs in the data set are corrupted. Our experimental results (Section 9) indicate that matching the loss function to the characteristics of the discrete noise source can enable very accurate program synthesis even when 1) there are only a handful of input-output examples in the data and 2) all of the outputs in the data set are corrupted. We attribute this success to the ability of our synthesis technique, working in conjunction with an appropriately designed loss function, to effectively extract information from outputs corrupted by discrete noise sources.

1.3. Technique

Our technique augments finite tree automata (FTA) to associate accepting states with weights that capture the loss for the output associated with each accepting state. Given a data set $\mathcal{D}$, the resulting state-weighted finite tree automata (SFTA) partition the programs $p$ defined by the grammar $G$ into equivalence classes. Each equivalence class consists of all programs with identical input-output behavior on the inputs in $\mathcal{D}$. All programs in a given equivalence class therefore have the same loss over $\mathcal{D}$. The technique then uses dynamic programming to find the minimum complexity program $p$ in each equivalence class (Gallo et al., 1993). From this set of minimum complexity programs, the technique then finds the program $p$ that minimizes the objective function $U(\mathcal{L}(p,\mathcal{D}),C(p))$.

1.4. Experimental Results

We have implemented our technique and applied it to various programs in the SyGuS 2018 benchmark set (SyG, 2018). The results indicate that the technique is effective at solving program synthesis problems over strings with modestly sized solutions even in the presence of substantial noise. For discrete noise sources and a loss function that is a good match for the noise source, the technique is typically able to extract enough information left intact in corrupted outputs to synthesize a correct program even when all outputs are corrupted (in this paper we consider a synthesized program to be correct if it agrees with all input-output examples in the original hidden clean data set). Even with the 0/1 loss function, which does not aspire to extract any information from corrupted outputs, the technique is typically able to synthesize a correct program even with only a few correct (uncorrupted) input-output examples in the data set. Overall the results highlight the potential for effective program synthesis even in the presence of substantial noise.

1.5. Contributions

This paper makes the following contributions:

  • Technique: It presents an implemented technique for inductive program synthesis over noisy data. The technique uses an extension of finite tree automata, state-weighted finite tree automata, to synthesize programs that minimize an objective function involving the loss over the input data set and the complexity of the synthesized program.

  • Use Cases: It presents multiple use cases including best fit program synthesis for noisy data sets, navigating accuracy vs. complexity tradeoffs, Bayesian program synthesis, identifying and correcting corrupted data, and approximate program synthesis.

  • Experimental Results: It presents experimental results from our implemented system on the SyGuS 2018 benchmark set. These results characterize the scalability of the technique and highlight interactions between the DSL, the noise source, the loss function, and the overall effectiveness of the synthesis technique. In particular, they highlight the ability of the technique, given a close match between the noise source and the loss function, to synthesize a correct program $p$ even when 1) there are only a handful of input-output examples in the data set $\mathcal{D}$ and 2) all outputs are corrupted.

2. Preliminaries

We next review finite tree automata (FTA) and FTA-based inductive program synthesis.

2.1. Finite Tree Automata

Finite tree automata are a type of state machine which accept trees rather than strings. They generalize standard finite automata to describe a regular language over trees.

Definition 2.1 (FTA).

A (bottom-up) finite tree automaton (FTA) over alphabet $F$ is a tuple $\mathcal{A}=(Q,F,Q_f,\Delta)$ where $Q$ is a set of states, $Q_f\subseteq Q$ is the set of accepting states, and $\Delta$ is a set of transitions of the form $f(q_1,\ldots,q_k)\rightarrow q$ where $q,q_1,\ldots,q_k$ are states and $f\in F$.

Every symbol $f$ in alphabet $F$ has an associated arity. The set $F_k\subseteq F$ is the set of all $k$-arity symbols in $F$. 0-arity terms $t$ in $F$ are viewed as single-node trees (leaves of trees). A tree $t$ is accepted by an FTA if we can rewrite $t$ to some state $q\in Q_f$ using the rules in $\Delta$. The language of an FTA $\mathcal{A}$, denoted $\mathcal{L}(\mathcal{A})$, is the set of all ground terms accepted by $\mathcal{A}$.

Example 2.2.

Consider the tree automaton $\mathcal{A}$ defined by states $Q=\{q_T,q_F\}$, $F_0=\{\mathsf{True},\mathsf{False}\}$, $F_1=\{\mathsf{not}\}$, $F_2=\{\mathsf{and},\mathsf{or}\}$, final states $Q_f=\{q_T\}$, and the following transition rules $\Delta$:

$$\begin{array}{cc}
\mathsf{True}\rightarrow q_T & \mathsf{False}\rightarrow q_F\\
\mathsf{not}(q_T)\rightarrow q_F & \mathsf{not}(q_F)\rightarrow q_T\\
\mathsf{and}(q_T,q_T)\rightarrow q_T & \mathsf{and}(q_F,q_T)\rightarrow q_F\\
\mathsf{and}(q_T,q_F)\rightarrow q_F & \mathsf{and}(q_F,q_F)\rightarrow q_F\\
\mathsf{or}(q_T,q_T)\rightarrow q_T & \mathsf{or}(q_F,q_T)\rightarrow q_T\\
\mathsf{or}(q_T,q_F)\rightarrow q_T & \mathsf{or}(q_F,q_F)\rightarrow q_F
\end{array}$$

The above tree automaton accepts all propositional logic formulas over $\mathsf{True}$ and $\mathsf{False}$ that evaluate to $\mathsf{True}$. Figure 1 presents the tree for the formula $\mathsf{and}(\mathsf{True},\mathsf{not}(\mathsf{False}))$.

Figure 1. Tree for the formula $\mathsf{and}(\mathsf{True},\mathsf{not}(\mathsf{False}))$
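To make the bottom-up acceptance condition concrete, the following sketch (ours, purely for illustration; Python, not the paper's Java implementation) runs the automaton of Example 2.2 on the tree of Figure 1:

```python
# Illustrative sketch: bottom-up FTA acceptance for the boolean automaton of
# Example 2.2. Trees are nested tuples, e.g. ("and", ("True",), ("not", ("False",))).

DELTA = {
    ("True",): "qT", ("False",): "qF",
    ("not", "qT"): "qF", ("not", "qF"): "qT",
    ("and", "qT", "qT"): "qT", ("and", "qF", "qT"): "qF",
    ("and", "qT", "qF"): "qF", ("and", "qF", "qF"): "qF",
    ("or", "qT", "qT"): "qT", ("or", "qF", "qT"): "qT",
    ("or", "qT", "qF"): "qT", ("or", "qF", "qF"): "qF",
}
ACCEPTING = {"qT"}

def run(tree):
    """Rewrite a tree bottom-up to a state using the rules in DELTA."""
    f, children = tree[0], tree[1:]
    states = tuple(run(c) for c in children)
    return DELTA[(f,) + states]

t = ("and", ("True",), ("not", ("False",)))   # the tree of Figure 1
assert run(t) in ACCEPTING                    # the formula evaluates to True
```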

2.2. Domain Specific Languages (DSLs)

We next define the programs we consider, how inputs to the program are specified, and the program semantics. Without loss of generality, we assume programs $p$ are specified as parse trees in a domain-specific language (DSL) grammar $G$. Internal nodes represent function invocations; leaves are constants/0-arity symbols in the DSL. A program $p$ executes on an input $\sigma$. $\llbracket p\rrbracket\sigma$ denotes the output of $p$ on input $\sigma$ ($\llbracket\cdot\rrbracket$ is defined in Figure 2).

$$\frac{\vphantom{X}}{\llbracket c\rrbracket\sigma\Rightarrow c}~\textsc{(Constant)}\qquad\frac{\vphantom{X}}{\llbracket x\rrbracket\sigma\Rightarrow\sigma(x)}~\textsc{(Variable)}$$

$$\frac{\llbracket n_1\rrbracket\sigma\Rightarrow v_1\quad\llbracket n_2\rrbracket\sigma\Rightarrow v_2\quad\ldots\quad\llbracket n_k\rrbracket\sigma\Rightarrow v_k}{\llbracket f(n_1,n_2,\ldots,n_k)\rrbracket\sigma\Rightarrow f(v_1,v_2,\ldots,v_k)}~\textsc{(Function)}$$

Figure 2. Execution semantics for program $p$

All valid programs (which can be executed) are defined by a DSL grammar $G=(T,N,P,s_0)$ where:

  • $T$ is a set of terminal symbols. These may include constants and symbols whose values may change depending on the input $\sigma$.

  • $N$ is the set of nonterminals that represent subexpressions in our DSL.

  • $P$ is the set of production rules of the form $s\rightarrow f(s_1,\ldots,s_n)$, where $f$ is a built-in function in the DSL and $s,s_1,\ldots,s_n$ are nonterminals in the grammar.

  • $s_0\in N$ is the start nonterminal in the grammar.

We assume that we are given a black-box implementation of each built-in function $f$ in the DSL. In general, all techniques explored in this paper generalize to any DSL that can be specified within the above framework.

Example 2.3.

The following DSL defines expressions over input x, constants 2 and 3, and addition and multiplication:

$$\begin{array}{rcl}
n&:=&x~~|~~n+t~~|~~n\times t;\\
t&:=&2~~|~~3;
\end{array}$$
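The execution semantics of Figure 2 is a straightforward recursive evaluation of the parse tree. The following illustrative sketch (our rendering, with `+` and `*` as the only built-in functions) evaluates programs in this DSL:

```python
# Illustrative sketch: parse trees and the execution semantics of Figure 2 for
# this DSL. A program is a nested tuple; sigma maps variable names to values.

def evaluate(p, sigma):
    if isinstance(p, int):              # (Constant): [[c]]sigma => c
        return p
    if isinstance(p, str):              # (Variable): [[x]]sigma => sigma(x)
        return sigma[p]
    f, *args = p                        # (Function): evaluate children, apply f
    vals = [evaluate(a, sigma) for a in args]
    return {"+": lambda a, b: a + b, "*": lambda a, b: a * b}[f](*vals)

# The program (x + 2) * 3 maps input x = 1 to output 9.
p = ("*", ("+", "x", 2), 3)
assert evaluate(p, {"x": 1}) == 9
```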

2.3. Concrete Finite Tree Automata

We review the approach introduced by (Wang et al., 2017a, b) to use finite tree automata to solve synthesis tasks over a broad class of DSLs. Given a DSL and a set of input-output examples, a Concrete Finite Tree Automaton (CFTA) is a tree automaton which accepts all trees representing DSL programs consistent with the input-output examples and nothing else. The states of the FTA correspond to concrete values and the transitions are obtained using the semantics of the DSL constructs.

$$\frac{t\in T\qquad\llbracket t\rrbracket\sigma=c}{q^c_t\in Q}~\textsc{(Term)}\qquad\frac{q^o_{s_0}\in Q}{q^o_{s_0}\in Q_f}~\textsc{(Final)}$$

$$\frac{s\rightarrow f(s_1,\ldots,s_k)\in P\qquad\{q^{c_1}_{s_1},\ldots,q^{c_k}_{s_k}\}\subseteq Q\qquad\llbracket f(c_1,\ldots,c_k)\rrbracket\sigma=c}{q^c_s\in Q,\quad f(q^{c_1}_{s_1},\ldots,q^{c_k}_{s_k})\rightarrow q^c_s\in\Delta}~\textsc{(Prod)}$$

Figure 3. Rules for constructing a CFTA $\mathcal{A}=(Q,F,Q_f,\Delta)$ given input $\sigma$, output $o$, and grammar $G=(T,N,P,s_0)$.

Given an input-output example $(\sigma,o)$ and DSL $(G,\llbracket\cdot\rrbracket)$, we construct a CFTA using the rules in Figure 3. The alphabet of the CFTA contains the built-in functions of the DSL. The states of the CFTA are of the form $q^c_s$, where $s$ is a symbol (terminal or nonterminal) in $G$ and $c$ is a concrete value. The existence of state $q^c_s$ implies that there exists a partial program that maps $\sigma$ to concrete value $c$. Similarly, the existence of transition $f(q^{c_1}_{s_1},q^{c_2}_{s_2},\ldots,q^{c_k}_{s_k})\rightarrow q^c_s$ implies $f(c_1,c_2,\ldots,c_k)=c$.

The $\mathsf{Term}$ rule states that if we have a terminal $t$ (either a constant in our language or the input symbol $x$), we execute it on the input $\sigma$ and construct a state $q^c_t$ (where $c=\llbracket t\rrbracket\sigma$). The $\mathsf{Final}$ rule states that, given start symbol $s_0$ and expected output $o$, if $q^o_{s_0}$ exists, then it is an accepting state. The $\mathsf{Prod}$ rule states that if we have a production rule $s\rightarrow f(s_1,s_2,\ldots,s_k)\in P$ and there exist states $q^{c_1}_{s_1},q^{c_2}_{s_2},\ldots,q^{c_k}_{s_k}\in Q$, then we also have the state $q^c_s$ in the CFTA and a transition $f(q^{c_1}_{s_1},q^{c_2}_{s_2},\ldots,q^{c_k}_{s_k})\rightarrow q^c_s$, where $c=\llbracket f(c_1,\ldots,c_k)\rrbracket\sigma$.

The language of the CFTA constructed from Figure 3 is exactly the set of parse trees of DSL programs that are consistent with our input-output example (i.e., that map input $\sigma$ to output $o$).

In general, the rules in Figure 3 may result in a CFTA with infinitely many states. To control the size of the resulting CFTA, we do not add a new state to the constructed CFTA if the smallest tree it accepts is larger than a given threshold $d$. This results in a CFTA that accepts all programs that are consistent with the input-output example and smaller than the given threshold (it may accept some programs larger than the given threshold, but it will never accept a program that is inconsistent with the input-output example). This is standard practice in the synthesis literature (Wang et al., 2017a; Polozov and Gulwani, 2015).
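The following sketch illustrates the rules of Figure 3, specialized to the DSL of Example 2.3 rather than implementing the general algorithm; here the round count `d` plays the role of the bounded scope threshold:

```python
# Illustrative sketch of the rules in Figure 3, specialized to the DSL of
# Example 2.3. States are (symbol, value) pairs; delta records transitions as
# (f, arg_states, result_state) triples.

def build_cfta(sigma, o, d):
    states = {("n", sigma["x"]), ("t", 2), ("t", 3)}   # (Term)
    delta = []
    for _ in range(d):                                 # (Prod): n := n+t | n*t
        new = set()
        for (s1, c1) in states:
            for (s2, c2) in states:
                if s1 == "n" and s2 == "t":
                    for f, c in (("+", c1 + c2), ("*", c1 * c2)):
                        new.add(("n", c))
                        delta.append((f, ((s1, c1), (s2, c2)), ("n", c)))
        states |= new                                  # bounded scope: d rounds
    accepting = {("n", o)} & states                    # (Final)
    return states, delta, accepting

_, _, acc = build_cfta({"x": 1}, 9, d=2)
assert acc == {("n", 9)}        # e.g. (x + 2) * 3 maps input 1 to output 9
```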

2.4. Intersection of Two CFTAs

Given two CFTAs $\mathcal{A}_1$ and $\mathcal{A}_2$ built over the same grammar $G$ from input-output examples $(\sigma_1,o_1)$ and $(\sigma_2,o_2)$ respectively, the intersection $\mathcal{A}$ of these two automata contains the programs that satisfy both input-output examples (or has the empty language). Given CFTAs $\mathcal{A}=(Q,F,Q_f,\Delta)$ and $\mathcal{A}'=(Q',F,Q'_f,\Delta')$, $\mathcal{A}^*=(Q^*,F,Q^*_f,\Delta^*)$ is the intersection of $\mathcal{A}$ and $\mathcal{A}'$, where $Q^*$, $Q^*_f$, and $\Delta^*$ are the smallest sets such that:

$$\text{if } q^{\vec{c_1}}_s\in Q \text{ and } q^{\vec{c_2}}_s\in Q' \text{ then } q^{\vec{c_1}:\vec{c_2}}_s\in Q^*$$
$$\text{if } q^{\vec{c_1}}_s\in Q_f \text{ and } q^{\vec{c_2}}_s\in Q'_f \text{ then } q^{\vec{c_1}:\vec{c_2}}_s\in Q^*_f$$
$$\text{if } f(q^{\vec{c_1}}_{s_1},\ldots,q^{\vec{c_k}}_{s_k})\rightarrow q^{\vec{c}}_s\in\Delta \text{ and } f(q^{\vec{c'_1}}_{s_1},\ldots,q^{\vec{c'_k}}_{s_k})\rightarrow q^{\vec{c'}}_s\in\Delta' \text{ then } f(q^{\vec{c_1}:\vec{c'_1}}_{s_1},\ldots,q^{\vec{c_k}:\vec{c'_k}}_{s_k})\rightarrow q^{\vec{c}:\vec{c'}}_s\in\Delta^*$$

where $\vec{c}$ denotes a vector of values and $\vec{c_1}:\vec{c_2}$ denotes the vector constructed by appending $\vec{c_2}$ to the end of $\vec{c_1}$.
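A minimal sketch of this product construction, under the state and transition representations assumed in our earlier sketches:

```python
# Illustrative sketch: product construction for intersecting two CFTAs built
# over the same grammar. A state is a (symbol, values) pair, where values is a
# tuple of concrete values (one per example); transitions are
# (f, arg_states, result_state) triples.

def intersect(delta1, acc1, delta2, acc2):
    delta = []
    for (f, args1, (s, v1)) in delta1:
        for (g, args2, (t, v2)) in delta2:
            # pair transitions with the same function and the same symbols
            if f == g and s == t and [a[0] for a in args1] == [a[0] for a in args2]:
                args = tuple((a[0], a[1] + b[1]) for a, b in zip(args1, args2))
                delta.append((f, args, (s, v1 + v2)))   # state q_s^{c1 : c2}
    # accepting states pair up accepting states with the same symbol
    acc = {(s, v1 + v2) for (s, v1) in acc1 for (t, v2) in acc2 if s == t}
    return delta, acc
```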

3. Loss Functions

Given a data set $\mathcal{D}=\{(\sigma_1,o_1),\ldots,(\sigma_n,o_n)\}$ and a program $p$, a loss function $\mathcal{L}(p,\mathcal{D})$ calculates how incorrect the program is with respect to the given data set. We work with loss functions $\mathcal{L}(p,\mathcal{D})$ that depend only on the data set and the outputs of the program on the inputs in the data set, i.e., given programs $p_1,p_2$ such that $\llbracket p_1\rrbracket\sigma_i=\llbracket p_2\rrbracket\sigma_i$ for all $(\sigma_i,o_i)\in\mathcal{D}$, then $\mathcal{L}(p_1,\mathcal{D})=\mathcal{L}(p_2,\mathcal{D})$. We further assume that the loss function $\mathcal{L}(p,\mathcal{D})$ can be expressed in the following form:

$$\mathcal{L}(p,\mathcal{D})=\sum\limits_{i=1}^{n}L(o_i,\llbracket p\rrbracket\sigma_i)$$

where $L(o_i,\llbracket p\rrbracket\sigma_i)$ is a per-example loss function.

Definition 3.1 ($0/1$ Loss Function).

The $0/1$ loss function $\mathcal{L}_{0/1}(p,\mathcal{D})$ counts the number of input-output examples on which $p$ does not agree with the data set $\mathcal{D}$:

$$\mathcal{L}_{0/1}(p,\mathcal{D})=\sum\limits_{i=1}^{n}1~~\mathrm{if}~~(o_i\neq\llbracket p\rrbracket\sigma_i)~~\mathrm{else}~~0$$
Definition 3.2 ($0/\infty$ Loss Function).

The $0/\infty$ loss function $\mathcal{L}_{0/\infty}(p,\mathcal{D})$ is 0 if $p$ matches all outputs in the data set $\mathcal{D}$ and $\infty$ otherwise:

$$\mathcal{L}_{0/\infty}(p,\mathcal{D})=0~~\mathrm{if}~~(\forall(\sigma,o)\in\mathcal{D}.\ o=\llbracket p\rrbracket\sigma)~~\mathrm{else}~~\infty$$
Definition 3.3 (Damerau-Levenshtein (DL) Loss Function).

The DL loss function $\mathcal{L}_{DL}(p,\mathcal{D})$ uses the Damerau-Levenshtein metric (Damerau, 1964) to measure the distance between the output of the synthesized program and the corresponding output in the noisy data set:

$$\mathcal{L}_{DL}(p,\mathcal{D})=\sum\limits_{(\sigma_i,o_i)\in\mathcal{D}}L_{\llbracket p\rrbracket\sigma_i,\,o_i}\big(\left|\llbracket p\rrbracket\sigma_i\right|,\left|o_i\right|\big)$$

where $L_{a,b}(i,j)$ is the Damerau-Levenshtein metric (Damerau, 1964).

This metric counts the number of single-character deletions, insertions, substitutions, or transpositions required to convert one text string into another. Because more than $80\%$ of all human misspellings are reported to be captured by a single one of these four operations (Damerau, 1964), the DL loss function may be appropriate for computations that work with human-provided text input-output examples.
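For concreteness, the per-example DL loss can be computed with a standard dynamic program; the sketch below implements the restricted (optimal string alignment) variant of the metric, which suffices for illustration:

```python
# Illustrative sketch: the per-example DL loss as a dynamic program. This is
# the restricted "optimal string alignment" variant of the Damerau-Levenshtein
# metric: single-character deletions, insertions, substitutions, and
# transpositions of adjacent characters each cost 1.

def dl_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

# The loss over a data set sums the per-example distances.
assert dl_distance("Sundy", "Sunday") == 1    # one insertion
assert dl_distance("abcd", "abdc") == 1       # one adjacent transposition
```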

4. Complexity Measure

Given a program $p$, a complexity measure $C(p)$ ranks programs independently of the input-output examples in the data set $\mathcal{D}$. This measure is used to trade off performance on the noisy data set against the complexity of the synthesized program. Formally, a complexity measure is a function $C(p)$ that maps each program $p$ expressible in the given DSL $G$ to a real number. The following $\mathrm{Cost}(p)$ complexity measure computes the complexity of a given program $p$, represented as a parse tree, recursively as follows:

$$\begin{array}{rcl}
\mathrm{Cost}(t)&=&\mathrm{cost}(t)\\
\mathrm{Cost}(f(e_1,e_2,\ldots,e_k))&=&\mathrm{cost}(f)+\sum\limits_{i=1}^{k}\mathrm{Cost}(e_i)
\end{array}$$

where $t$ and $f$ are terminals and built-in functions in our DSL, respectively. Setting $\mathrm{cost}(t)=\mathrm{cost}(f)=1$ delivers a complexity measure $\mathrm{Size}(p)$ that computes the size of $p$.

Given an FTA $\mathcal{A}$, we can use dynamic programming to find the minimum complexity parse tree (under the above $\mathrm{Cost}(p)$ measure) accepted by $\mathcal{A}$ (Gallo et al., 1993). In general, given an FTA $\mathcal{A}$, we assume we are provided with a method to find the program $p$ accepted by $\mathcal{A}$ that minimizes the complexity measure.
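The following sketch illustrates one way such a dynamic program can be organized: a fixpoint relaxation in the spirit of shortest-path algorithms for directed hypergraphs (Gallo et al., 1993). It is a sketch under the transition representation of our earlier examples, not the paper's implementation:

```python
# Illustrative sketch: minimum-Cost extraction from an FTA by fixpoint
# relaxation. leaves is a list of (terminal, state) pairs, delta a list of
# (f, arg_states, result_state) transitions; cost(sym) = 1 gives Size(p).

def min_cost_programs(leaves, delta, cost=lambda sym: 1):
    best = {}                                  # state -> (cost, witness tree)
    for (t, q) in leaves:
        if q not in best or cost(t) < best[q][0]:
            best[q] = (cost(t), (t,))
    changed = True
    while changed:                             # terminates: costs are positive
        changed = False
        for (f, args, q) in delta:
            if all(a in best for a in args):
                c = cost(f) + sum(best[a][0] for a in args)
                if q not in best or c < best[q][0]:
                    best[q] = (c, (f,) + tuple(best[a][1] for a in args))
                    changed = True
    return best    # the cheapest tree at an accepting state is the answer
```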

5. Objective Functions

Given loss $l$ and complexity $c$, an objective function $U(l,c)$ maps $l,c$ to a totally ordered set such that for all $l$, $U(l,c)$ is monotonically nondecreasing with respect to $c$.

Definition 5.1 (Tradeoff Objective Function).

Given a tradeoff parameter $\lambda>0$, the tradeoff objective function is $U_\lambda(l,c)=l+\lambda\cdot c$.

This objective function trades the loss of the synthesized program off against the complexity of the synthesized program. Similarly to how regularization can prevent a machine learning model from overfitting noisy data by biasing the training algorithm to pick a simpler model, the tradeoff objective function may prevent the algorithm from synthesizing a program which overfits the data by biasing it to pick a simpler program (based on the complexity measure).

Definition 5.2 (Lexicographic Objective Function).

A lexicographic objective function $U_L(l,c)=\langle l,c\rangle$ maps $l$ and $c$ into a lexicographically ordered space, i.e., $\langle l_1,c_1\rangle<\langle l_2,c_2\rangle$ if and only if either $l_1<l_2$, or $l_1=l_2$ and $c_1<c_2$.

This objective function first minimizes the loss, then the complexity. It may be appropriate, for example, for best fit program synthesis, data cleaning and correction, and approximate program synthesis over clean data sets.
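Both objective functions take only a few lines of code; the sketch below renders them as Python callables (Python tuples compare lexicographically, which directly realizes $U_L$):

```python
# Illustrative sketch: the two objective functions as Python callables. Each
# maps (loss, complexity) into a totally ordered set and is nondecreasing in c.

def tradeoff(lam):
    """U_lambda(l, c) = l + lambda * c."""
    return lambda l, c: l + lam * c

def lexicographic(l, c):
    """U_L(l, c) = <l, c>; tuples compare lexicographically."""
    return (l, c)

assert lexicographic(0, 100) < lexicographic(1, 1)   # loss dominates complexity
```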

6. State-Weighted FTA

State-weighted finite tree automata (SFTA) are FTAs augmented with a weight function that attaches a weight to each accepting state.

Definition 6.1 (SFTA).

A state-weighted finite tree automaton (SFTA) over alphabet $F$ is a tuple $\mathcal{A}=(Q,F,Q_f,\Delta,w)$ where $Q$ is a set of states, $Q_f\subseteq Q$ is the set of accepting states, $\Delta$ is a set of transitions of the form $f(q_1,\ldots,q_k)\rightarrow q$ where $q,q_1,\ldots,q_k$ are states and $f\in F$, and $w:Q_f\rightarrow\mathbb{R}$ is a function that assigns a weight $w(q)$ to each accepting state $q\in Q_f$.

Because CFTAs are designed to handle synthesis over clean (noise-free) data sets, they have only one accepting state $q^o_{s_0}$ (the state with start symbol $s_0$ and output value $o$). SFTAs weaken this condition to allow multiple accepting states with attached weights.

$$\frac{t\in T\qquad\llbracket t\rrbracket\sigma=c}{q^c_t\in Q}~\textsc{(Term)}\qquad\frac{q^c_{s_0}\in Q}{q^c_{s_0}\in Q_f,\quad w(q^c_{s_0})=L(o,c)}~\textsc{(Final)}$$

$$\frac{s\rightarrow f(s_1,\ldots,s_k)\in P\qquad\{q^{c_1}_{s_1},\ldots,q^{c_k}_{s_k}\}\subseteq Q\qquad\llbracket f(c_1,\ldots,c_k)\rrbracket\sigma=c}{q^c_s\in Q,\quad f(q^{c_1}_{s_1},\ldots,q^{c_k}_{s_k})\rightarrow q^c_s\in\Delta}~\textsc{(Prod)}$$

Figure 4. Rules for constructing an SFTA $\mathcal{A}=(Q,F,Q_f,\Delta,w)$ given input $\sigma$, per-example loss function $L$, and grammar $G=(T,N,P,s_0)$.

Given an input-output example $(\sigma,o)$ and per-example loss function $L(o,c)$, Figure 4 presents rules for constructing an SFTA whose accepting-state weights give the loss of each program $p$ on the example $(\sigma,o)$. The SFTA $\mathsf{Final}$ rule (Figure 4) marks all states $q^c_{s_0}$ with start symbol $s_0$ as accepting states, regardless of the concrete value $c$ attached to the state. The rule also associates the loss $L(o,c)$ for concrete value $c$ and output $o$ with the state $q^c_{s_0}$ as the weight $w(q^c_{s_0})=L(o,c)$. The CFTA $\mathsf{Final}$ rule (Figure 3), in contrast, marks only the state $q^o_{s_0}$ (with output value $o$ and start symbol $s_0$) as the accepting state.

An SFTA divides the set of programs in the DSL into subsets. Given an input $\sigma$, all programs in a subset produce the same output (and therefore reach the same accepting state), and the SFTA assigns the weight $w(q^c_{s_0})=L(o,c)$ to this subset of programs.

We denote the SFTA constructed from DSL $G$, example $(\sigma,o)$, per-example loss function $L$, and threshold $d$ by $\mathcal{A}^d_G(\sigma,o,L)$. We omit the grammar subscript $G$ and threshold $d$ wherever they are obvious from context.

Figure 5. The SFTA constructed in Example 6.2. Each accepting state is labeled with its computed value and (in red) its weight: $(1,64)$, $(2,49)$, $(3,36)$, $(4,25)$, $(5,16)$, $(6,9)$, $(7,4)$, $(8,1)$, $(9,0)$, $(12,9)$.
Example 6.2.

Consider the DSL presented in Example 2.3. Given the input-output example $(\{x\rightarrow 1\},9)$ and weight function $l(c)=(c-9)^2$, Figure 5 presents the SFTA that represents all programs of height less than 3.

For readability, we omit the states for terminals 2 and 3. For each accepting state, the first number (in black) represents the computed value and the second number (in red) represents the weight of the accepting state.
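The following sketch, which adapts the CFTA sketch from Section 2.3, reproduces the accepting-state weights of Figure 5 for this example (an illustration specialized to this DSL, not the general construction):

```python
# Illustrative sketch: the SFTA for Example 6.2, obtained from the CFTA
# construction by making every start-symbol state accepting with weight
# L(o, c) = (c - o)**2 instead of filtering on c == o. d = 2 rounds bound the
# program height at less than 3, as in the example.

def build_sfta(sigma, o, d, loss=lambda o, c: (c - o) ** 2):
    states = {("n", sigma["x"]), ("t", 2), ("t", 3)}
    for _ in range(d):
        states |= {("n", f(c1, c2))
                   for (s1, c1) in states if s1 == "n"
                   for (s2, c2) in states if s2 == "t"
                   for f in (lambda a, b: a + b, lambda a, b: a * b)}
    return {c: loss(o, c) for (s, c) in states if s == "n"}   # (Final) weights

weights = build_sfta({"x": 1}, 9, d=2)
assert weights[9] == 0 and weights[8] == 1 and weights[12] == 9  # as in Figure 5
```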

6.1. Operations over SFTAs

Definition 6.3 ($+$ Intersection).

Given two SFTAs $\mathcal{A}_1=(Q_1,F,Q^1_f,\Delta_1,w_1)$ and $\mathcal{A}_2=(Q_2,F,Q^2_f,\Delta_2,w_2)$, an SFTA $\mathcal{A}=(Q,F,Q_f,\Delta,w)$ is the $+$ intersection of $\mathcal{A}_1$ and $\mathcal{A}_2$ if the underlying CFTA of $\mathcal{A}$ is the intersection of the CFTAs of $\mathcal{A}_1$ and $\mathcal{A}_2$, and the weight of each accepting state in $\mathcal{A}$ is the sum of the weights of the corresponding accepting states in $\mathcal{A}_1$ and $\mathcal{A}_2$. Formally:

  • The CFTA $(Q,F,Q_f,\Delta)$ is the intersection of the CFTAs $(Q_1,F,Q^1_f,\Delta_1)$ and $(Q_2,F,Q^2_f,\Delta_2)$

  • $w(q^{\vec{c_1}:\vec{c_2}}_s)=w_1(q^{\vec{c_1}}_s)+w_2(q^{\vec{c_2}}_s)$ (for $q^{\vec{c_1}:\vec{c_2}}_s\in Q_f$).

Given two SFTAs $\mathcal{A}_1$ and $\mathcal{A}_2$, $\mathcal{A}_1+\mathcal{A}_2$ denotes the $+$ intersection of $\mathcal{A}_1$ and $\mathcal{A}_2$.
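The weight bookkeeping of the $+$ intersection is a one-line extension of the product construction of Section 2.4. A sketch, assuming accepting-state weights are stored as dictionaries keyed by (symbol, value-vector) states:

```python
# Illustrative sketch: weights of the + intersection. w1 and w2 map accepting
# states (symbol, values) to weights; the underlying CFTA is the product
# construction of Section 2.4, and corresponding weights add up.

def plus_intersect_weights(w1, w2):
    return {(s, v1 + v2): w1[(s, v1)] + w2[(t, v2)]
            for (s, v1) in w1 for (t, v2) in w2 if s == t}

w = plus_intersect_weights({("n", (9,)): 0}, {("n", (12,)): 1})
assert w == {("n", (9, 12)): 1}
```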

Definition 6.4 ($/$ Intersection).

Given an SFTA $\mathcal{A}=(Q,F,Q_f,\Delta,w)$ and a CFTA $\mathcal{A}^*=(Q^*,F,Q^*_f,\Delta^*)$, an SFTA $\mathcal{A}'=(Q',F,Q'_f,\Delta',w')$ is the $/$ intersection of $\mathcal{A}$ and $\mathcal{A}^*$ if the underlying FTA of $\mathcal{A}'$ is the intersection of $\mathcal{A}$ and $\mathcal{A}^*$, and the weight of each accepting state in $\mathcal{A}'$ is the weight of the corresponding accepting state in $\mathcal{A}$. Formally:

  • The CFTA $(Q',F,Q'_f,\Delta')$ is the intersection of the FTAs $(Q,F,Q_f,\Delta)$ and $(Q^*,F,Q^*_f,\Delta^*)$

  • $w'(q^{\vec{c_1}:\vec{c_2}}_s)=w(q^{\vec{c_1}}_s)$ (for $q^{\vec{c_1}:\vec{c_2}}_s\in Q'_f$).

Given an SFTA $\mathcal{A}$ and a CFTA $\mathcal{A}^*$, $\mathcal{A}/\mathcal{A}^*$ denotes the $/$ intersection of $\mathcal{A}$ and $\mathcal{A}^*$.

Given a single input-output example, a CFTA built on that example accepts only programs that are consistent with that example. $/$ intersection is thus a simple method to prune an SFTA so that it contains only programs consistent with an input-output example.

Definition 6.5 ($w_0$-pruned SFTA).

An SFTA $\mathcal{A}'=(Q,F,Q'_f,\Delta,w')$ is the $w_0$-pruned SFTA of $\mathcal{A}=(Q,F,Q_f,\Delta,w)$ if we remove from $Q_f$ all accepting states with weight greater than $w_0$. Formally, $Q'_f=\{q \mid q\in Q_f \wedge w(q)\leq w_0\}$ and $w'(q)=w(q)$ for $q\in Q'_f$.

Given an SFTA $\mathcal{A}$, $\mathcal{A}\downarrow_{w_0}$ denotes the $w_0$-pruned SFTA of $\mathcal{A}$.

Definition 6.6 ($q$-selection).

Given an SFTA $\mathcal{A}=(Q,F,Q_f,\Delta,w)$ and an accepting state $q\in Q_f$, the CFTA $(Q,F,\{q\},\Delta)$ is called the $q$-selection of SFTA $\mathcal{A}$.

Given an SFTA $\mathcal{A}$, the notation $\mathcal{A}_q$ denotes the $q$-selection of SFTA $\mathcal{A}$.

7. Synthesis Over Noisy Data

Given a data set $\{(\sigma_1,o_1),\ldots,(\sigma_n,o_n)\}$ of input-output examples and loss function $\mathcal{L}(p,\mathcal{D})$ with per-example loss function $L$, we construct an SFTA for each input-output example: $\mathcal{A}_1,\mathcal{A}_2,\ldots,\mathcal{A}_n$, where $\mathcal{A}_i=\mathcal{A}(\sigma_i,o_i,L)$.

Theorem 1.

Given an SFTA $\mathcal{A}=\mathcal{A}(\sigma,o,L)=(Q,F,Q_f,\Delta,w)$, for all accepting states $q\in Q_f$ and for all programs $p$ accepted by the $q$-selection automaton $\mathcal{A}_q$:

$$L(o,\llbracket p\rrbracket\sigma)=w(q)$$
Proof.

Consider any state $q\in Q_f$. All programs accepted at state $q$ compute the same concrete value $c$ on the given input $\sigma$; hence for all programs $p$ accepted by the $q$-selection automaton $\mathcal{A}_q$, $\llbracket p\rrbracket\sigma=c$. By construction (Figure 4), $w(q)=L(o,c)=L(o,\llbracket p\rrbracket\sigma)$. ∎

Let SFTA $\mathcal{A}(\mathcal{D},L)$ be the $+$ intersection of the SFTAs defined on the input-output examples in $\mathcal{D}$. Formally:

$$\mathcal{A}(\mathcal{D},L)=\mathcal{A}(\sigma_1,o_1,L)+\mathcal{A}(\sigma_2,o_2,L)+\ldots+\mathcal{A}(\sigma_n,o_n,L)$$

Since the size of each SFTA $\mathcal{A}(\sigma_i,o_i,L)$ is bounded, the cost of computing $\mathcal{A}(\mathcal{D},L)$ is $\mathcal{O}(|\mathcal{D}|)$.

Theorem 2.

Given $\mathcal{A}(\mathcal{D},L)=(Q,F,Q_f,\Delta,w)$ as defined above, for all accepting states $q\in Q_f$ and for all programs $p$ accepted by the $q$-selection automaton $\mathcal{A}(\mathcal{D},L)_q$:

$$\mathcal{L}(p,\mathcal{D})=w(q)$$

i.e., the weight of the state $q$ measures the loss of the programs accepted by $q$ on data set $\mathcal{D}$.

Proof.

Consider any accepting state $q\in Q_f$. Since $\mathcal{A}(\mathcal{D},L)$ is an intersection of the SFTAs $\mathcal{A}_1,\ldots,\mathcal{A}_n$ (where $\mathcal{A}_i=\mathcal{A}(\sigma_i,o_i,L)=(Q_i,F,(Q_f)_i,\Delta_i,w_i)$), there exist accepting states $q_1\in(Q_f)_1,q_2\in(Q_f)_2,\ldots,q_n\in(Q_f)_n$ such that all programs $p$ accepted by $\mathcal{A}_q$ are accepted by $(\mathcal{A}_1)_{q_1},(\mathcal{A}_2)_{q_2},\ldots,(\mathcal{A}_n)_{q_n}$.

From Theorem 1, for all programs $p$ accepted by $\mathcal{A}_q$, $w_i(q_i)=L(o_i,\llbracket p\rrbracket\sigma_i)$. From the definition of $+$ intersection,

$$w(q)=\sum\limits_{i=1}^{n}w_i(q_i)=\sum\limits_{i=1}^{n}L(o_i,\llbracket p\rrbracket\sigma_i)=\mathcal{L}(p,\mathcal{D})\qquad ∎$$

Input : DSL $G$, threshold $d$, data set $\mathcal{D}$, per-example loss function $L$, complexity measure $C$, and objective function $U$
Result: Synthesized program $p^*$
$\mathcal{A}(\mathcal{D},L)=(Q,F,Q_f,\Delta,w)$
# the SFTA over data set $\mathcal{D}$ and per-example loss function $L$
foreach $q\in Q_f$ do
       $p_q\leftarrow\mathsf{argmin}_{p\in\mathcal{A}(\mathcal{D},L)_q}C(p)$
       # for each accepting state $q$, find the minimum-complexity program $p_q$
end foreach
$q^*\leftarrow\mathsf{argmin}_{q\in Q_f}U(w(q),C(p_q))$
$p^*\leftarrow p_{q^*}$
Algorithm 1: Synthesis Algorithm

Algorithm 1 presents the base algorithm to synthesize programs within various noisy synthesis settings.
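The following sketch renders Algorithm 1 for the DSL of Example 2.3 (reusing `evaluate` from the Section 2.2 sketch). It makes the equivalence-class view explicit by enumerating programs and grouping them by output vector, forgoing the FTA sharing that makes the real algorithm efficient; the data set and the corrupted example below are hypothetical:

```python
# Illustrative sketch of Algorithm 1 for the DSL of Example 2.3. Programs with
# identical outputs on all inputs form one equivalence class (one accepting
# state); we keep a minimum-size representative and minimize the objective.

def enum_programs(d):
    progs = ["x"]                                  # the size-1 program
    for _ in range(d):                             # apply n := n+t | n*t
        progs += [(f, p, t) for p in progs for t in (2, 3) for f in ("+", "*")]
    return progs

def size(p):                                       # the Size(p) measure
    return 1 if not isinstance(p, tuple) else 1 + size(p[1]) + size(p[2])

def synthesize(dataset, d, per_example_loss, U):
    classes = {}                                   # output vector -> min-size program
    for p in enum_programs(d):
        key = tuple(evaluate(p, sigma) for (sigma, _) in dataset)
        if key not in classes or size(p) < size(classes[key]):
            classes[key] = p
    def score(item):
        outs, p = item
        loss = sum(per_example_loss(o, c) for (_, o), c in zip(dataset, outs))
        return U(loss, size(p))
    return min(classes.items(), key=score)[1]

# Hypothetical noisy data set: the true program is (x+2)*3 and the third
# output is corrupted (15 -> 14). 0/1 loss + lexicographic objective recovers it.
data = [({"x": 1}, 9), ({"x": 2}, 12), ({"x": 3}, 14)]
best = synthesize(data, 3, lambda o, c: int(o != c), lambda l, c: (l, c))
assert evaluate(best, {"x": 1}) == 9 and evaluate(best, {"x": 3}) == 15
```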

Theorem 3.

The program pp^{*} returned by Algorithm 1 is equal to pp^{\prime} where

p=𝖺𝗋𝗀𝗆𝗂𝗇pGdU(L(p,D),C(p))p^{\prime}=\mathsf{argmin}_{p\in G_{d}}U({\mathcal{}L}(p,{\mathcal{}D}),C(p))
Proof.

Given $\mathcal{A}(\mathcal{D},L)=(Q,F,Q_f,\Delta,w)$, the program $p^*$ returned by Algorithm 1 equals $p_{q^*}$, where $q^*=\mathsf{argmin}_{q\in Q_f}U(w(q),C(p_q))$ and $p_q=\mathsf{argmin}_{p\in\mathcal{A}(\mathcal{D},L)_q}C(p)$. We can rewrite $q^*$ as

$$q^*=\mathsf{argmin}_{q\in Q_f}U\Big(w(q),\min_{p\in\mathcal{A}(\mathcal{D},L)_q}C(p)\Big)$$

Since for any $l$, $U(l,c)$ is nondecreasing with respect to $c$, we can rewrite $q^*$ as

$$q^*=\mathsf{argmin}_{q\in Q_f}\min_{p\in\mathcal{A}(\mathcal{D},L)_q}U(w(q),C(p))$$

By Theorem 2, $w(q)=\mathcal{L}(p,\mathcal{D})$ for any $p\in\mathcal{A}(\mathcal{D},L)_q$, so

$$q^*=\mathsf{argmin}_{q\in Q_f}\min_{p\in\mathcal{A}(\mathcal{D},L)_q}U(\mathcal{L}(p,\mathcal{D}),C(p))$$

Because $q^*$ is the accepting state of $p^*$, and $p\in\mathcal{A}(\mathcal{D},L)$ if and only if $\exists\,q\in Q_f.\ p\in\mathcal{A}(\mathcal{D},L)_q$, we can rewrite the above equation as

$$p^*=\mathsf{argmin}_{p\in\mathcal{A}(\mathcal{D},L)}U(\mathcal{L}(p,\mathcal{D}),C(p))$$

The set of programs accepted by $\mathcal{A}(\mathcal{D},L)$ is exactly the set of programs in grammar $G_d$. Hence proved. ∎

We next present several modifications of the core algorithm to solve various synthesis problems.

7.1. Accuracy vs. Complexity Tradeoff

Given a DSL $G$, a data set $\mathcal{D}$, loss function $\mathcal{L}$, complexity measure $C$, and positive weight $\lambda$, we wish to find a program $p^*$ that minimizes the weighted sum of the loss function and the complexity measure. Formally:

$$p^*=\mathsf{argmin}_{p\in G_d}\big(\mathcal{L}(p,\mathcal{D})+\lambda\cdot C(p)\big)$$

where $G_d$ is the set of programs in DSL $G$ with size less than the threshold $d$. By using the objective function $U(l,c)=l+\lambda\cdot c$, we can use Algorithm 1 to synthesize the program $p^*$ that minimizes the objective function given above.

7.2. Best Program for Given Accuracy

Given a DSL $G$, a data set $\mathcal{D}$, loss function $\mathcal{L}$, complexity measure $C$, and bound $b$, we wish to find a program $p^*$ that minimizes the complexity measure $C$ but has loss at most $b$. Formally: $p^*=\mathsf{argmin}_{p\in G_d}C(p)$ s.t. $\mathcal{L}(p,\mathcal{D})\leq b$. Note that this condition can be rewritten as

$$p^*=\mathsf{argmin}_{p\in\mathcal{A}'}C(p)$$

where $\mathcal{A}'=\mathcal{A}(\mathcal{D},L)\downarrow_b$.

By the definition of $\downarrow_b$, all accepting states of $\mathcal{A}'$ have weight at most $b$. Therefore all programs accepted by $\mathcal{A}'$ have loss at most $b$ (i.e., $\mathcal{L}(p,\mathcal{D})\leq b$). Also note that if a program $p$ is not in $\mathcal{A}'$, then either it has loss greater than $b$ or it is not within the threshold $d$.

7.3. Forced Accuracy

Given a DSL $G$, a data set $\mathcal{D}$, a subset $\mathcal{D}'\subseteq\mathcal{D}$, loss function $\mathcal{L}$, complexity measure $C$, and objective function $U$, we wish to find a program $p^*$ that minimizes the objective function subject to an added constraint of bounded loss over the data set $\mathcal{D}'$. Formally:

$$p^*=\mathsf{argmin}_{p\in G_d}U(\mathcal{L}(p,\mathcal{D}),C(p))\ \text{ s.t. }\ \mathcal{L}(p,\mathcal{D}')\leq b$$

We first construct an SFTA $\mathcal{A}(\mathcal{D}',L)\downarrow_b$, which contains exactly the programs with loss at most $b$ over the data set $\mathcal{D}'$. After constructing $\mathcal{A}(\mathcal{D},L)$ as in Algorithm 1, we replace $\mathcal{A}(\mathcal{D},L)$ with its $/$ intersection with $\mathcal{A}(\mathcal{D}',L)\downarrow_b$ (after dropping the weights of the latter's accepting states), i.e., $\mathcal{A}(\mathcal{D},L)\leftarrow\mathcal{A}(\mathcal{D},L)/\mathcal{A}(\mathcal{D}',L)\downarrow_b$, and then continue as in Algorithm 1. By the definition of $/$ intersection and $\downarrow_b$, the loss over $\mathcal{D}'$ of every program returned by the modified algorithm is at most $b$.

8. Use Cases

Definition 8.1 (Bayesian Program Synthesis).

Given a set of input-output examples $\mathcal{D}=\{(\sigma_i,o_i)\mid i=1,\ldots,n\}$, a DSL grammar $G$, and a probability distribution $\pi$, $p^*$ is the solution to the Bayesian program synthesis problem if $p^*$ is the most probable program in DSL $G$ given the data set $\mathcal{D}$. Formally, $p^*=\mathsf{argmax}_{p\in G}\pi(p\mid\mathcal{D})$.

By Bayes' rule, $p^*=\mathsf{argmax}_{p\in G}\pi(\mathcal{D}\mid p)\,\pi(p)$, so

$$p^*=\mathsf{argmax}_{p\in G}\big[\log\pi(\mathcal{D}\mid p)+\log\pi(p)\big]$$

Assuming independence of observations:

$$p^*=\mathsf{argmax}_{p\in G_d}\Big[\Big(\sum_{(\sigma_i,o_i)\in\mathcal{D}}\log\pi(o_i\mid\llbracket p\rrbracket\sigma_i)\Big)+\log\pi(p)\Big]$$

where $\pi(o_i\mid\llbracket p\rrbracket\sigma_i)$ denotes the probability of observing output $o_i$ in the data set $\mathcal{D}$ given a program $p$. With complexity measure $C(p)=-\log\pi(p)$, per-example loss function $L(o_i,\llbracket p\rrbracket\sigma_i)=-\log\pi(o_i\mid\llbracket p\rrbracket\sigma_i)$ for example $(\sigma_i,o_i)$, and the loss function

$$\mathcal{L}(p,\mathcal{D})=-\sum_{(\sigma_i,o_i)\in\mathcal{D}}\log\pi(o_i\mid\llbracket p\rrbracket\sigma_i)$$

the technique in Section 7.1 (Algorithm 1 with $\lambda=1$) synthesizes the most probable program $p^*$: maximizing the log posterior is equivalent to minimizing the summed negative log likelihood plus the complexity measure.
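A sketch of how the Bayesian formulation plugs into this framework: the per-example loss is the negative log likelihood of the observed output under an assumed noise model $\pi$ (the $\varepsilon/K$ noise model below is hypothetical, purely for illustration):

```python
# Illustrative sketch: Bayesian synthesis as loss minimization. Under an
# i.i.d. noise model, maximizing the log posterior equals minimizing the
# summed negative log likelihood plus the complexity measure -log pi(p),
# i.e. Algorithm 1 with U(l, c) = l + c.

import math

def nll_loss(pi):
    """Per-example loss L(o, c) = -log pi(o | c)."""
    return lambda o, c: -math.log(pi(o, c))

# A hypothetical discrete noise model: an output survives uncorrupted with
# probability 1 - eps, and is otherwise replaced by one of K corrupted values.
def discrete_noise(eps, K):
    return lambda o, c: (1 - eps) if o == c else eps / K

loss = nll_loss(discrete_noise(0.1, 100))
assert loss(9, 9) < loss(8, 9)   # matching outputs incur smaller loss
```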

At Most $k$ Wrong: Consider a setting in which, given a data set, a random procedure is allowed to corrupt at most $k$ of the input-output examples. Given this noisy data set $\mathcal{D}$, our task is to synthesize the simplest program $p^*$ that is wrong on at most $k$ of the input-output examples. Formally, given data set $\mathcal{D}$, bound $k$, DSL $G$, and a complexity measure $C$:

$$p^*=\mathsf{argmin}_{p\in G}C(p)\ \text{ s.t. }\ \mathcal{L}_{0/1}(p,\mathcal{D})\leq k$$

where $\mathcal{L}_{0/1}$ is the $0/1$ loss function. The best-program-for-given-accuracy framework (Section 7.2) allows us to synthesize $p^*$ subject to a threshold $d$.

9. Experimental Results

String transformations have been extensively studied within the Programming by Example community (Gulwani, 2011; Polozov and Gulwani, 2015; Singh and Gulwani, 2016). We implemented our technique (in 6k lines of Java code) and used it to solve benchmark program synthesis problems from the SyGuS 2018 benchmark suite (SyG, 2018; Alur et al., 2013), which contains a range of string transformation problems.

We use the DSL from (Wang et al., 2017a) (Figure 6) with the size complexity measure $\mathrm{Size}(p)$. The DSL supports extracting and concatenating ($\mathsf{Concat}$) substrings of the input string $x$; each substring is extracted using the $\mathsf{SubStr}$ function with a start and end position. A position can either be a constant index ($\mathsf{ConstPos}$) or the start or end of the $k^{th}$ occurrence of the match token $\tau$ in the input string ($\mathsf{Pos}$).

$$\begin{array}{rcl}
\text{String expr}~~~~e&:=&\mathsf{Str}(f)~|~\mathsf{Concat}(f,e);\\
\text{Substring expr}~~~~f&:=&\mathsf{ConstStr}(s)~|~\mathsf{SubStr}(x,p_1,p_2);\\
\text{Position}~~~~p&:=&\mathsf{Pos}(x,\tau,k,d)~|~\mathsf{ConstPos}(k);\\
\text{Direction}~~~~d&:=&\mathsf{Start}~|~\mathsf{End};
\end{array}$$

Figure 6. DSL for string transformations; $\tau$ is a token, $k$ is an integer, and $s$ is a string constant

9.1. Implementation Optimizations

Instead of computing individual SFTAs for each input-output example and then combining them via $+$ intersections to obtain the final SFTA, our implementation computes the final SFTA directly, working over the full data set. The implementation also applies two techniques that constrain the size of the final SFTA. First, it bounds the number of recursive applications of the production $e:=\mathsf{Concat}(f,e)$ by applying a bounded scope height threshold $d$. Second, during construction of the SFTA, a state with symbol $e$ is only added to the SFTA if the length of the state's output value is not greater than the length of the output string plus one.

9.2. Scalability

Threshold                  d = 1                d = 2                d = 3                d = 4
Benchmark Name             time(s)  SFTA size   time(s)  SFTA size   time(s)  SFTA size   time(s)  SFTA size
(SFTA sizes in thousands of states)
bikes 0.16 1.08 0.73 10.56 4.72 56.4 19.83 145.8
bikes-long 0.21 1.02 1.37 9.42 6.04 49.9 26.99 139.35
bikes-long-repeat 0.18 1.02 1.06 9.42 6.03 49.9 27.47 139.35
bikes-short 0.15 1.08 0.79 10.56 3.98 56.4 18.62 145.8
dr-name X X 7.54 107.5 107.18 1547.2 - -
dr-name-long X X 17.4 70.28 300.9 1077.6 - -
dr-name-long-repeat X X 19.15 70.28 301.3 1077.6 - -
dr-name-short X X 10.2 107.5 101.5 154.8 - -
firstname 0.28 1.02 1.46 4.34 4.024 4.33 3.97 4.34
firstname-long 1.72 1.04 12.03 4.36 39.08 4.36 41.22 4.36
firstname-long-repeat 1.64 1.04 13.96 4.36 42.4 4.36 43.1 4.36
firstname-short 0.26 1.02 1.47 4.37 3.93 4.34 3.9 4.34
initials X X X X 8.7 42.3 30.4 42.34
initials-long X X X X 86.44 42.36 376.56 42.36
initials-long-repeat X X X X 86.23 42.36 386.25 42.36
initials-short X X X X 8.92 42.34 31.72 42.34
lastname 0.43 2.56 4.78 28.3 27.29 208.35 159.41 741.44
lastname-long 1.93 1.37 15.1 11.34 112.04 50.81 485.98 50.8
lastname-long-repeat 1.85 1.37 18.35 11.34 113.36 50.81 486.35 50.8
lastname-short 0.6 2.56 3.07 28.3 28.3 208.35 160.54 741.44
name-combine X X 8.49 269.9 224.074 7485.83 - -
name-combine-long X X 32.28 161.54 - - - -
name-combine-long-repeat X X 98.46 299 - - - -
name-combine-short X X 6.5 269.9 207.86 7485.83 - -
name-combine-2 X X X X 63.490 855.34 - -
name-combine-2-long X X X X 591.6 851.44 - -
name-combine-2-long-repeat X X X X 592.0 851.44 - -
name-combine-2-short X X X X 57.26 855.34 - -
name-combine-3 X X X X 43.082 911.53 527.29 8104.7
name-combine-3-long X X X X 193.42 649.13 - -
name-combine-3-long-repeat X X X X 192.81 649.13 - -
name-combine-3-short X X X X 42.266 911.53 526.13 8104.7
reverse-name X X 6.9 269.9 217.19 7495.9 - -
reverse-name-long X X 29.55 161.53 - - - -
reverse-name-long-repeat X X 27.6 161.53 - - - -
reverse-name-short X X 6.84 269.9 228.24 7485.8 - -
phone 0.12 0.46 0.47 1.58 0.87 1.58 0.78 1.58
phone-long 0.8 0.46 3.9 1.58 7.79 1.58 32.79 1.58
phone-long-repeat 0.69 0.46 3.29 1.58 7.76 1.58 43.24 1.58
phone-short 0.12 0.46 0.37 1.58 0.804 1.578 4.97 1.58
phone-1 0.15 0.46 0.44 1.58 0.84 1.58 3.017 1.58
phone-1-long 0.99 0.46 3.8 1.58 8.23 1.58 16.58 1.58
phone-1-long-repeat 0.90 0.46 4.1 1.58 8.42 1.58 17.5 1.58
phone-1-short 0.14 0.46 0.45 1.58 0.8 1.58 1.5 1.58
phone-2 0.13 0.46 0.44 1.58 0.83 1.58 3.176 1.58
phone-2-long 0.64 0.46 2.84 1.58 8.36 1.58 16 1.58
phone-2-long-repeat 0.85 0.46 3.8 1.58 9.83 1.58 17.55 1.58
phone-2-short 0.09 0.46 0.47 1.58 0.83 1.58 2.78 1.58
phone-5 0.18 0.23 0.16 0.23 0.11 0.23 0.7 0.23
phone-5-long 1.24 0.23 0.94 0.23 0.75 0.23 4.2 0.23
phone-5-long-repeat 1.27 0.23 1.19 0.23 0.77 0.23 2.96 0.23
phone-5-short 0.17 0.23 0.17 0.23 0.11 0.23 0.9 0.23
phone-6 0.27 0.64 1.38 2.6 2.67 2.61 9.3 2.61
phone-6-long 1.84 0.64 6.48 2.6 24.66 2.61 103.3 2.61
phone-6-long-repeat 2.16 0.64 7.12 2.6 24.69 2.61 143.9 2.61
phone-6-short 0.28 0.64 0.76 2.6 2.27 2.61 11.19 2.61
phone-7 0.24 0.64 1.04 2.6 2.87 2.61 11.141 2.61
phone-7-long 2.6 0.64 7.8 2.6 26.1 2.61 108.1 2.61
phone-7-long-repeat 2.58 0.64 6.68 2.6 26.15 2.61 115.42 2.61
phone-7-short 0.23 0.64 1.13 2.6 3.26 2.61 10.71 2.61
phone-8 0.23 0.64 1 2.6 2.65 2.61 8.51 2.61
phone-8-long 2.33 0.64 7.58 2.6 25.87 2.61 114.54 2.61
phone-8-long-repeat 1.67 0.64 7.7 2.6 25.45 2.61 128.3 2.61
phone-8-short 0.27 0.64 0.97 2.6 2.45 2.61 13.81 2.61
Figure 7. Runtimes and SFTA sizes for selected SyGuS 2018 benchmarks

We evaluate the scalability of our implementation by applying it to all problems in the SyGuS 2018 benchmark suite (SyG, 2018). For each problem we use the clean (noise-free) data set provided with the benchmark suite. We use the lexicographic objective function $U_L(l,c)$ with the $0/\infty$ loss function and the $c=\mathrm{Size}(p)$ complexity measure. We run each benchmark with bounded scope height threshold $d=1$, 2, 3, and 4 and record the running time on that benchmark problem and the number of states in the SFTA. A state with symbol $e$ is only added to the SFTA if the length of its output value is not greater than the length of the output string.

Because the running time of our technique does not depend on the specific objective function (except for the time required to evaluate the objective function, which is typically negligible, and except for search space pruning techniques appropriate for specific combinations of objective functions and DSLs), we anticipate that these results will generalize to other objective functions. All experiments are run on a 3.00 GHz Intel(R) Xeon(R) CPU E5-2690 v2 machine with 512 GB memory running Linux 4.15.0. With a timeout limit of 10 minutes and a bounded scope height threshold of 4, the implementation is able to solve 64 of the 108 SyGuS 2018 benchmark problems. For the remaining benchmark problems, a correct program does not exist within the DSL at bounded scope height threshold 4.

Figure 7 presents results for the remaining SyGuS 2018 benchmarks. We omit all name-combine-4-*, phone-3-*, phone-4-*, phone-9-*, phone-10-*, and univ-* benchmarks: all runs for these benchmarks either do not synthesize a correct program or do not terminate. Our synthesis technique removes all duplicates from the data set in the case of $0/\infty$ loss. There is a row for each benchmark problem. The first column presents the name of the benchmark. The next four columns present results for the technique running with bounded scope height threshold $d=1$, 2, 3, and 4, respectively. Each column has two subcolumns: the first presents the running time on that benchmark problem (in seconds); the second presents the number of states in the SFTA (in thousands of states). An entry X indicates that the implementation terminated but did not synthesize a correct program that agreed with all provided input-output examples. An entry - indicates that the implementation did not terminate.

In general, both the running times and the number of states in the SFTA increase as the number of provided input-output examples and/or the bounded height threshold increases. The SFTA size sometimes stops increasing as the height threshold increases. We attribute this phenomenon to a search space pruning technique that terminates the recursive application of the production $e:=\mathsf{Concat}(f,e)$ when the generated string becomes longer than the current output string; in this case any resulting synthesized program would produce an output that does not match the output in the data set.

We compare with a previous technique that uses FTAs to solve program synthesis problems (Wang et al., 2017b). This previous technique requires clean data and only synthesizes programs that agree with all input-output examples in the data set. Our technique builds SFTAs with similar structure, with additional overhead coming from the evaluation of the objective function. We obtained the implementation of the technique presented in (Wang et al., 2017b) and ran this implementation on all benchmarks in the SyGuS 2018 benchmark suite. The running times of our implementation and this previous implementation are comparable.

Threshold                                  d = 1                          d = 2                          d = 3                          d = 4
Benchmark Name   n   output size   time(s) SFTA size DL loss size   time(s) SFTA size DL loss size   time(s) SFTA size DL loss size   time(s) SFTA size DL loss size
bikes 6 33 0.18 14.28 0 8 1.04 191.6 0 8 7.0 1532.67 0 8 49.06 6655.88 0 8
bikes-long 24 136 0.37 135.42 0 8 2.85 1695.02 0 8 25.45 13114.22 0 8 158.04 57398.12 0 8
bikes-long-repeat 58 325 0.68 135.46 0 8 5.81 1695.06 0 8 55.19 13114.26 0 8 364.4 57398.16 0 8
bikes-short 6 33 0.13 14.28 0 8 1.11 191.6 0 8 7.34 1532.67 0 8 54.77 6655.88 0 8
dr-name 4 36 0.49 63.76 4 11 8.66 1693.15 0 14 194.82 29685.59 0 14 - - - -
dr-name-long 50 515 1.24 356.55 50 11 22.5 9844.45 0 14 - - - - - - - -
dr-name-long-repeat 150 1545 2.34 3565.15 150 11 59.33 98444.15 0 14 - - - - - - - -
dr-name-short 4 36 0.55 63.76 4 11 8.53 1693.15 0 14 302.33 29685.59 0 14 - - - -
firstname 4 20 0.26 15.95 0 8 1.67 133.16 0 8 26.0 595.77 0 8 49.09 595.77 0 8
firstname-long 54 335 1.47 161.35 0 8 16.14 1333.45 0 8 251.45 5959.55 0 8 547.8 5959.55 0 8
firstname-long-repeat 204 1280 3.64 1613.2 0 8 47.74 13334.2 0 8 - - - - - - - -
firstname-short 4 20 0.23 15.95 0 8 1.61 133.16 0 8 27.41 595.77 0 8 45.07 595.77 0 8
initials 4 16 0.23 15.97 8 6 1.64 168.58 4 12 31.26 1450.97 0 22 117.93 5913.66 0 22
initials-long 54 216 1.23 143.25 108 6 14.83 1669.35 54 12 370.59 14493.25 0 22 - - - -
initials-long-repeat 204 816 3.14 1432.2 408 6 48.23 16693.2 204 12 - - - - - - - -
initials-short 4 16 0.22 15.97 8 6 1.61 168.58 4 12 19.0 1450.97 0 22 121.17 5913.66 0 22
lastname 4 30 0.42 38.48 0 10 4.45 591.56 0 10 59.1 5717.2 0 10 432.69 34502.07 0 10
lastname-long 54 356 1.48 186.05 0 10 18.5 2316.35 0 10 249.64 19128.05 0 10 - - - -
lastname-long-repeat 204 1334 3.69 1860.2 0 10 60.44 23163.2 0 10 - - - - - - - -
lastname-short 4 30 0.44 38.48 0 10 4.25 591.56 0 10 57.87 5717.2 0 10 481.84 34502.07 0 10
name-combine 6 81 0.44 69.97 6 8 9.55 3263.99 0 21 381.69 102428.84 0 21 - - - -
name-combine-long 50 691 1.53 546.75 50 8 41.16 20330.85 0 21 - - - - - - - -
name-combine-long-repeat 204 2818 8.85 9717.2 204 8 464.71 392615.2 0 21 - - - - - - - -
name-combine-short 6 81 0.45 69.97 6 8 10.23 3263.99 0 21 413.17 102428.84 0 21 - - - -
name-combine-2 4 32 0.5 49.25 4 11 6.43 1124.6 4 11 113.3 17414.8 0 24 - - - -
name-combine-2-long 54 497 2.03 407.45 54 11 47.53 9703.35 54 11 - - - - - - - -
name-combine-2-long-repeat 204 1892 5.35 4074.2 204 11 161.69 97033.2 204 11 - - - - - - - -
name-combine-2-short 4 32 0.51 49.25 4 11 6.5 1124.6 4 11 111.78 17414.8 0 24 - - - -
name-combine-3 6 56 0.33 38.94 12 13 4.12 984.52 6 16 78.62 16834.93 0 22 - - - -
name-combine-3-long 50 476 1.2 288.15 100 13 18.06 6511.15 50 16 325.26 110276.25 0 22 - - - -
name-combine-3-long-repeat 200 1904 2.53 2881.2 400 13 59.13 65111.2 200 16 - - - - - - - -
name-combine-3-short 6 56 0.34 38.94 12 13 3.98 984.52 6 16 74.39 16834.93 0 22 - - - -
name-combine-4 5 49 0.34 52.26 15 13 5.1 1679.88 10 16 139.73 36808.22 5 19 - - - -
name-combine-4-long 50 526 1.39 362.65 150 13 22.88 9825.05 100 16 533.41 201894.15 50 19 - - - -
name-combine-4-long-repeat 200 2104 3.15 3626.2 600 13 77.55 98250.2 400 16 - - - - - - - -
name-combine-4-short 5 49 0.36 52.26 15 13 4.92 1679.88 10 16 139.69 36808.22 5 19 - - - -
phone 6 18 0.13 7.35 0 6 0.55 48.79 0 6 2.37 162.2 0 6 6.16 162.2 0 6
phone-long 100 300 0.71 734.1 0 6 4.06 4878.1 0 6 26.93 16219.1 0 6 75.58 16219.1 0 6
phone-long-repeat 400 1200 1.51 734.4 0 6 13.93 4878.4 0 6 85.42 16219.4 0 6 251.59 16219.4 0 6
phone-short 6 18 0.1 7.35 0 6 0.55 48.79 0 6 2.31 162.2 0 6 5.47 162.2 0 6
phone-1 6 18 0.13 7.35 0 6 0.57 48.79 0 6 2.37 162.2 0 6 5.86 162.2 0 6
phone-1-long 100 300 0.76 734.1 0 6 4.22 4878.1 0 6 26.69 16219.1 0 6 76.31 16219.1 0 6
phone-1-long-repeat 400 1200 1.52 734.4 0 6 14.18 4878.4 0 6 86.62 16219.4 0 6 260.76 16219.4 0 6
phone-1-short 6 18 0.1 7.35 0 6 0.55 48.79 0 6 2.49 162.2 0 6 6.3 162.2 0 6
phone-2 6 18 0.12 7.35 0 8 0.57 48.79 0 8 2.16 162.2 0 8 6.37 162.2 0 8
phone-2-long 100 300 0.69 734.1 0 8 4.09 4878.1 0 8 26.24 16219.1 0 8 69.53 16219.1 0 8
phone-2-long-repeat 400 1200 1.55 734.4 0 8 17.56 4878.4 0 8 86.94 16219.4 0 8 242.99 16219.4 0 8
phone-2-short 6 18 0.12 7.35 0 8 0.56 48.79 0 8 2.05 162.2 0 8 5.78 162.2 0 8
phone-3 7 91 0.39 42.06 14 11 5.13 1826.9 14 11 202.67 59912.06 7 20 - - - -
phone-3-long 100 1300 1.54 4205.1 200 11 51.3 182689.1 200 11 - - - - - - - -
phone-3-long-repeat 400 5200 4.28 4205.4 800 11 182.03 182689.4 800 11 - - - - - - - -
phone-3-short 7 91 0.35 42.06 14 11 5.03 1826.9 14 11 201.6 59912.06 7 20 - - - -
phone-4 6 66 0.26 39.66 12 8 4.11 1498.16 6 17 126.55 42781.51 6 17 - - - -
phone-4-long 100 1100 1.44 3965.1 200 8 43.97 149815.1 100 17 - - - - - - - -
phone-4-long-repeat 400 4400 3.85 3965.4 800 8 152.39 149815.4 400 17 - - - - - - - -
phone-4-short 6 66 0.29 39.66 12 8 4.17 1498.16 6 17 125.07 42781.51 6 17 - - - -
phone-5 7 15 0.15 5.57 0 8 0.64 5.57 0 8 0.67 5.57 0 8 0.65 5.57 0 8
phone-5-long 100 240 1.05 556.1 0 8 4.77 556.1 0 8 5.28 556.1 0 8 4.27 556.1 0 8
phone-5-long-repeat 400 960 2.58 556.4 0 8 14.38 556.4 0 8 16.45 556.4 0 8 14.09 556.4 0 8
phone-5-short 7 15 0.15 5.57 0 8 0.63 5.57 0 8 0.62 5.57 0 8 0.67 5.57 0 8
phone-6 7 21 0.25 12.85 0 10 1.39 72.62 0 10 7.53 315.42 0 10 22.65 315.42 0 10
phone-6-long 100 300 1.44 1284.1 0 10 14.09 7261.1 0 10 72.39 31541.1 0 10 313.96 31541.1 0 10
phone-6-long-repeat 400 1200 3.98 1284.4 0 10 42.03 7261.4 0 10 260.06 31541.4 0 10 - - - -
phone-6-short 7 21 0.3 12.85 0 10 1.39 72.62 0 10 7.63 315.42 0 10 24.87 315.42 0 10
phone-7 7 21 0.22 12.85 0 10 1.34 72.62 0 10 6.87 315.42 0 10 24.32 315.42 0 10
phone-7-long 100 300 1.5 1284.1 0 10 12.99 7261.1 0 10 72.47 31541.1 0 10 304.19 31541.1 0 10
phone-7-long-repeat 400 1200 4.25 1284.4 0 10 44.31 7261.4 0 10 268.87 31541.4 0 10 - - - -
phone-7-short 7 21 0.22 12.85 0 10 1.41 72.62 0 10 7.2 315.42 0 10 22.27 315.42 0 10
phone-8 7 21 0.22 12.85 0 10 1.33 72.62 0 10 7.5 315.42 0 10 23.33 315.42 0 10
phone-8-long 100 300 1.45 1284.1 0 10 14.08 7261.1 0 10 82.64 31541.1 0 10 327.2 31541.1 0 10
phone-8-long-repeat 400 1200 3.96 1284.4 0 10 46.61 7261.4 0 10 266.52 31541.4 0 10 - - - -
phone-8-short 7 21 0.22 12.85 0 10 1.4 72.62 0 10 6.83 315.42 0 10 25.22 315.42 0 10
phone-9 7 99 0.86 163.06 21 8 43.75 12299.47 14 21 - - - - - - - -
phone-9-long 100 1440 5.19 16305.1 300 8 555.42 1229985.1 200 21 - - - - - - - -
phone-9-long-repeat 400 5760 17.01 16305.4 1200 8 - - - - - - - - - - - -
phone-9-short 7 99 0.9 163.06 21 8 41.61 12299.47 14 21 - - - - - - - -
phone-10 7 120 1.18 187.59 21 8 65.53 18600.06 14 21 - - - - - - - -
phone-10-long 100 1740 6.82 18758.1 300 8 - - - - - - - - - - - -
phone-10-long-repeat 400 6960 21.71 18758.4 1200 8 - - - - - - - - - - - -
phone-10-short 7 120 0.98 187.59 21 8 65.15 18600.06 14 21 - - - - - - - -
reverse-name 6 81 0.44 69.97 6 18 9.12 3263.99 0 21 - - - - - - - -
reverse-name-long 50 691 1.45 546.75 50 18 43.63 20330.85 0 21 - - - - - - - -
reverse-name-long-repeat 200 2764 4.07 5467.2 200 18 150.97 203308.2 0 21 - - - - - - - -
reverse-name-short 6 81 0.55 69.97 6 18 10.1 3263.99 0 21 370.83 102428.84 0 21 - - - -
univ-1 6 258 147.5 11954.27 12 8 - - - - - - - - - - - -
univ-1-long 20 699 98.37 22386.12 40 8 - - - - - - - - - - - -
univ-1-long-repeat 30 1000 132.04 22386.13 60 8 - - - - - - - - - - - -
univ-1-short 6 258 170.89 11954.27 12 8 - - - - - - - - - - - -
univ-2 6 243 99.43 4954.21 17 18 - - - - - - - - - - - -
univ-2-long 20 744 115.13 27793.82 65 18 - - - - - - - - - - - -
univ-2-long-repeat 30 1075 145.75 27793.83 98 18 - - - - - - - - - - - -
univ-2-short 6 243 106.17 4954.21 17 18 - - - - - - - - - - - -
univ-3 6 122 22.53 1134.63 5 20 - - - - - - - - - - - -
univ-3-long 20 378 30.38 4930.22 25 20 - - - - - - - - - - - -
univ-3-long-repeat 30 570 37.74 4930.23 45 20 - - - - - - - - - - - -
univ-3-short 6 122 25.99 1134.63 5 20 - - - - - - - - - - - -
univ-4 8 150 22.75 842.34 18 20 - - - - - - - - - - - -
univ-4-long 20 366 27.1 4510.42 39 20 - - - - - - - - - - - -
univ-4-long-repeat 30 552 34.4 4510.43 63 20 - - - - - - - - - - - -
univ-4-short 8 150 25.79 842.34 18 20 - - - - - - - - - - - -
univ-5 8 150 25.14 1005.5 18 20 - - - - - - - - - - - -
univ-5-long 20 366 32.39 5708.62 39 20 - - - - - - - - - - - -
univ-5-long-repeat 30 552 41.96 5708.63 63 20 - - - - - - - - - - - -
univ-5-short 8 150 31.34 1005.5 18 20 - - - - - - - - - - - -
univ-6 8 150 38.49 1171.16 18 20 - - - - - - - - - - - -
univ-6-long 20 366 36.38 6896.62 39 20 - - - - - - - - - - - -
univ-6-long-repeat 30 552 53.14 6896.63 63 20 - - - - - - - - - - - -
univ-6-short 8 150 35.31 1171.16 18 20 - - - - - - - - - - - -
Figure 8. Running times, SFTA sizes, synthesized program DL loss, and synthesized program size for the SyGuS 2018 benchmarks under the DL loss.
Benchmark        Data Set Size   Number of Required Correct Examples
                                 1-Delete       DL      0/1
bikes                  6             0           0       2
dr-name                4             0           0       1
firstname              4             0           0       2
lastname               4             0           2       4
initials               4             0           2       2
reverse-name           6             0           0       2
name-combine           4             0           0       2
name-combine-2         4             0           0       2
name-combine-3         4             0           0       2
phone                  6             0           2       3
phone-1                6             0           3       3
phone-2                6             0           2       3
phone-5                7             0           2       3
phone-6                7             0           2       3
phone-7                7             0           2       3
phone-8                7             0           0       1
Figure 9. Minimum number of correct examples required to synthesize a correct program.

9.3. Noisy Data Sets, Character Deletion

We next present results for our implementation running on small data sets (a few input-output examples each) with character deletions. We use a noise source that cyclically deletes a single character from each output in the data set in turn, starting with the first character, proceeding through successive output positions, then wrapping around to the first character again. We apply this noise source to corrupt every output in the data set. To construct a noisy data set with $k$ correct (uncorrupted) outputs, we do not apply the noise source to the last $k$ outputs in the data set.
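The following Python sketch reconstructs this noise source under our reading of the description above (names and details are illustrative, not taken from our experimental harness):

```python
def cyclic_delete(outputs, k):
    """Corrupt all but the last k outputs by deleting one character from
    each, choosing deletion positions cyclically: position 0 in the first
    corrupted output, position 1 in the next, and so on, wrapping around
    when the position counter exceeds an output's length."""
    noisy, position = [], 0
    for i, out in enumerate(outputs):
        if i >= len(outputs) - k or not out:
            noisy.append(out)  # leave the last k outputs uncorrupted
        else:
            j = position % len(out)
            noisy.append(out[:j] + out[j + 1:])
            position += 1
    return noisy
```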

We exclude all *-long, *-long-repeat, and *-short benchmarks and all benchmarks that do not terminate within the time limit at height bound 4. For each remaining benchmark we use our implementation and the generated corrupted data sets to determine the minimum number of correct outputs in the corrupted data set required for the implementation to produce a correct program, i.e., one that matches the original hidden clean data set on all input-output examples. We consider three loss functions: the $0/1$ and DL loss functions (Section 3) and the following 1-Delete loss function, which is designed to work with noise sources that delete a single character from the output stream:

Definition (1-Delete Loss Function). The 1-Delete loss function $\mathcal{L}_{1D}(p, \mathcal{D})$ uses a per-example loss function $L_{1D}$ that is 0 if the output from the synthesized program exactly matches the output in the data set, 1 if deleting a single character from the output of the synthesized program makes it match the output in the data set, and $\infty$ otherwise:

$$\mathcal{L}_{1D}(p, \mathcal{D}) = \sum_{(\sigma_i, o_i) \in \mathcal{D}} L_{1D}(\llbracket p \rrbracket \sigma_i, o_i), \text{ where}$$
$$L_{1D}(o_1, o_2) = \begin{cases} 0 & o_1 = o_2 \\ 1 & a \cdot x \cdot b = o_1 \,\wedge\, a \cdot b = o_2 \,\wedge\, |x| = 1 \\ \infty & \text{otherwise} \end{cases}$$
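A direct Python transcription of this per-example loss (a sketch; the function name is ours):

```python
def one_delete_loss(program_output: str, data_set_output: str) -> float:
    """Per-example 1-Delete loss L_1D: 0 on an exact match, 1 if deleting
    exactly one character from the program's output yields the data set
    output, and infinity otherwise."""
    if program_output == data_set_output:
        return 0
    if len(program_output) == len(data_set_output) + 1:
        for i in range(len(program_output)):
            if program_output[:i] + program_output[i + 1:] == data_set_output:
                return 1
    return float("inf")
```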

We use the lexicographic objective function $U_L(l, c)$ with complexity measure $c = \mathrm{Size}(p)$ and bounded scope height threshold $d = 4$. We apply a search space pruning technique that terminates the recursive application of the production $e := \mathsf{Concat}(f, e)$ when the generated string becomes more than one character longer than the current output string.
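The lexicographic objective minimizes loss first and breaks ties by program size, which in code amounts to comparing tuples (a sketch; names are ours):

```python
def lexicographic_objective(loss: float, size: int) -> tuple:
    """U_L(l, c): order candidates by (loss, size), so loss dominates and
    program size breaks ties among minimal-loss programs."""
    return (loss, size)

# Example: among these (loss, size) candidates, the smallest-loss,
# then smallest-size, candidate wins.
best = min([(0, 14), (0, 8), (1, 3)], key=lambda lc: lexicographic_objective(*lc))
assert best == (0, 8)
```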

Figure 9 summarizes the results. The Data Set Size column presents the total number of input-output examples in the corrupted data set. The next three columns (1-Delete, DL, and 0/1) present the minimum number of correct (uncorrupted) input-output examples required for the technique to synthesize a correct program (one that agrees with the original hidden clean data set on all input-output examples) using the 1-Delete, DL, and 0/1 loss functions, respectively.

With the 1-Delete loss function, the minimum number of required correct input-output examples is always 0 — the implementation synthesizes, for every benchmark problem, a correct program that matches every input-output example in the original clean data set even when given a data set in which every output is corrupted. This result highlights how 1) discrete noise sources produce noisy outputs that leave a significant amount of information from the original uncorrupted output available in the corrupted output and 2) a loss function that matches the noise source can enable the synthesis technique to exploit this information to produce correct programs even in the face of substantial noise.

With the DL loss function, the implementation synthesizes a correct program for 8 of the 16 benchmarks when all outputs in the data set are corrupted. For 7 of the remaining 8 benchmarks the technique requires 2 correct input-output examples to synthesize the correct program. The remaining benchmark requires 3 correct examples. The general pattern is that the technique tends to require correct examples when the output strings are short. The synthesized incorrect programs typically use less of the input string.

These results highlight how the DL loss function still extracts significant useful information from outputs corrupted by discrete noise sources. In comparison with the 1-Delete loss function, however, the DL loss function is not as good a match for the character deletion noise source. As a result, the synthesis technique, when working with the DL loss function, works better with longer outputs, sometimes encounters incorrect programs that fit the corrupted data better, and therefore sometimes requires correct input-output examples to synthesize the correct program.
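For reference, the per-example DL loss is based on the Damerau-Levenshtein edit distance (Damerau, 1964); the definition in Section 3 governs. The following optimal-string-alignment sketch is one standard formulation:

```python
def dl_distance(s1: str, s2: str) -> int:
    """Optimal string alignment variant of the Damerau-Levenshtein
    distance: the minimum number of insertions, deletions, substitutions,
    and adjacent transpositions needed to turn s1 into s2."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```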

With the $0/1$ loss function, the technique always requires at least one, and usually more, correct input-output examples to synthesize the correct program. In contrast to the 1-Delete and DL loss functions, the $0/1$ loss function extracts no information from corrupted outputs. To synthesize a correct program in this scenario, the technique must effectively ignore the corrupted outputs and work only with information from the correct outputs.
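For comparison, the 0/1 per-example loss is simply an exact-equality test (a sketch, consistent with the definition recalled in Section 3):

```python
def zero_one_loss(program_output: str, data_set_output: str) -> int:
    """Per-example 0/1 loss: 0 on an exact match, 1 otherwise. It carries
    no information about how close a corrupted output is to matching."""
    return 0 if program_output == data_set_output else 1
```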

9.4. Noisy Data Sets, Character Replacements

We next present results for our implementation running on larger data sets with character replacements. The phone-*-long-repeat benchmarks in the SyGuS 2018 benchmark suite contain transformations over phone numbers; the data sets for these benchmarks contain 400 input-output examples each, including repeated input-output examples.

For each of the phone-*-long-repeat benchmark problems on which our technique terminates with bounded scope height threshold 4 (Section 9.2), we construct a noisy data set as follows. For each digit in each output string in the data set, we flip a biased coin: with probability $b$, we replace the digit with a uniformly chosen random digit (so each digit in the noisy output differs from the original digit in the uncorrupted output with probability $9/10 \times b$).
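A Python sketch of this noise source (names are illustrative; seeding and other details in our experiments may differ):

```python
import random

def replace_digits(output: str, b: float) -> str:
    """With probability b, replace each digit with a uniformly chosen
    random digit. Because the replacement is drawn from all ten digits,
    a given digit actually changes with probability (9/10) * b."""
    result = []
    for ch in output:
        if ch.isdigit() and random.random() < b:
            result.append(random.choice("0123456789"))
        else:
            result.append(ch)
    return "".join(result)
```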

We then run our implementation on each benchmark problem with the noisy data set, using the tradeoff objective function $U_\lambda(l, c) = l + \lambda \times c$ with complexity measure $c = \mathrm{Size}(p)$ and the following $n$-Substitution loss function:

Definition ($n$-Substitution Loss Function). The $n$-Substitution loss function $\mathcal{L}_{nS}(p, \mathcal{D})$ uses a per-example loss function $L_{nS}$ that counts the number of positions at which the output from the synthesized program disagrees with the noisy output. If the synthesized program produces an output that is longer or shorter than the output in the noisy data set, the per-example loss is $\infty$:

$$\mathcal{L}_{nS}(p, \mathcal{D}) = \sum_{(\sigma_i, o_i) \in \mathcal{D}} L_{nS}(\llbracket p \rrbracket \sigma_i, o_i), \text{ where}$$
$$L_{nS}(o_1, o_2) = \begin{cases} \infty & |o_1| \neq |o_2| \\ \sum_{i=1}^{|o_1|} \mathbb{1}[\, o_1[i] \neq o_2[i] \,] & |o_1| = |o_2| \end{cases}$$
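A Python transcription of this per-example loss together with the tradeoff objective (a sketch; names are ours):

```python
def n_substitution_loss(program_output: str, data_set_output: str) -> float:
    """Per-example n-Substitution loss L_nS: infinity if the lengths
    differ, otherwise the number of disagreeing character positions."""
    if len(program_output) != len(data_set_output):
        return float("inf")
    return sum(1 for a, b in zip(program_output, data_set_output) if a != b)

def tradeoff_objective(loss: float, size: int, lam: float) -> float:
    """Tradeoff objective U_lambda(l, c) = l + lambda * c with the
    complexity measure c = Size(p)."""
    return loss + lam * size
```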

We run the implementation for all combinations of the noise probability $b \in \{0.2, 0.4, 0.6\}$ and $\lambda \in \{0.001, 0.1\}$. For every combination of $b$ and $\lambda$, and for every one of the phone-*-long-repeat benchmarks in the SyGuS 2018 benchmark suite, the implementation synthesizes a correct program that produces the same outputs as the original (hidden) clean data set.

These results highlight, once again, the ability of our technique to work with loss functions that match the characteristics of discrete noise sources to synthesize correct programs even in the face of substantial noise.

Benchmark        Data Set Size   DL Loss   Output Size   Program Size
name-combine-4         5            10          49             16
phone-3                7            14          91             11
phone-4                6             6          66             17
phone-9                7            14          99             21
phone-10               7            14         120             21
Figure 10. Approximate program synthesis with DL loss.

9.5. Approximate Program Synthesis

For the benchmarks in Figure 10, a correct program does not exist within the DSL at bounded scope threshold 2. Figure 10 presents results from our implementation on the clean (noise-free) benchmark data sets with the DL loss function, the $\mathsf{Size}(p)$ complexity measure, the lexicographic objective function $U_L(\mathcal{L}_{DL}(p, \mathcal{D}), \mathrm{Size}(p))$, and bounded scope threshold 2. The first column presents the name of the benchmark. The next four columns present the number of input-output examples in the benchmark data set, the DL loss incurred by the synthesized program over the entire data set, the sum of the lengths of the output strings in the data set (the DL loss for an empty output would be this sum), and the size of the synthesized program.

For the phone-* benchmarks, a correct program outputs the entire input telephone number but changes the punctuation, for example by including the area code in parentheses. The synthesized approximate programs correctly preserve the telephone number but apply only some of the punctuation changes. The result is 2 incorrect characters per output (e.g., 14/7 for phone-3) for all benchmarks except phone-4, which has 1 incorrect character per output (6/6). Each output is between 13 (91/7) and roughly 17 (120/7) characters long. For name-combine-4, the synthesized approximate program correctly extracts the last name and inserts a comma and a period, but does not extract the initial of the first name. These results highlight the ability of our technique to approximate a correct program when no correct program exists in the program search space.

Figure 8 presents results from our implementation on the clean (noise-free) benchmark data sets in SyGuS 2018. The first column presents the name of the benchmark. The next column presents the number of input-output examples in the given benchmark. The next column presents the sum of the lengths of the output strings in the data set. The next four columns present results for the technique running with bounded scope height threshold $d = 1$, $2$, $3$, and $4$, respectively. Each of these columns has four subcolumns: the first presents the running time on that benchmark problem (in seconds); the second presents the number of states in the SFTA (in thousands of states); the third presents the DL loss of the synthesized program over the entire data set (compare this DL loss with the sum of the output lengths over the data set); the fourth presents the size of the synthesized program. An entry - indicates that the implementation did not terminate.

9.6. Discussion

Practical Applicability: The experimental results show that our technique is effective at solving string manipulation program synthesis problems with modestly sized solutions like those present in the SyGuS 2018 benchmarks. More specifically, the results highlight how the combination of structure from the DSL, a discrete noise source that preserves some information even in corrupted outputs, and a good match between the loss function and the noise source can enable very effective synthesis for data sets with only a handful of input-output examples, even in the presence of substantial noise. Even with a loss function as generic as the $0/1$ loss function, the technique is effective at dealing with data sets in which a significant fraction of the outputs are corrupted. We anticipate that these results will generalize to similar classes of program synthesis problems with modestly sized solutions within a tractable and focused class of computations.

We note that our current implementation does not scale to the SyGuS 2018 benchmarks with larger solutions. These benchmarks were designed to test the scalability of current and future program synthesis systems; no currently extant program synthesis system of which we are aware can solve these larger problems.

To the extent that the SyGuS 2018 benchmarks accurately represent the kinds of program synthesis problems that will be encountered in practice, our results provide encouraging evidence that our technique can help program synthesis systems work effectively with noisy data sets. Important future work in this area will more fully investigate interactions between the DSL, the noise source, the loss function, the classes of synthesis problems that occur in practice, and the scalability of the synthesis technique. A full evaluation of the immediate practical applicability of program synthesis for noisy data sets, as well as a meaningful evaluation of program synthesis more generally, awaits this future work.

Noise Sources With Different Characteristics: Our experiments largely consider discrete noise sources that preserve some information in corrupted outputs. The results highlight how loss functions like the 1-Delete, DL, and $n$-Substitution loss functions can enable our technique to extract and exploit this preserved information to enhance the effectiveness of the synthesis. How well would our technique perform with noise sources that leave little or even no information intact in corrupted outputs? Here the results for the $0/1$ loss function, which does not aspire to extract any information from corrupted outputs, may be relevant: if the corrupted outputs considered together do not conform to a target computation in the DSL, the technique will, in effect, ignore these corrupted outputs and synthesize the program based on the remaining uncorrupted outputs. A final possibility is that the noise source may systematically produce outputs characteristic of a valid but incorrect computation. Here we would expect the algorithm to require enough correct outputs to outweigh the corrupted outputs before it could synthesize the correct program.

10. Related Work

The problem of learning programs from a set of input-output examples has been studied extensively (Shaw, 1975; Gulwani, 2011; Singh and Gulwani, 2016). These techniques can be largely broken down into the following categories:

Synthesis Using Solvers: These systems require the user to provide precise semantics for the operators of the DSL they are using (Itzhaky et al., 2010); our technique, in contrast, requires only black-box implementations of these operators. A large class of these systems depends on solvers that do not scale as the number of examples increases, whereas the cost of our tree-automata-based technique increases linearly with the number of examples. These systems require all input-output examples to be correct and synthesize only programs that match all input-output examples.

Enumerative Techniques: These techniques search the space of programs to find a single program that is consistent with the given examples (Feser et al., 2015; Osera and Zdancewic, 2015). Specifically, they enumerate all programs in the given DSL and terminate when they find a correct program, possibly applying heuristics to prune the search space or otherwise speed up the process (Osera and Zdancewic, 2015). These techniques require all input-output examples to be correct and synthesize only programs that match all input-output examples.

VSA-based/Tree Automata-based Techniques: These techniques build complex data structures representing all possible programs compatible with the given examples (Singh and Gulwani, 2016; Polozov and Gulwani, 2015; Wang et al., 2017b). Current work requires users to provide correct input-output examples. Our work modifies these techniques to handle noisy data and to synthesize approximate programs that minimize an objective function over the provided potentially noisy data set.

Neural Program Synthesis/ML Approaches: There is extensive work that uses machine learning/deep neural networks to synthesize programs (Raychev et al., 2016; Devlin et al., 2017; Balog et al., 2016). These techniques require a training phase and a differentiable loss function; our technique requires no training phase and can work with arbitrary loss functions, including loss functions, such as the $0/1$ loss function, that are incompatible with machine learning techniques. These systems also provide no guarantees of the completeness or optimality of their results, whereas our technique, because it explores all programs within the bounded scope, always finds a program within that scope that minimizes the objective function.

Data Set Sampling or Cleaning: There has been recent work which aspires to clean the data set or pick representative examples from the data set for synthesis (Gulwani, 2011; Raychev et al., 2016; Pu et al., 2017), for example by using machine learning or data cleaning to select productive subsets of the data set over which to perform exact synthesis. In contrast to these techniques, our proposed techniques 1) provide deterministic guarantees (as opposed to either probabilistic guarantees as in (Raychev et al., 2016) or no guarantees at all as in (Pu et al., 2017; Gulwani, 2011)), 2) do not require the use of oracles as in (Raychev et al., 2016), 3) can operate successfully even on data sets in which most or even all of the input-output examples are corrupted, and 4) do not require the explicit selection of a subset of the data set to drive the synthesis as in (Gulwani, 2011; Raychev et al., 2016).

Active Learning: Recent research exploits the availability of a reference implementation to use active learning for program synthesis (Shen and Rinard, 2019). Our technique, in contrast, works directly from given input-output examples with no reference implementation.

11. Conclusion

Dealing with noisy data is a pervasive problem in modern computing environments. Previous program synthesis systems require data sets in which all input-output examples are correct and synthesize programs that match all input-output examples in the data set.

We present a new program synthesis technique for working with noisy data and/or performing approximate program synthesis. Using state-weighted finite tree automata, this technique supports the formulation and solution of a variety of new program synthesis problems involving noisy data and/or approximate program synthesis. The results highlight how this technique, by exploiting information from a variety of sources — structure from the underlying DSL, information left intact by discrete noise sources — can deliver effective program synthesis even in the presence of substantial noise.

Acknowledgements.
This research was supported, in part, by the Boeing Corporation and DARPA Grants N6600120C4025 and HR001120C0015.

References

  • SyG (2018) 2018. SyGuS 2018 String Benchmark Suite. https://github.com/SyGuS-Org/benchmarks/tree/master/comp/2019/PBE_SLIA_Track/from_2018. Accessed: 2020-07-18.
  • Alur et al. (2013) Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo MK Martin, Mukund Raghothaman, Sanjit A Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2013. Syntax-guided synthesis. In 2013 Formal Methods in Computer-Aided Design. IEEE, 1–8.
  • Balog et al. (2016) Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. 2016. Deepcoder: Learning to write programs. arXiv preprint arXiv:1611.01989 (2016).
  • Bishop (2006) Christopher M Bishop. 2006. Pattern recognition and machine learning. Springer.
  • Damerau (1964) Fred J. Damerau. 1964. A Technique for Computer Detection and Correction of Spelling Errors. Commun. ACM 7, 3 (March 1964), 171–176. https://doi.org/10.1145/363958.363994
  • Devlin et al. (2017) Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. 2017. Robustfill: Neural program learning under noisy I/O. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. JMLR.org, 990–998.
  • Feng et al. (2017) Yu Feng, Ruben Martins, Jacob Van Geffen, Isil Dillig, and Swarat Chaudhuri. 2017. Component-based synthesis of table consolidation and transformation tasks from examples. In ACM SIGPLAN Notices, Vol. 52. ACM, 422–436.
  • Feser et al. (2015) John K Feser, Swarat Chaudhuri, and Isil Dillig. 2015. Synthesizing data structure transformations from input-output examples. In ACM SIGPLAN Notices, Vol. 50. ACM, 229–239.
  • Gallo et al. (1993) Giorgio Gallo, Giustino Longo, Stefano Pallottino, and Sang Nguyen. 1993. Directed hypergraphs and applications. Discrete applied mathematics 42, 2-3 (1993), 177–201.
  • Gulwani (2011) Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. In ACM Sigplan Notices, Vol. 46. ACM, 317–330.
  • Itzhaky et al. (2010) Shachar Itzhaky, Sumit Gulwani, Neil Immerman, and Mooly Sagiv. 2010. A simple inductive synthesis methodology and its applications. In ACM Sigplan Notices, Vol. 45. ACM, 36–46.
  • Osera and Zdancewic (2015) Peter-Michael Osera and Steve Zdancewic. 2015. Type-and-example-directed program synthesis. ACM SIGPLAN Notices 50, 6 (2015), 619–630.
  • Polozov and Gulwani (2015) Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: a framework for inductive program synthesis. In ACM SIGPLAN Notices, Vol. 50. ACM, 107–126.
  • Pu et al. (2017) Yewen Pu, Zachery Miranda, Armando Solar-Lezama, and Leslie Pack Kaelbling. 2017. Selecting representative examples for program synthesis. arXiv preprint arXiv:1711.03243 (2017).
  • Raychev et al. (2016) Veselin Raychev, Pavol Bielik, Martin Vechev, and Andreas Krause. 2016. Learning programs from noisy data. In ACM SIGPLAN Notices, Vol. 51. ACM, 761–774.
  • Shaw (1975) D Shaw. 1975. Inferring LISP Programs From Examples.
  • Shen and Rinard (2019) Jiasi Shen and Martin C Rinard. 2019. Using active learning to synthesize models of applications that access databases. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. 269–285.
  • Singh and Gulwani (2016) Rishabh Singh and Sumit Gulwani. 2016. Transforming spreadsheet data types using examples. In Acm Sigplan Notices, Vol. 51. ACM, 343–356.
  • Wang et al. (2017a) Xinyu Wang, Isil Dillig, and Rishabh Singh. 2017a. Program synthesis using abstraction refinement. Proceedings of the ACM on Programming Languages 2, POPL (2017), 63.
  • Wang et al. (2017b) Xinyu Wang, Isil Dillig, and Rishabh Singh. 2017b. Synthesis of data completion scripts using finite tree automata. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 62.
  • Yaghmazadeh et al. (2016) Navid Yaghmazadeh, Christian Klinger, Isil Dillig, and Swarat Chaudhuri. 2016. Synthesizing transformations on hierarchically structured data. In ACM SIGPLAN Notices, Vol. 51. ACM, 508–521.