Nick Fitzgerald: Synthesizing Loop-Free Programs with Rust and Z3

Monday, January 13, 2020, 11:00

Automatically finding a program that implements a given specification is called program synthesis. The main difficulty is that the search space is huge: the number of programs of size \(n\) grows exponentially. Naïvely enumerating every program of size \(n\), checking whether each one satisfies the specification, and then moving on to programs of size \(n+1\) and so on doesn’t scale. However, the field has advanced by using smarter search techniques to prune the search space, leveraging performance improvements in SMT solvers, and at times limiting the scope of the problem.

In this post, I’ll explain one approach to modern program synthesis: counterexample-guided iterative synthesis of component-based, loop-free programs, as described in Synthesis of Loop-Free Programs by Gulwani et al. We’ll dissect exactly what each of those terms means, and we’ll also walk through an implementation written in Rust that uses the Z3 solver.

My hopes for this post are two-fold:

  1. I hope that people who are unfamiliar with program synthesis — just like I was not too long ago — get a little less unfamiliar and learn something new about the topic. I’ve tried to provide many examples, and break down the dense logic formulas from the paper into smaller, approachable pieces.

  2. I hope that folks who are already familiar with this kind of program synthesis can help me diagnose some performance issues in the implementation, where I haven’t been able to reproduce the synthesis results reported in the literature. For some of the more difficult benchmark problems, the synthesizer fails to even find a solution before my patience runs out.


Motivation

Why write a program that writes other programs for me? Am I just too lazy to write them myself? Of course I am. However, there are many valid reasons why a person who is not as lazy as I am might want to synthesize programs.

Some programs are quite tricky to write correctly by hand, and a program synthesizer might succeed where you or I might fail. Quick! How do you isolate the rightmost zero bit in a word using only three bit manipulation instructions?!

              ,--- The rightmost zero bit.
              |
              V
Input:  011010011

Output: 000000100
              ^
              |
              '--- Only that bit is set.

Did you get it yet?

Okay, here’s the answer:

isolate_rightmost_zero_bit(x): // x = 011010011
    a <- not x                  // a = 100101100
    b <- add 1, x               // b = 011010100
    c <- and a, b               // c = 000000100
    return c

Our program synthesizer can find a solution in under a second, and it can find that minimal-length solution in a minute or so. It would take me quite a while longer than that to do the same by hand. We’ll return to this problem throughout the rest of this post, and use it as a running example.
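
Just to double-check that trick, here it is transcribed into plain Rust and asserted against the example above. wrapping_add mirrors the wrapping semantics of the bitvector add we’ll be working with later:

fn isolate_rightmost_zero_bit(x: u32) -> u32 {
    let a = !x;                // a <- not x
    let b = x.wrapping_add(1); // b <- add 1, x
    a & b                      // c <- and a, b
}

fn main() {
    // x = 011010011 from the diagram above.
    assert_eq!(isolate_rightmost_zero_bit(0b011010011), 0b000000100);
}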

Another reason to use a program synthesizer might be that we need to write many more programs than we have time to write by hand. Take for example a compiler’s peephole optimizer: it considers a sliding window of instruction sequences, and for each sequence, it checks if it knows of an equivalent-but-faster-or-smaller instruction sequence. When it does know of a better instruction sequence, it replaces the original instructions with the better ones.

Peephole optimizers are typically constructed from pattern-matching rules that identify suboptimal instruction sequences paired with the improved instruction sequence to replace matches with:

new PeepholeOptimizer(
    pattern_0 -> replacement_0
    pattern_1 -> replacement_1
    pattern_2 -> replacement_2
    // ...
    pattern_n -> replacement_n
)

Each replacement_i is a little, optimized mini-program. If we were writing a new peephole optimizer from scratch and by hand, we would have to write \(n\) optimized mini-programs ourselves. And \(n\) can be big: LLVM’s InstCombine peephole optimizer has over 1,000 pattern-and-replacement pairs. Even half that many is way more than I want to write myself.

Instead of writing those optimized mini-programs by hand, we can use each original instruction sequence as a specification, feed it into a program synthesizer, and see if the synthesizer can find the optimal instruction sequence that does the same thing. Finally, we can use all these original instruction sequences and their synthesized, optimal instruction sequences as pattern-and-replacement pairs to automatically construct a peephole optimizer! This idea was first proposed by Bansal et al. in Automatic Generation of Peephole Superoptimizers.

Edit: John Regehr pointed out to me that this idea has been floating around since much earlier than 2006, when the Bansal et al. paper was published. He pointed me to The Design and Application of a Retargetable Peephole Optimizer by Davidson et al. from 1980 as an example, but noted that even this wasn’t the first time it came up.

An Overview of Our Task

Program synthesis is the act of taking a specification, and automatically finding a program that satisfies it. In order to make the problem a little more tractable, we’re limiting its scope in two ways:

  1. Loop-free: We are only synthesizing programs without loops.

  2. Component-based: We are only synthesizing programs that can be expressed as the composition of a given library of components.

The loop-free limitation is not very limiting for many use cases. For example, peephole optimizers often don’t consider instruction sequences that span across loop boundaries.

Component-based synthesis means that rather than synthesizing programs using any combination of any number of the target language’s expressions, the synthesizer is given a library of components and synthesizes programs that use each of those components exactly once. The synthesizer rearranges the components, rewiring their inputs and outputs, until it finds a configuration that satisfies the specification.

That is, given a library of \(N\) components, it constructs a program of the form

synthesized_program(inputs...):
    temp_0 <- component_0(params_0...)
    temp_1 <- component_1(params_1...)
    // ...
    temp_N-1 <- component_N-1(params_N-1...)
    return temp_N-1

where each parameter in params_i is either a temp_j variable defined earlier in the program or one of the original inputs.

For example, given the two components

  • f(a)
  • g(a, b)

and an input parameter x, the synthesizer can construct any of the following candidate programs (implicitly returning the variable defined last):

a <- g(x, x)
b <- f(x)

or

a <- g(x, x)
b <- f(a)

or

a <- f(x)
b <- g(x, x)

or

a <- f(x)
b <- g(a, x)

or

a <- f(x)
b <- g(x, a)

or

a <- f(x)
b <- g(a, a)

That’s it. That’s all of the programs it can possibly construct given just those two components.

The synthesizer cannot construct the following program, because it doesn’t use every component:

a <- f(x)

And the synthesizer cannot construct this program, because it uses the f component more than once:

a <- f(x)
b <- f(a)
c <- g(b, b)

And, finally, it cannot construct this last program, because this last program uses some function h that is not a component in our given library:

a <- f(x)
b <- h(a, x)

The following table describes some of the properties of component-based synthesis by comparing it to the fully general version of program synthesis:

                      General Synthesis              Component-Based Synthesis
Shape of synthesized  Any number of any of the       Only the components in the
programs              target language's expressions  library
Size of synthesized   Varies                         Exactly the size of the library,
programs                                             since each component in the
                                                     library is used exactly once

In our synthesizer, the components will be functions over fixed bit-width integers (also known as “bitvectors” in SMT solver parlance), and each component will correspond to a single instruction in our virtual instruction set: add, and, xor, etc. But in principle they could also be higher-level functions or anything else that we can encode in SMT queries. More on SMT queries later.
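
To make that concrete, here is one way a component could look in Rust. This is my own sketch for illustration, not necessarily the implementation's exact API, and it assumes 32-bit values:

use z3::ast::{Ast, BV};

/// A component knows its arity and can build the Z3 expression relating
/// its inputs to its output. This plays the role of a component spec.
trait Component {
    fn arity(&self) -> usize;
    fn make_output<'ctx>(&self, inputs: &[BV<'ctx>]) -> BV<'ctx>;
}

/// The add-one component: O = bvadd(1, I0).
struct Add1;

impl Component for Add1 {
    fn arity(&self) -> usize {
        1
    }

    fn make_output<'ctx>(&self, inputs: &[BV<'ctx>]) -> BV<'ctx> {
        let one = BV::from_u64(inputs[0].get_ctx(), 1, 32);
        one.bvadd(&inputs[0])
    }
}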

While component-based synthesis makes the synthesis problem easier, it does foist a decision on us each time we invoke the synthesizer: we must choose the library of available components. Each component is used exactly once in the synthesized program, so if we want to synthesize a program that performs multiple additions, we must include multiple instances of the add component in the library. With too few components, the synthesizer might not be able to find a solution; with too many, the synthesizer slows down and can generate non-optimal programs that contain dead code.

To summarize, in component-based synthesis of loop-free programs, our synthesizer’s inputs are

  • a specification, and

  • a library of components.

Its output is a program that satisfies the specification, expressed in terms of the given components, or an error if it can’t find such a program.

Formalizing the Problem

In order to synthesize a program, we need a specification describing the desired program’s behavior. The specification is a logical expression that relates the program’s inputs to its output: it should hold exactly when the output is what we want the program to produce for those inputs.

We define the specification with:

  • \(\vec{I}\) as the program inputs,

  • \(O\) as the program output, and

  • \(\phi_\mathrm{spec}(\vec{I}, O)\) as the expression relating the inputs to the output. This expression should be true when \(O\) is the desired output of running the program on inputs \(\vec{I}\).

The library of components we’re given is a multi-set of specifications describing each component’s behavior. Each component specification comes with how many inputs it takes (e.g. an add(a, b) component takes two inputs, and a not(a) component takes one input) as well as a logical formula relating the component’s inputs to its output. The component inputs, output, and expression all have similar notation to the program specification, but with a subscript:

  • \(\vec{I}_i\) is the \(i^\mathrm{th}\) component’s input variables,

  • \(O_i\) is the \(i^\mathrm{th}\) component’s output variable, and

  • \(\phi_i(\vec{I}_i, O_i)\) is the logical expression relating the \(i^\mathrm{th}\) component’s inputs with its output.

We define \(N\) as the number of components in the library.

For our isolating-the-rightmost-zero-bit example, what is the minimal components library we could give to the synthesizer, while still preserving its ability to find our desired solution? It would be a library consisting of exactly the components that correspond to each of the three instructions in the solution program: a not, an add1, and an and component.

Component                    Definition                            Description
\( \phi_0(I_0, O_0) \)       \( O_0 = \texttt{bvadd}(1, I_0) \)    The add-one operation on bitvectors.
\( \phi_1(I_1, I_2, O_1) \)  \( O_1 = \texttt{bvand}(I_1, I_2) \)  The bitwise and operation on bitvectors.
\( \phi_2(I_3, O_2) \)       \( O_2 = \texttt{bvnot}(I_3) \)       The bitwise not operation on bitvectors.

Program synthesis can be expressed as an exists-forall problem: we want to find whether there exists some program \(P\) that satisfies the specification for all inputs given to it and outputs it returns.

\( \begin{align} & \exists P: \\ & \quad \forall \vec{I},O: \\ & \quad \quad P(\vec{I}) = O \implies \phi_\mathrm{spec}(\vec{I}, O) \end{align} \)

Let’s break that down and translate it into English:

\( \exists P \) There exists some program \(P\), such that
\( \forall \vec{I},O \) for all inputs \(\vec{I}\) and output \(O\),
\( P(\vec{I}) = O \) if we run the program on the inputs \(\vec{I}\) to get the output \(O\),
\( \implies \) then
\( \phi_\mathrm{spec}(\vec{I}, O) \) our specification \(\phi_\mathrm{spec}\) is satisfied.

This exists-forall formalization is important to understand because our eventual implementation will query the SMT solver (Z3 in our case) with pretty much this formula. It won’t be exactly the same:

  • \(P\) is an abstraction that’s hiding some details about components,
  • there are a few algebraic transformations we will perform, and
  • we won’t pose the whole problem to the solver in a single query all at once.

Nonetheless, the implementation follows from this formalization, and we won’t get far if we don’t have a handle on this.

A Brief Introduction to SMT Solvers

We can’t continue any further without briefly discussing SMT solvers and their capabilities. SMT solvers like Z3 take a logical formula, potentially containing unbound variables, and return whether it is:

  • Satisfiable: there is an assignment to the unbound variables that makes the assertions true, and the solver can provide a model describing those assignments.

  • Unsatisfiable: there is no assignment of values to the unbound variables that can make the assertions true.

SMT solvers take their assertions in a Lisp-like input language called SMT-LIB2. Here is an example of a satisfiable SMT query:

;; `x` is some integer, but we don't know which one.
(declare-const x Int)

;; We do know that `x + 2 = 5`, however.
(assert (= 5 (+ x 2)))

;; Check whether the assertions are satisfiable. In
;; this case, they should be!
(check-sat)

;; Get the model, which has assignments to each of
;; the free variables. The model should report that
;; `x` is `3`!
(get-model)

Open and run this snippet in an online Z3 editor!

Note that even though there isn’t any \(\exists\) existential quantifier in there, the solver is implicitly finding a solution for \(x\) in \(\exists x: x + 2 = 5\), i.e. there exists some \(x\) such that \(x + 2\) equals 5. While some SMT solvers have some support for working with higher-order formulas with explicit \(\exists\) existential and \(\forall\) universal quantifiers nested inside, these modes tend to be much slower and also incomplete. We can only rely on first-order, implicitly \(\exists\) existential queries: that is, formulas with potentially unbound variables and without any nested \(\exists\) existential and \(\forall\) universal quantification.
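
Since the implementation we’ll walk through drives Z3 from Rust rather than via SMT-LIB2 text, here is the same x + 2 = 5 query expressed with the z3 crate. This is a sketch against a recent version of the crate; small details like the eval signature vary between crate versions:

use z3::{
    ast::{Ast, Int},
    Config, Context, SatResult, Solver,
};

fn main() {
    let cfg = Config::new();
    let ctx = Context::new(&cfg);
    let solver = Solver::new(&ctx);

    // `x` is some integer, but we don't know which one.
    let x = Int::new_const(&ctx, "x");

    // We do know that `x + 2 = 5`, however.
    let sum = Int::add(&ctx, &[&x, &Int::from_i64(&ctx, 2)]);
    solver.assert(&sum._eq(&Int::from_i64(&ctx, 5)));

    // The assertions should be satisfiable, and the model
    // should assign `3` to `x`.
    assert_eq!(solver.check(), SatResult::Sat);
    let model = solver.get_model().unwrap();
    println!("x = {}", model.eval(&x, true).unwrap());
}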

We can add a second assertion to our example that makes it unsatisfiable:

(declare-const x Int)

(assert (= 5 (+ x 2)))

;; NEW: also, x + 1 should be 10.
(assert (= 10 (+ x 1)))

;; This time, the result should be unsatisfiable,
;; because there are conflicting requirements for `x`.
(check-sat)
(get-model)

Open and run this snippet in an online Z3 editor!

The assertions 5 = x + 2 and 10 = x + 1 put conflicting requirements on x: the first requires that x is 3, while the second requires that x is 9. No value of x can make both assertions true, so the whole query is unsatisfiable.

Counterexample-Guided Iterative Synthesis

Counterexample-guided iterative synthesis (CEGIS) enables us to solve second-order, exists-forall queries — like our program synthesis problem — with off-the-shelf SMT solvers. CEGIS does this by decomposing these difficult queries into multiple first-order, \(\exists\) existentially quantified queries. These are the kind of queries that off-the-shelf SMT solvers excel at solving. First, we’ll look at CEGIS in general, and after that we’ll examine what is required specifically for component-based CEGIS.

CEGIS begins by choosing an initial, finite set of inputs. There has to be at least one, but it doesn’t really matter where it comes from; we can use a random number generator. Then we start looping. The first step of the loop is finite synthesis, which generates a program that is correct at least for the inputs in our finite set. It may or may not be correct for all inputs, but we don’t know that yet. Next, we take that candidate program and verify it: we want to determine whether it is correct for all inputs (in which case we’re done), or whether there is some input for which the candidate program is incorrect (called a counterexample). If there is a counterexample, we add it to our set, and continue to the next iteration of the loop. The next program that finite synthesis produces will be correct for all the old inputs, and also for this new counterexample. The counterexamples force finite synthesis to come up with more and more general programs that are correct for more and more inputs, until finally it comes up with a fully general program that works for all inputs.

Without further ado, here is the general CEGIS algorithm:

\(\begin{align} & \texttt{CEGIS}(\phi_\mathrm{spec}(\vec{I}, O)): \\ & \qquad S = \langle \text{initial finite inputs} \rangle \\ & \qquad \\ & \qquad \textbf{loop}: \\ & \qquad \qquad \texttt{// Finite synthesis.} \\ & \qquad \qquad \textbf{solve for $P$ in } \exists P,O_0,\ldots,O_{\lvert S \rvert - 1}: \\ & \qquad \qquad \qquad \left( P(S_0) = O_0 \land \phi_\mathrm{spec}(S_0, O_0) \right) \\ & \qquad \qquad \qquad \land \ldots \\ & \qquad \qquad \qquad \land \left( P(S_{\lvert S \rvert - 1}) = O_{\lvert S \rvert - 1} \land \phi_\mathrm{spec}(S_{\lvert S \rvert - 1}, O_{\lvert S \rvert - 1}) \right) \\ & \qquad \qquad \textbf{if } \texttt{unsat}: \\ & \qquad \qquad \qquad \textbf{error} \text{ “no solution”} \\ & \qquad \qquad \\ & \qquad \qquad \texttt{// Verification.} \\ & \qquad \qquad \textbf{solve for $\vec{I}$ in } \exists \vec{I},O: \,\, P(\vec{I}) = O \land \lnot \phi_\mathrm{spec}(\vec{I}, O) \\ & \qquad \qquad \textbf{if } \texttt{unsat}: \\ & \qquad \qquad \qquad \textbf{return $P$} \\ & \qquad \qquad \textbf{else}: \\ & \qquad \qquad \qquad \textbf{append $\vec{I}$ to $S$} \\ & \qquad \qquad \qquad \textbf{continue} \end{align}\)

CEGIS decomposes the exists-forall synthesis problem into two parts:

  1. Finite synthesis: The first query, the finite synthesis query, finds a program that is correct for at least the finite example inputs in \(S\). Here’s its breakdown:

    \( \exists P,O_0,\ldots,O_{\lvert S \rvert - 1}: \) There exists some program \(P\) and outputs \(O_0,\ldots,O_{\lvert S \rvert - 1}\) such that
    \( ( \,\, P(S_0) = O_0 \) \(O_0\) is the output of running the program on inputs \(S_0\)
    \( \land \) and
    \( \phi_\mathrm{spec}(S_0, O_0) \,\, ) \) the specification is satisfied for the inputs \(S_0\) and output \(O_0\),
    \( \land \ldots \) and…
    \( \land \left( P(S_{\lvert S \rvert - 1}) = O_{\lvert S \rvert - 1} \land \phi_\mathrm{spec}(S_{\lvert S \rvert - 1}, O_{\lvert S \rvert - 1}) \right) \) and \(O_{\lvert S \rvert - 1}\) is the output of running the program on inputs \(S_{\lvert S \rvert - 1}\), and the specification is satisfied for these inputs and output.

    Note that this is a first-order, existential query; it is not using nested \(\forall\) universal quantification over all possible inputs! Instead, it is instantiating a new copy of \(P(S_i) = O_i \land \phi_\mathrm{spec}(S_i, O_i)\) for each example in our finite set \(S\).

    For example, if \(S = \langle 3, 4, 7 \rangle\), then the finite synthesis query would be

    \(\begin{align} & \exists P,O_0,O_1,O_2: \\ & \qquad \left( P(3) = O_0 \land \phi_\mathrm{spec}(3, O_0) \right) \\ & \qquad \land \,\, \left( P(4) = O_1 \land \phi_\mathrm{spec}(4, O_1) \right) \\ & \qquad \land \,\, \left( P(7) = O_2 \land \phi_\mathrm{spec}(7, O_2) \right) \\ \end{align}\)

    This inline expansion works because the finite set of inputs \(S\) is much smaller in practice (typically in the tens, if even that many) than the size of the set of all possible inputs (e.g. there are \(2^{32}\) bitvectors 32 bits wide).

    If the query was unsatisfiable, then there is no program that can implement the specification for every one of the inputs in \(S\). Since \(S\) is a subset of all possible inputs, that means that there is no program that can implement the specification for all inputs. And since that is what we are searching for, it means that the search has failed, so we return an error.

    If the query was satisfiable, the resulting program \(P\) satisfies the specification for all the inputs in \(S\), but we don’t know whether it satisfies the specification for all possible inputs or not yet. For example, if \(S = \langle 0, 4 \rangle \), then we know that the program \(P\) is correct when given the inputs \(0\) and \(4\), but it may or may not be correct when given the input \(1\). We don’t know yet.

  2. Verification: Verification takes the program \(P\) produced by finite synthesis and checks whether it satisfies the specification for all inputs. That’s naturally expressed as a \(\forall\) universally quantified query over all inputs, but we can instead ask if there exists any input to the program for which the specification is not satisfied thanks to De Morgan’s law.

    Here’s the breakdown of the verification query:

    \( \exists \vec{I}, O: \) Does there exist some inputs \(\vec{I}\) and output \(O\) such that
    \( P(\vec{I}) = O\) \(O\) is the result of running the program on \(\vec{I}\)
    \( \land \) and
    \( \lnot \phi_\mathrm{spec}(\vec{I}, O) \) the specification is not satisfied.

    If the verification query is unsatisfiable, then there are no inputs to \(P\) for which the specification is not satisfied, which means that \(P\) satisfies the specification for all inputs. If so, this is what we are searching for, and we’ve found it, so return it!

    However, if the verification query is satisfiable, then we’ve discovered a counterexample: a new input \(\vec{I}\) for which the program does not satisfy the specification. That is, the program \(P\) is buggy when given \(\vec{I}\), so \(P\) isn’t the program we are searching for.

    Next, we add the new \(\vec{I}\) to our finite set of inputs \(S\), so that in the next iteration of the loop, we will synthesize a program that produces a correct result when given \(\vec{I}\) in addition to all the other inputs in \(S\).

As the loop iterates, we add more and more inputs to \(S\), forcing the finite synthesis query to produce more and more general programs. Eventually it produces a fully general program that satisfies the specification for all inputs. In the worst case, we are adding every possible input into \(S\): finite synthesis comes up with a program that fails verification when given 1, and then when given 2, and then 3, and so on. In practice, each counterexample \(\vec{I}\) that verification finds tends to be representative of many other inputs that are also currently unhandled-by-\(P\). By adding \(\vec{I}\) to \(S\), the next iteration of finite synthesis will produce a program that handles not just \(\vec{I}\) but also all the other inputs that \(\vec{I}\) was representative of.

For example, finite synthesis might have produced a program that handles all of \(S = \langle 0,1,5,13 \rangle\) but which is buggy when given a positive, even number. Verification finds the counterexample \(I = 2\), which gets appended to \(S\), and now \(S = \langle 0,1,5,13,2 \rangle\). Then, the next iteration of the loop synthesizes a program that doesn’t just also work for 2, but works for all positive even numbers. This is what makes CEGIS effective in practice.

Finally, notice that both finite synthesis and verification are first-order, \(\exists\) existentially quantified queries that off-the-shelf SMT solvers like Z3 can solve.
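
Before we specialize CEGIS to components, it is worth seeing the loop’s skeleton in code. Here is a deliberately tiny, self-contained toy in Rust: the candidate programs are just one operation combining the input with one constant, and both queries are brute-force searches rather than SMT queries. Everything in it (the Op enum, the spec, the search order) is made up for illustration; the real synthesizer swaps both searches for the Z3 queries we build below.

#[derive(Clone, Copy, Debug)]
enum Op {
    Add,
    Xor,
    Or,
    And,
}

fn run(op: Op, c: u8, x: u8) -> u8 {
    match op {
        Op::Add => x.wrapping_add(c),
        Op::Xor => x ^ c,
        Op::Or => x | c,
        Op::And => x & c,
    }
}

// The specification: the output should be the input plus sixteen.
fn spec(x: u8, o: u8) -> bool {
    o == x.wrapping_add(16)
}

// Finite synthesis: find a candidate that is correct for every example input.
fn finite_synthesis(examples: &[u8]) -> Option<(Op, u8)> {
    for op in [Op::Xor, Op::Or, Op::And, Op::Add] {
        for c in 0..=255 {
            if examples.iter().all(|&x| spec(x, run(op, c, x))) {
                return Some((op, c));
            }
        }
    }
    None
}

// Verification: find an input where the candidate violates the spec.
fn verify(op: Op, c: u8) -> Option<u8> {
    (0..=255).find(|&x| !spec(x, run(op, c, x)))
}

fn main() {
    let mut examples = vec![0];
    loop {
        let (op, c) = finite_synthesis(&examples).expect("no solution");
        match verify(op, c) {
            None => {
                println!("synthesized: {:?} x, {}", op, c);
                return;
            }
            Some(counterexample) => examples.push(counterexample),
        }
    }
}

On this toy, the first candidate found is xor with 16 (which happens to agree with add for the single example input 0), verification answers with the counterexample 16, and the next round lands on add with 16, which verifies cleanly.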

CEGIS with Components

Now that we know how CEGIS works in the abstract, let’s dive into how we can use it to synthesize component-based programs.

For every loop-free program that is a composition of components, we can flip the program’s representation into a location mapping:

  • Instead of listing the program line-by-line, defining what component is on each line, we can list components, defining what line the component ends up on.

  • Instead of referencing the arguments to each component by variable name, we can reference either the line on which the argument is defined (if it comes from the result of an earlier component) or as the \(n^\mathrm{th}\) program input.

For example, consider our program to isolate the rightmost zero bit in a word:

isolate_rightmost_zero_bit(x):
    a <- not x
    b <- add 1, x
    c <- and a, b
    return c

We can exactly represent this program with the following location mapping:

{
    inputs: ["x"],

    components: {
        // Line 1: `b <- add 1, x`
        add1: {
            // The line in the program where this
            // component is placed.
            line: 1,

            // Each entry `a` represents where the
            // argument comes from.
            //
            // If `0 <= a < inputs.len()`, then the
            // argument is `inputs[a]`.
            //
            // Otherwise, when `inputs.len() <= a`,
            // then the argument comes from the value
            // defined by the component on line
            // `a - inputs.len()`.
            arguments: [
                // `x`
                0,
            ],
        },

        // Line 2: `c <- and a, b`
        and: {
            line: 2,
            arguments: [
                // `a`
                1,
                // `b`
                2,
            ],
        },

        // Line 0: `a <- not x`
        not: {
            line: 0,
            arguments: [
                // `x`
                0,
            ],
        },
    },
}

With component-based CEGIS, we’ll be synthesizing this kind of location mapping. This lets us represent a whole component-based program with a handful of numbers for lines and argument indices. And numbers are something that we can represent directly in an SMT query.
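
In Rust, a location mapping could be represented with a couple of plain structs; here is a sketch (the actual implementation’s representation may differ):

/// A candidate program, represented as a location mapping.
struct LocationMapping {
    /// The number of program inputs.
    num_inputs: usize,
    /// One entry per component in the library.
    components: Vec<ComponentLocation>,
}

struct ComponentLocation {
    /// The line in the program where this component is placed.
    line: usize,
    /// Where each argument comes from, using the same convention as the
    /// JSON example above: `a < num_inputs` names input `a`, and
    /// otherwise `a` names the value defined on line `a - num_inputs`.
    arguments: Vec<usize>,
}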

Verifying a Component-Based Program

Let’s start with verifying a component-based program before we look at finite synthesis. Verification takes a location mapping, connects the components’ input and output variables together as described by the location mapping, and asks the SMT solver to find a counterexample.

For convenience, so we don’t have to keep repeating \(\vec{I}_0,\ldots,\vec{I}_{N-1}\) all the time, we define \(\textbf{P}\) as the set of all the parameter variables for each component in the library:

\( \textbf{P} = \, \vec{I}_0 \, \cup \, \ldots \, \cup \, \vec{I}_{N-1} \)

And similarly we define \(\textbf{R}\) as the set of all temporary result variables for each component in the library:

\( \textbf{R} = \{O_0, \, \ldots, \, O_{N-1}\} \)

With our running example of isolating the rightmost zero bit, our minimal library consists of

\( \begin{align} \phi_0(I_0, O_0) &= [O_0 = \texttt{bvadd}(1, I_0)] \\ \phi_1(I_1, I_2, O_1) &= [O_1 = \texttt{bvand}(I_1, I_2)] \\ \phi_2(I_3, O_2) &= [O_2 = \texttt{bvnot}(I_3)] \end{align} \)

and therefore its

\( \begin{align} N &= 3 \\ \textbf{P} &= \{ I_0, I_1, I_2, I_3 \} \\ \textbf{R} &= \{ O_0, O_1, O_2 \} \end{align} \)

We want to constrain the whole library to behave according to its individual component specifications. The output of each and component should indeed be the bitwise and of its inputs, and the output of each not component should indeed be the bitwise not of its input, etc… We define \(\phi_\mathrm{lib}\) as the combination of every component specification \(\phi_i\):

\( \phi_\mathrm{lib}(\textbf{P}, \textbf{R}) = \phi_0(\vec{I}_0, O_0) \land \ldots \land \phi_{N-1}(\vec{I}_{N-1}, O_{N-1}) \)

So for our minimal example library, the \(\phi_\mathrm{lib}\) we get is:

\( \begin{align} \phi_\mathrm{lib}(\textbf{P}, \textbf{R}) &= [ O_0 = \texttt{bvadd}(1, I_0) ] \\ &\land [ O_1 = \texttt{bvand}(I_1, I_2) ] \\ &\land [ O_2 = \texttt{bvnot}(I_3) ] \end{align} \)

That is, the library’s constraints are satisfied when all of

  • \(O_0\) is the wrapping addition of \(I_0\) and 1,
  • \(O_1\) is the bitwise and of \(I_1\) and \(I_2\), and
  • \(O_2\) is the bitwise not of \(I_3\).

Because finite synthesis runs before verification, we already have access to the candidate program’s location mapping when we’re constructing our verification query. This location mapping tells us which actual arguments align with which formal parameters of a component. That means we know what the connections are from each component’s input variables to the program inputs and the temporary result variables for other components. We know the dataflow between components.

Let’s make this concrete with our isolating-the-rightmost-zero-bit example. Having produced this candidate program:

a <- not x
b <- add 1, x
c <- and a, b

With this library:

\( \begin{align} \phi_0(I_0, O_0) &= [O_0 = \texttt{bvadd}(1, I_0)] \\ \phi_1(I_1, I_2, O_1) &= [O_1 = \texttt{bvand}(I_1, I_2)] \\ \phi_2(I_3, O_2) &= [O_2 = \texttt{bvnot}(I_3)] \end{align} \)

We know that \(\texttt{a} = O_2\), since it is the result of the not component \(\phi_2(I_3, O_2)\). And since a is the first argument to and a, b, which uses component \(\phi_1(I_1, I_2, O_1)\), we know that \(\texttt{a} = I_1\). Therefore, we know that \(O_2 = I_1\).

We have these equalities from each component input variable \(I_i\) in \(\textbf{P}\) to either some other component’s output variable \(O_j\) in \(\textbf{R}\) or to one of the program inputs \(\vec{I}\). These equalities are given to us directly by the location mapping for the candidate program that we’re verifying.

Additionally, because our candidate program is implicitly returning the last temporary variable c, which is the result of the and component \(\phi_1(I_1, I_2, O_1)\), and because the \(O\) in \(\phi_\mathrm{spec}(\vec{I}, O)\) represents the result of the whole program, we know that \(O = O_1\).

If we put all these equalities together for our example program we get:

\( \left( I_0 = x \right) \land \left( I_1 = O_2 \right) \land \left( I_2 = O_0 \right) \land \left( I_3 = x \right) \land \left( O = O_1 \right)\)

This represents all the connections between our library’s various components and to the candidate program’s inputs and output. If you imagine connecting the components together like a circuit, this represents all the wires between each component.

We define these component-connecting equalities as \(\phi_\mathrm{conn}\), and its general definition is:

\( \begin{align} \phi_\mathrm{conn}(\vec{I}, O, \textbf{P}, \textbf{R}) &= \left( O = O_\mathrm{last} \right) \\ & \land \left( \vec{I}_0 \, = \, \vec{V}_0 \right) \\ & \land \, \ldots \\ & \land \left( \vec{I}_{N-1} \, = \, \vec{V}_{N-1} \right) \end{align} \)

Where

  • \(\vec{V}_i\) are the actual arguments that the candidate program passes into the \(i^\mathrm{th}\) component \(\phi_i\). Each \(\vec{V}_i\) is made up of entries from either the program’s inputs \(\vec{I}\) or from temporary results from \(\textbf{R}\) that are defined by earlier components in the program. This is equivalent to the arguments field defined for each component in our example location mapping’s components map.

  • \(O_\mathrm{last}\) is the output variable for the component on the last line of the program, according to our candidate program’s location mapping.

Once again, let’s break that down:

\( \left( O = O_\mathrm{last} \right) \) The output of the whole program is equal to the result of the component on the last line of the program,
\( \land \) and
\( \left( \vec{I}_0 \, = \, \vec{V}_0 \right) \) the first component's inputs and its assigned arguments are equal to each other,
\( \land \, \ldots \) and...
\( \left( \vec{I}_{N-1} \, = \, \vec{V}_{N-1} \right) \) the last component's inputs and its assigned arguments are equal to each other.

Note that both \(O_\mathrm{last}\) and each \(\vec{V}_i\) are properties of the candidate program’s location mapping, and are known at “compile time” of the verification query. They are not variables that we are \(\exists\) existentially or \(\forall\) universally quantifying over in the query itself. We expand them inline when constructing the verification query.

Ok, so with all of that out of the way, we can finally define the verification constraint that we use in component-based CEGIS:

\( \begin{align} & \exists \vec{I}, O, \textbf{P} , \textbf{R} : \\ & \qquad \phi_\mathrm{conn}(\vec{I}, O, \textbf{P}, \textbf{R}) \land \phi_\mathrm{lib}(\textbf{P}, \textbf{R}) \land \lnot \phi_\mathrm{spec}(\vec{I}, O) \end{align} \)

The verification constraint asks: given that we’ve connected the components together as described by the candidate program’s location mapping, are there any inputs for which the specification is not satisfied?

Let’s break that down once more:

\( \exists \vec{I}, O, \textbf{P} , \textbf{R} : \) Does there exist some inputs and output such that
\(\phi_\mathrm{conn}(\vec{I}, O, \textbf{P}, \textbf{R}) \) when the components are connected together as described by our candidate program's location mapping,
\( \land \,\, \phi_\mathrm{lib}(\textbf{P}, \textbf{R}) \) and when the components behave as defined by our library,
\( \land \,\, \lnot \phi_\mathrm{spec}(\vec{I}, O) \) the specification is not satisfied?

Finding a solution to this query gives us a new counterexample \(\vec{I}\) that we can add to our set of examples \(S\) for future iterations of the CEGIS loop. Failure to find any solution to this query means that the candidate location mapping corresponds to a program that is correct for all inputs, in which case we’re done.
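
To make this concrete, here is roughly what the verification query for our running example’s candidate program looks like when posed through the z3 crate. The variable names mirror the formulas above, the real implementation builds this query generically from the location mapping, and for the spec I am simply using the bit-trick expression bvand(bvnot(x), bvadd(x, 1)) itself:

use z3::{
    ast::{Ast, BV},
    Config, Context, SatResult, Solver,
};

fn main() {
    let cfg = Config::new();
    let ctx = Context::new(&cfg);
    let solver = Solver::new(&ctx);

    let x = BV::new_const(&ctx, "x", 32); // Program input.
    let o = BV::new_const(&ctx, "O", 32); // Program output.
    let one = BV::from_u64(&ctx, 1, 32);

    // Component results R and parameters P.
    let o0 = BV::new_const(&ctx, "O0", 32);
    let o1 = BV::new_const(&ctx, "O1", 32);
    let o2 = BV::new_const(&ctx, "O2", 32);
    let i0 = BV::new_const(&ctx, "I0", 32);
    let i1 = BV::new_const(&ctx, "I1", 32);
    let i2 = BV::new_const(&ctx, "I2", 32);
    let i3 = BV::new_const(&ctx, "I3", 32);

    // phi_lib: every component behaves according to its specification.
    solver.assert(&o0._eq(&one.bvadd(&i0))); // O0 = bvadd(1, I0)
    solver.assert(&o1._eq(&i1.bvand(&i2)));  // O1 = bvand(I1, I2)
    solver.assert(&o2._eq(&i3.bvnot()));     // O2 = bvnot(I3)

    // phi_conn: the wiring described by the candidate's location mapping.
    solver.assert(&i0._eq(&x));
    solver.assert(&i1._eq(&o2));
    solver.assert(&i2._eq(&o0));
    solver.assert(&i3._eq(&x));
    solver.assert(&o._eq(&o1));

    // not phi_spec: is there an input where the output is wrong?
    let spec = o._eq(&x.bvnot().bvand(&x.bvadd(&one)));
    solver.assert(&spec.not());

    // Unsat means no counterexample exists: the candidate is correct.
    assert_eq!(solver.check(), SatResult::Unsat);
}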

Finite Synthesis of a Component-Based Program

Finite synthesis composes the library components into a program that will correctly handle all the given example inputs. It does this by querying the SMT solver for a location mapping that contains assignments of components to lines in the program, and assignments of variables to each component’s actual arguments.

Recall our example location mapping:

{
    inputs: ["x"],

    components: {
        // Line 1: `b <- add 1, x`
        add1: {
            line: 1,
            arguments: [0], // `[x]`
        },

        // Line 2: `c <- and a, b`
        and: {
            line: 2,
            arguments: [1, 2], // `[a, b]`
        },

        // Line 0: `a <- not x`
        not: {
            line: 0,
            arguments: [0], // `[x]`
        },
    },
}

To encode a location mapping in the finite synthesis query, every component parameter in \(\textbf{P}\) and every component result in \(\textbf{R}\) gets an associated location variable. The finite synthesis query is searching for an assignment to these location variables.

We call the set of all location variables \(L\), and we refer to a particular location variable as \(l_x\) where \(x\) is either a component result in \(\textbf{R}\) or component parameter in \(\textbf{P}\):

\( L = \{ \, l_x \, \vert \, x \in \textbf{P} \cup \textbf{R} \, \} \)

The location variable for a result \(l_{O_i}\) is equivalent to the line field for a component in our JSON-y syntax for example location mappings. It determines the line in the program that the component is assigned to, and therefore where its temporary result is defined.

The location variable for a parameter \(l_p\) is equivalent to an entry in a component’s arguments list in our JSON-y syntax. These location variables determine where the associated parameter gets its value from: either the \(i^\mathrm{th}\) program input or the temporary result defined on the \(j^\mathrm{th}\) line of the program.

To use one index space for both line numbers and program inputs, we follow the same convention that we did with entries in the arguments list in the JSON syntax:

  • When \(l_x\) is less than the number of program inputs, then it refers to the \({l_x}^\mathrm{th}\) program input.

  • Otherwise, when \(l_x\) is greater than or equal to the number of program inputs, then subtract the number of inputs from \(l_x\) to get the line number it’s referring to.

Value of \(l_x\)                     Refers To
0                                    Input 0                               (program inputs \(\vec{I}\))
1                                    Input 1
...                                  ...
\(\lvert \vec{I} \rvert - 1\)        Input \(\lvert \vec{I} \rvert - 1\)
\(\lvert \vec{I} \rvert + 0\)        Line 0                                (line numbers)
\(\lvert \vec{I} \rvert + 1\)        Line 1
...                                  ...
\(\lvert \vec{I} \rvert + N - 1\)    Line \(N - 1\)
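
In code, decoding a location value back into what it refers to is just a bounds check against the number of program inputs, the same convention as the arguments entries in the JSON-y syntax:

/// What a location variable's value points to.
enum Location {
    Input(usize),
    Line(usize),
}

fn decode_location(l: usize, num_inputs: usize) -> Location {
    if l < num_inputs {
        Location::Input(l)
    } else {
        Location::Line(l - num_inputs)
    }
}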

All loop-free, component-based programs can be described with a location mapping. However, the reverse is not true: not all location mappings describe a valid program.

Consider this location mapping:

{
    inputs: ["x"],

    components: {
        // Line 0: `a <- add 1, x`
        add1: {
            line: 0,
            arguments: [0], // `[x]`
        },

        // Line 0: `a <- sub x, 1`
        sub1: {
            line: 0,
            arguments: [0], // `[x]`
        }
    }
}

This location mapping is inconsistent because it wants to put both of its components on line zero of the program, but each line in the program can only use a single component.

To forbid the solver from providing bogus answers of this sort, we add the consistency constraint \(\psi_\mathrm{cons}\) to the finite synthesis query. It requires that no pair of distinct component result location variables can be assigned the same line.

\( \psi_\mathrm{cons}(L) = \bigwedge\limits_{x,y \in \textbf{R}, x \not\equiv y} \left( l_x \neq l_y \right) \)

Once more, let’s break that down:

\( \bigwedge\limits_{x,y \in \textbf{R}, x \not\equiv y} \) For each \(x,y\) pair of component results, where \(x\) and \(y\) are not the same result variable,
\( \left( l_x \neq l_y \right) \) the location of \(x\) and the location of \(y\) are not the same.

But there are even more ways that a location mapping might describe an invalid program! Consider this location mapping:

{
    inputs: ["x"],

    components: {
        // Line 0: `a <- and x, b`
        and: {
            line: 0,
            arguments: [0, 2], // `[x, b]`
        },

        // Line 1: `b <- sub a, 1`
        sub1: {
            line: 1,
            arguments: [1], // `[a]`
        }
    }
}

That location mapping describes this program:

f(x):
    a <- and x, b
    b <- sub a, 1

The b temporary result is used before it is defined, and in order to compute b, we need to compute a, but computing a requires computing b, which requires computing a, etc… We have a cycle on our hands.

To forbid mappings that correspond to bogus programs with dataflow cycles, we use the acyclicity constraint \(\psi_\mathrm{acyc}\). This constraint enforces that a particular component’s parameters are defined before this component’s line.

\( \psi_\mathrm{acyc}(L) = \bigwedge\limits_{i=0}^{N-1} \left( \bigwedge\limits_{p \in \vec{I}_i} \left( l_p < l_{O_i} \right) \right) \)

Let’s break that down:

\( \bigwedge\limits_{i=0}^{N-1} \) For each component index \(i\),
\( \bigwedge\limits_{p \in \vec{I}_i} \) and for each of the \(i^\mathrm{th}\) component's input parameters,
\( l_p < l_{O_i} \) the location of the parameter should be less than the location of the component, meaning that the parameter is defined before the component is used.

The only other way that location mappings can be invalid is if a location is out of bounds of the program inputs and line numbers, so we’re ready to define the well-formed-program constraint \(\psi_\mathrm{wfp}\). This constraint enforces that any location mapping we synthesize will correspond to a well-formed program.

A well-formed program is

  • consistent,

  • acyclic,

  • its component parameter locations point to either a program input or a line number, and

  • its component temporary result locations point to a line number.

Let’s define \(M\) as the number of program inputs plus the number of components in the library:

\( M = \lvert \vec{I} \rvert + N \)

A component parameter location \(l_{p \in \textbf{P}}\) can point to either

  • a program input in the range from zero to the number of program inputs: \(0 \leq l_{p} \lt \lvert \vec{I} \rvert\), or

  • a line number, which corresponds to the \(N\) locations following the program inputs: \(\lvert \vec{I} \rvert \leq l_p \lt M \).

Since those two ranges are contiguous, it means that component parameter locations ultimately fall within the range \(0 \leq l_p \lt M\).

A component temporary result location \(l_{r \in \textbf{R}}\) must point to a line number, which means that they fall within the range \(\lvert \vec{I} \rvert \leq l_r \lt M\).

Put all that together and we get the well-formed-program constraint \(\psi_\mathrm{wfp}\):

\( \begin{align} \psi_\mathrm{wfp}(L) &= \bigwedge\limits_{p \in \textbf{P}} \left( 0 \leq l_p \lt M \right) \\ & \land \, \bigwedge\limits_{r \in \textbf{R}} \left( \lvert \vec{I} \rvert \leq l_r \lt M \right) \\ & \land \, \psi_\mathrm{cons}(L) \\ & \land \, \psi_\mathrm{acyc}(L) \end{align} \)

And here is its breakdown:

\( \bigwedge\limits_{p \in \textbf{P}} \left( 0 \leq l_p \lt M \right) \) Each component parameter location \(l_p\) points to either a program input or a line number,
\( \land \, \bigwedge\limits_{r \in \textbf{R}} \left( \lvert \vec{I} \rvert \leq l_r \lt M \right) \) and each component result location \(l_r\) points to a line number,
\( \land \, \psi_\mathrm{cons}(L) \) and the location mapping is consistent,
\( \land \, \psi_\mathrm{acyc}(L) \) and the location mapping is acyclic.
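
Here is a sketch of how \(\psi_\mathrm{wfp}\), with \(\psi_\mathrm{cons}\) and \(\psi_\mathrm{acyc}\) folded in, might be built with the z3 crate, using integer-valued location variables. The function shape and the variable-naming scheme are mine, and a real implementation might well encode locations as bitvectors instead of mathematical integers:

use z3::{
    ast::{Ast, Bool, Int},
    Context,
};

/// Build psi_wfp for a library whose components have the given arities,
/// for a program with `num_inputs` inputs. Creates the location variables
/// internally; a fuller version would return them for later use.
fn wfp<'ctx>(ctx: &'ctx Context, num_inputs: u64, arities: &[u64]) -> Bool<'ctx> {
    let n = arities.len() as u64;
    let m = Int::from_u64(ctx, num_inputs + n);
    let zero = Int::from_u64(ctx, 0);
    let inputs_len = Int::from_u64(ctx, num_inputs);

    let mut clauses = vec![];

    // Each result location l_{O_i} must point at a line number.
    let result_locs: Vec<Int> = (0..n)
        .map(|i| Int::new_const(ctx, format!("loc_O{}", i)))
        .collect();
    for l in &result_locs {
        clauses.push(inputs_len.le(l));
        clauses.push(l.lt(&m));
    }

    // Each parameter location l_p points at an input or a line, and
    // psi_acyc: a parameter is defined before its component's line.
    for (i, &arity) in arities.iter().enumerate() {
        for p in 0..arity {
            let lp = Int::new_const(ctx, format!("loc_I{}_{}", i, p));
            clauses.push(zero.le(&lp));
            clauses.push(lp.lt(&m));
            clauses.push(lp.lt(&result_locs[i]));
        }
    }

    // psi_cons: no two components are placed on the same line.
    for (i, x) in result_locs.iter().enumerate() {
        for y in &result_locs[i + 1..] {
            clauses.push(x._eq(y).not());
        }
    }

    let clause_refs: Vec<&Bool> = clauses.iter().collect();
    Bool::and(ctx, &clause_refs)
}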

Now that we can constrain finite synthesis to only produce location mappings that correspond to well-formed programs, all we need to do is encode the connections between components and the behavior of the library. This should sound familiar: we need the finite synthesis equivalent of \(\phi_\mathrm{conn}\) and \(\phi_\mathrm{lib}\) from verification. And it turns out that \(\phi_\mathrm{lib}\) doesn’t need to be tweaked at all, because the behavior of the library remains the same whether we are in verification or finite synthesis. But while \(\phi_\mathrm{conn}\) was checking a set of already-known connections between components, in finite synthesis we are searching for those connections, so we need a different query.

These connections define the dataflow between components. They are the wires in the circuit built from our components. A connection from one component’s result to another component’s input means that we need to constrain the result and input variables to be equal in the finite synthesis query. For example, if component \(\phi_i\) gets placed on line 3, and parameter \(p\) is assigned the location that refers to line 3, then \(p\) must take on the same value as that component’s output \(O_i\).

This leads us to our definition of \(\psi_\mathrm{conn}\): for every pair of location variables \(l_x\) and \(l_y\), if they refer to the same location, then \(x\) and \(y\) must have the same value.

\( \psi_\mathrm{conn}(L, \vec{I}, O, \textbf{P}, \textbf{R}) = \bigwedge\limits_{x,y \in \vec{I} \cup \textbf{P} \cup \textbf{R} \cup \{ O \} } \left( \left( l_x = l_y \right) \implies \left( x = y \right) \right) \)

Here is its piece-by-piece breakdown:

\( \bigwedge\limits_{x,y \in \vec{I} \cup \textbf{P} \cup \textbf{R} \cup \{ O \} } \) For each pair of location variables \(l_x\) and \(l_y\), where \(x\) and \(y\) are either a program input, or a component's parameter, or a component's temporary result, or the program output,
\( \left( l_x = l_y \right) \) if the location variables refer to the same location,
\( \implies \) then
\( \left( x = y \right) \) \(x\) and \(y\) must have the same value.
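
And \(\psi_\mathrm{conn}\) in the same style as the \(\psi_\mathrm{wfp}\) sketch above: given every value variable paired with its location (the program inputs and the output get fixed, constant locations, while parameters and results get the symbolic location variables), equal locations force equal values. Again, a sketch with illustrative names:

use z3::{
    ast::{Ast, Bool, Int, BV},
    Context,
};

/// `vars` pairs each value's bitvector with its location. For every pair
/// of entries, assert (l_x = l_y) implies (x = y).
fn conn<'ctx>(ctx: &'ctx Context, vars: &[(Int<'ctx>, BV<'ctx>)]) -> Bool<'ctx> {
    let mut clauses = vec![];
    for (i, (lx, x)) in vars.iter().enumerate() {
        for (ly, y) in &vars[i + 1..] {
            clauses.push(lx._eq(ly).implies(&x._eq(y)));
        }
    }
    let clause_refs: Vec<&Bool> = clauses.iter().collect();
    Bool::and(ctx, &clause_refs)
}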

We’re finally ready to define our finite synthesis query for a location mapping. This query asks the solver to find some location mapping that corresponds to a well-formed program that satisfies our specification for each example input in \(S\). In other words, it must enforce that

  • the location mapping corresponds to a well-formed program, and

  • when the components are connected as described by the location mapping, and when the components behave as described by our library,

  • then the specification is satisfied for each of our example inputs in \(S\).

Here it is, finally, our finite synthesis query:

\( \begin{align} & \exists L, O_0, \ldots, O_{\vert S \rvert - 1}, \textbf{P}_0, \ldots, \textbf{P}_{\vert S \rvert - 1}, \textbf{R}_0, \ldots, \textbf{R}_{\vert S \rvert - 1}: \\ & \qquad \psi_\mathrm{wfp}(L) \,\, \land \\ & \qquad \qquad \bigwedge\limits_{i=0}^{\lvert S \rvert - 1} \left( \phi_\mathrm{lib}(\textbf{P}_i, \textbf{R}_i) \land \psi_\mathrm{conn}(L, S_i, O_i, \textbf{P}_i, \textbf{R}_i) \land \phi_\mathrm{spec}(S_i, O_i) \right) \\ \end{align} \)

That’s quite a mouthful, so, one last time, let’s pull it apart and break it down:

\( \exists L, O_0, \ldots, O_{\vert S \rvert - 1}, \textbf{P}_0, \ldots, \textbf{P}_{\vert S \rvert - 1}, \textbf{R}_0, \ldots, \textbf{R}_{\vert S \rvert - 1}: \) There exists some location mapping \(L\), and program outputs, component parameters, and component results variables for each example in \(S\), such that
\( \psi_\mathrm{wfp}(L) \,\, \land \) the location mapping is a well-formed program, and
\( \bigwedge\limits_{i=0}^{\lvert S \rvert - 1} \) for each example input index \(i\),
\( \phi_\mathrm{lib}(\textbf{P}_i, \textbf{R}_i) \) the components behave as described by the library,
\( \land \,\, \psi_\mathrm{conn}(L, S_i, O_i, \textbf{P}_i, \textbf{R}_i) \) and the components are connected as described by the location mapping,
\( \land \,\, \phi_\mathrm{spec}(S_i, O_i) \) and the specification is satisfied for the \(i^\mathrm{th}\) example input.

When the solver finds a satisfiable assignment for this query, we get a new candidate location mapping that corresponds to a program that is correct for each of the example inputs in \(S\). When the solver finds the query unsatisfiable, that means there is no location mapping that corresponds to a program that is correct for each of the example inputs, which means that our search has failed.

Implementation

I implemented a loop-free, component-based program synthesizer in Rust that uses Z3 to solve the finite synthesis and verification queries. The implementation’s repository is over here.

Our target language has all the operations you would expect for working with fixed-width integers. It has arithmetic operations like add and mul. It has bitwise operations like and and xor. It has comparison operations like eq, which evaluates to one if the comparison is true and zero otherwise. Finally, it has a select operation that takes three operands: a condition, a consequent, and an alternative. When the condition is non-zero, it evaluates to the consequent, and otherwise it evaluates to the alternative.

Values are neither signed nor unsigned. For operations like division that behave differently on signed and unsigned integers, we have a different instruction for each behavior: div_s for signed division and div_u for unsigned division.
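
Summarizing that instruction set as a Rust type, the operators might look something like the sketch below. The names and shapes here are my guesses for illustration, not necessarily the implementation’s actual definitions:

/// An operand names the program input or earlier instruction whose value
/// is being used (hypothetical; mirrors the location-mapping convention).
pub struct Id(pub usize);

pub enum Operator {
    Var,                // A program input.
    Const(u64),         // A constant value.
    Add(Id, Id),
    Mul(Id, Id),
    And(Id, Id),
    Xor(Id, Id),
    Eq(Id, Id),         // One if equal, zero otherwise.
    DivS(Id, Id),       // Signed division.
    DivU(Id, Id),       // Unsigned division.
    Select(Id, Id, Id), // Condition, consequent, alternative.
}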

Program Representation

A program is a sequence of instructions:

pub struct Program {
    instructions: Vec<Instruction>,
}

 
