## Formal Power Series and Determinants

It seems that wherever you go in mathematics linear algebra comes chasing you. Here is one such example where you are simply dealing with subsets and nothing more. You are probably aware that subsets of {1, …, n} can be represented formally using the power series

1 ≤ i1 < i2 < … < ik ≤ nxi1xi2xik

The above expression can be more succintly written as the following product.

(1 + x1)(1 + x2)…(1 + xn)

The sign alternating version of this is given below where subsets of even size have  + 1 as their coefficient and subsets of odd size have  − 1 as their coefficient.

(1 − x1)(1 − x2)…(1 − xn)

Surprisingly, this can be computed as a determinant! Recall, that a determinant of a square matrix is

det(A)1 ≤ i, j ≤ n = ∑w ∈ Snsgn(w)A1w(1)Anw(n)

Now, construct a matrix A as follows 1) Aij = xj if i ≤ j, 2) Aij = 1 if i = j + 1, and 3) Aij = 0 otherwise.

Convince yourself that the determinant of A is (1 − x1)(1 − x2)…(1 − xn).

## On Math Writing

I recently came across a wonderful set of notes on mathematical writing. It is a set of notes written by students who took Knuth’s class on mathematical writing. The link is here. It’s wonderful to see the amount of thought that Knuth puts into his writing down to worrying about whether he should put commas to separate a big number like 1233843 (and whether the rules change when the number is inside a formula). The notes contain some funny anecdotes.

## Paper: A Generative Model for Parsing Natural Language to Meaning Representations (67/365)

Today’s paper once again constructs a semantic parser. The parser is trained on sentences paired with their meaning representations. But there is no finer labeling of the correspondence between words and meaning tokens.

The meaning representation in this paper takes the form of a tree whose nodes have both natural language words and meaning representation tokens. They say that the meaning representation is variable-free but I have to trace another reference to see what that precisely means (for later). An example meaning representation is shown below

Every node in the tree takes on the following form where $X_1$ to $X_k$ are also semantic cateogries. Some examples are ‘River: largest(River)’ and ‘Num: count(State)’.

$\displaystyle \text{semantic category} \rightarrow \text{function}(X_1, \dots, X_k)$

A hybrid tree is an extension of this tree that captures both the sentence and the meaning representation. The only difference is that every node can also emit NL tokens. The leaves of an MR are always NL tokens. Generation of a tree is viewed as a Markov process where 1) we start with a root production, 2) and we recursively expand its parameters, 3) at each node we can emit NL tokens.

Unlike in a PCFG parsing task where the correspondence between NL words and syntactic structures is available, the current model does not have access to this data. Thus, we need to compute the expected parameters from all possible tree derivations. The authors adapt the inside-outside algorithm for this purpose. I won’t go into the details at this point. Again, I’ll wait to link it with other papers.

I kind of lost interest towards the end because there is an extra re-ranking phase after finding the most likely hybrid-tree for a given sentence. Anyway, as this is a paper is pretty old (2008), let’s see what more recent papers do.

## Paper: PPDB: The Paraphrase Database (66/365)

In the last post, the authors made use of paraphrases. It turns out that there is in fact a paraphrase database and it’s quite interesting how it was created. It starts with the basic observation that given translated texts from english to some foreign language, if two english phrases $e_1$ and $e_2$ translate to the same foreign phrase $f$ then we may assume that $e_1$ and $e_2$ have similar meaning; i.e., that they paraphrase eachother.

The paper goes a little further than this. The goal is to extract a paraphrase rule as follows

$\displaystyle \mathbf{r}_p : = C \rightarrow \langle e_1, e_2, \sim_p, \phi_p \rangle$

where $C$ is a non-terminal, $e_1$ and $e_2$ are mix of terminal and non-terminal symbols where both share the same set of non-terminal symbols (given by the correspondence $~_p$, and a feature vector $\phi_p$.

Such a rule is construction from a syntactic machine translator, where two applied translation rules having the same syntactic construct and foreign phrase are matched

$\displaystyle \mathbf{r}_ 1 : = C \rightarrow \langle f, e_1, \sim_1, \phi_ 1 \rangle \\ \mathbf{r}_ 2 : = C \rightarrow \langle f, e_2, \sim_2, \phi_ 2 \rangle$

where once again $f$ and $e$ share the same non-terminals so that the above rule for pairing $e_1$ and $e_2$ can be constructed. The paper also defines a way to combine the feature vectors but I will skip that here.

## Paper: Building a Semantic Parser Overnight (65/365)

In a previous post I talked about a paper that encoded questions as programs which in turn defined a procedure that executed in the environment of a battleship game. In other words, each question defines a program that defines the semantics of the question.

Of course, that paper did not consider the translation of the natural text into the program. Today’s paper by Wang et. al considers building a semantic parser: to translate a natural language text into a precise internal program that can be executed against an environment (database) to produce an answer.

This paper encodes the semantics in a langugage called lambda DCS based on lambda calculus. It’s compositional form is quite interesting. Every logical form is either a set of entities (unary) or a set of entity pairs (binary). A unary $u$ and a binary $b$ can be composed to produce a binary $b.u$ which are pairs in $b$ restricted to pairs whose second entities are present in $u$. The other operators are the set operations on the unary terms.

They have also defined a canonical translation of programs into natural language.

"start date of student alice whose university is brown university"
R(date).(student.Alice ∩ university.Brown)

Since there are only so many ways to compose these programs, the canonical phrases can be generated easily. What they then do is use Amazon’s Mechanical Turk to convert these canonical phrases to something more familiar. The above example is converted to “When did alice start attending brown university?”. These pairs are then used to train a paraphrasing model

$\displaystyle p_\theta(z, c | x, w) \propto \exp \left( \phi(c, z, x, w)^T \theta\right)$

where $\phi(z,c,x,w)$ is a feature vector and $\theta$ is a parameter vector; $z$ is the logical form (i.e. a program); $c$ is the canonical phrase; $x$ is the human phrase for the canonical phrase; and $w$ is the database (represented as a set of triples) that can be queried by the logical forms. I’ll come back to the details of the implementation as I find similar papers.

## Paper: Concepts in a Probabilistic Language of Thought (64/365)

In a previous post, I described a paper that wrote questions as a LISP expression. The authors of that paper take their input from today’s paper by Goodman et. al. I think it’s clear to everyone that concepts, as humans use them, is highly compositional. We can use them abstractly, concretely, suppositionally, and in every way imaginable. It’s interesting to ponder whether this is the reason why our languages are so flexible or whether languague’s flexibility has made our thought so flexible.

So, the first claim of this paper is that “concepts have a language-like compositionality and encode probabilistic knowledge. These features allow them to be extended productively to new situations and support flexible reasoning and learning by probabilistic inference.” This is fairly uncontroversial.

What the authors consider then is a formal system capable of expressing the same. Their suggestion is a probabilistic programming language called Church. The develop various helper constructs in this language that help with capturing ‘randomness’, ‘conditionality’, and ‘queries’. There isn’t must to show about these programs, as they are just regular programs. Compositionality is obvious in that functions can be used and extended. Uncertainty is encoded by impure functions which return sampled values. But these functions are memoized so that when called by the same input the same answer is returned.

The paper doesn’t introduce anything computational but I’ll bring this back when I find papers that develop this concept computationally.