## On Math Writing

I recently came across a wonderful set of notes on mathematical writing. It is a set of notes written by students who took Knuth’s class on mathematical writing. The link is here. It’s wonderful to see the amount of thought that Knuth puts into his writing down to worrying about whether he should put commas to separate a big number like 1233843 (and whether the rules change when the number is inside a formula). The notes contain some funny anecdotes.

## Paper: A Generative Model for Parsing Natural Language to Meaning Representations (67/365)

Today’s paper once again constructs a semantic parser. The parser is trained on sentences paired with their meaning representations. But there is no finer labeling of the correspondence between words and meaning tokens.

The meaning representation in this paper takes the form of a tree whose nodes have both natural language words and meaning representation tokens. They say that the meaning representation is variable-free but I have to trace another reference to see what that precisely means (for later). An example meaning representation is shown below Meaning representation as a tree

Every node in the tree takes on the following form where $X_1$ to $X_k$ are also semantic cateogries. Some examples are ‘River: largest(River)’ and ‘Num: count(State)’. $\displaystyle \text{semantic category} \rightarrow \text{function}(X_1, \dots, X_k)$

A hybrid tree is an extension of this tree that captures both the sentence and the meaning representation. The only difference is that every node can also emit NL tokens. The leaves of an MR are always NL tokens. Generation of a tree is viewed as a Markov process where 1) we start with a root production, 2) and we recursively expand its parameters, 3) at each node we can emit NL tokens.

Unlike in a PCFG parsing task where the correspondence between NL words and syntactic structures is available, the current model does not have access to this data. Thus, we need to compute the expected parameters from all possible tree derivations. The authors adapt the inside-outside algorithm for this purpose. I won’t go into the details at this point. Again, I’ll wait to link it with other papers.

I kind of lost interest towards the end because there is an extra re-ranking phase after finding the most likely hybrid-tree for a given sentence. Anyway, as this is a paper is pretty old (2008), let’s see what more recent papers do.

## Paper: PPDB: The Paraphrase Database (66/365)

In the last post, the authors made use of paraphrases. It turns out that there is in fact a paraphrase database and it’s quite interesting how it was created. It starts with the basic observation that given translated texts from english to some foreign language, if two english phrases $e_1$ and $e_2$ translate to the same foreign phrase $f$ then we may assume that $e_1$ and $e_2$ have similar meaning; i.e., that they paraphrase eachother.

The paper goes a little further than this. The goal is to extract a paraphrase rule as follows $\displaystyle \mathbf{r}_p : = C \rightarrow \langle e_1, e_2, \sim_p, \phi_p \rangle$

where $C$ is a non-terminal, $e_1$ and $e_2$ are mix of terminal and non-terminal symbols where both share the same set of non-terminal symbols (given by the correspondence $~_p$, and a feature vector $\phi_p$.

Such a rule is construction from a syntactic machine translator, where two applied translation rules having the same syntactic construct and foreign phrase are matched $\displaystyle \mathbf{r}_ 1 : = C \rightarrow \langle f, e_1, \sim_1, \phi_ 1 \rangle \\ \mathbf{r}_ 2 : = C \rightarrow \langle f, e_2, \sim_2, \phi_ 2 \rangle$

where once again $f$ and $e$ share the same non-terminals so that the above rule for pairing $e_1$ and $e_2$ can be constructed. The paper also defines a way to combine the feature vectors but I will skip that here.

## Paper: Building a Semantic Parser Overnight (65/365)

In a previous post I talked about a paper that encoded questions as programs which in turn defined a procedure that executed in the environment of a battleship game. In other words, each question defines a program that defines the semantics of the question.

Of course, that paper did not consider the translation of the natural text into the program. Today’s paper by Wang et. al considers building a semantic parser: to translate a natural language text into a precise internal program that can be executed against an environment (database) to produce an answer.

This paper encodes the semantics in a langugage called lambda DCS based on lambda calculus. It’s compositional form is quite interesting. Every logical form is either a set of entities (unary) or a set of entity pairs (binary). A unary $u$ and a binary $b$ can be composed to produce a binary $b.u$ which are pairs in $b$ restricted to pairs whose second entities are present in $u$. The other operators are the set operations on the unary terms.

They have also defined a canonical translation of programs into natural language.

"start date of student alice whose university is brown university"
R(date).(student.Alice ∩ university.Brown)

Since there are only so many ways to compose these programs, the canonical phrases can be generated easily. What they then do is use Amazon’s Mechanical Turk to convert these canonical phrases to something more familiar. The above example is converted to “When did alice start attending brown university?”. These pairs are then used to train a paraphrasing model $\displaystyle p_\theta(z, c | x, w) \propto \exp \left( \phi(c, z, x, w)^T \theta\right)$

where $\phi(z,c,x,w)$ is a feature vector and $\theta$ is a parameter vector; $z$ is the logical form (i.e. a program); $c$ is the canonical phrase; $x$ is the human phrase for the canonical phrase; and $w$ is the database (represented as a set of triples) that can be queried by the logical forms. I’ll come back to the details of the implementation as I find similar papers.

Posted in Uncategorized | Tagged , , | 1 Comment

## Paper: Concepts in a Probabilistic Language of Thought (64/365)

In a previous post, I described a paper that wrote questions as a LISP expression. The authors of that paper take their input from today’s paper by Goodman et. al. I think it’s clear to everyone that concepts, as humans use them, is highly compositional. We can use them abstractly, concretely, suppositionally, and in every way imaginable. It’s interesting to ponder whether this is the reason why our languages are so flexible or whether languague’s flexibility has made our thought so flexible.

So, the first claim of this paper is that “concepts have a language-like compositionality and encode probabilistic knowledge. These features allow them to be extended productively to new situations and support flexible reasoning and learning by probabilistic inference.” This is fairly uncontroversial.

What the authors consider then is a formal system capable of expressing the same. Their suggestion is a probabilistic programming language called Church. The develop various helper constructs in this language that help with capturing ‘randomness’, ‘conditionality’, and ‘queries’. There isn’t must to show about these programs, as they are just regular programs. Compositionality is obvious in that functions can be used and extended. Uncertainty is encoded by impure functions which return sampled values. But these functions are memoized so that when called by the same input the same answer is returned.

The paper doesn’t introduce anything computational but I’ll bring this back when I find papers that develop this concept computationally.

## Paper: Attention Is All You Need (63/365)

I can’t believe after my gripe yesterday that I have picked today’s paper. This paper claims that a feed-forward network with self-attention trains faster and performs better than a recurrent or a convolutional neural network on translation tasks. I have to come back to this at a later time though because I need to brush up on much of the terminology.

Posted in Uncategorized | Tagged , , | 1 Comment

## Paper: Question Asking as Program Generation (62/365)

I am very bored by language models that just pump the training data through something and then predict the word following a partially revealed sentence. I mean, I don’t learn anything from them apart from a lot of technical wizardry. So, I am trying to look for papers that lean towards true comprehension.

This paper “Question Asking as Program Generation” is from NIPS 2017. They have collected 605 questions that humans have asked in the context of several partially revealed battleship game board with a view to gain information about the positions of the hidden ships. So each of these questions comes with a context (the game board). The authors aim to create a system to predict the question given a context.

## Questions as Programs

In order to generate questions we need something compositional. The authors turn to programming. In this case, lisp programs. Every question is written as a lisp program. For example, How many tiles have the same colour as tile 2F? is written as (setSize (colouredTiles (colour 2F))). These are converted by hand.

They have a limited set of allowed predicates and constructions as the domain is limited to this battleship board. As a result, they define a simple context-free grammar that covers all possible valid programs.

## Question Usefulness

In the context of the battleship game, the human subjects were asked to come up with ‘useful’ questions. They were given plenty of training. It’s great having a grammar to generate valid programs but how many of them are actually useful? To that end, they define several features. I’ll describe these and skip over the probability model that is optimized.

The first is the Expected Information Gain value of a question. That is, how much uncertainty is reduced in knowing the actual board state after the answers (averaged over multiple contexts) to that question. So, questions which reveal more (on average) about the actual state get a higher score.

The second is complexity. Favoring the above feature only can lead to very long questions. So, this feature is thrown in to favor shorter questions. This measures the complexity of a question based on the grammar.

Another feature is the answer type. That is whether it is a yes/no question or a location or a colour.

## Conclusion

What’s interesting is that they were able to generate questions (i.e. programs) given a context. It’s interesting to tackle a problem like this. It is similar to say a robot learning language with respect to its senses. Learning can be very quick because the language has to be computed unlike unsupervised language models.

Posted in Uncategorized | Tagged , , | 1 Comment

## A History of Natural Language Processing

Last weekend, I gave a talk at the Bangalore Computer Vision Meetup group on “A history of Natural Language Processing”. I tried to compile together a perspective of the subject that may be unfamiliar to many. Take a look here:  BCVM | History and Evolution of NLP | Mr. Naren Sundaravaradan, Digital Aristotle

## DSL for Generative Models – Next Steps (61/365)

I’m going to sketch out the next few things to plan and code for the DSL library. I have so far provided the ability to describe the network but I haven’t yet provided a way to describe the distributions. For instance, in the HMM Model, we ought to be able to say what the distribution of Symbols is and so on. I’ll start with the ability to pick two distributions: Dirichlet and Multinomial. We can cover many models with just these two. When I provide a way to specify what type of distribution each node is, I should be able to change the distributions at will without affecting the network; for instance, using a continuous response as opposed to a discrete response in a HMM.

After this, I will want to create a function in the Gibbs module that can take in a Reader and sample the distributions. In the case of HMM, this would mean sampling the Transition distributions and the Symbols distributions by reading the network to figure out their priors and support.

Finally, with the sampled distributions and a Reader I will write a sampler that produces a new Reader. In the case of HMM, this means sampling the new Topic variables. These steps cover the (uncollapsed) Gibbs sampling technique.

Looking ahead even further, I intend to write a method to compute the density of the observed variables (having marginalized out the latent variables). I will do this using the annealed importance sampling method as described in this paper “Evaluation Methods for Topic Models” by Hanna M. Wallah et. al. In the case of the HMM, this amounts to computing the probability of Symbol given Symbols and Transition while marginalizing out the Topic variables.

## Lexicographical Ordering (60/365)

> import Data.Ord (comparing)


This is completely random but I thought it was a neat use of laziness. You are familiar with lexicographical ordering? Haskell’s compare of lists implements this.

ghci> [1,2] < [1,3]
True

ghci> [1,3,4] < [1,3,4,5]
True

ghci> [2,4] < [2,3]
False


Note that this favors shorter strings.

ghci> [1,2] < [1,2,3]
True

ghci> [1,3] < [1,3,4]
True


For whatever reason, I wanted to favor longer strings. How can we do this? First, note that the above is equivalent to doing the comparison after appending an infinite list of zeros to each operand (assuming we are using only positive numbers).

ghci> let aug xs = xs ++ cycle 
ghci> comparing aug [1,2] [1,3]
LT

ghci> comparing aug [1,3,4] [1,3,4,5]
LT

ghci> comparing aug [2,4] [2,3]
GT


If I, instead, append an infinite list of a large number I can get what I want.

ghci> let aug xs = xs ++ cycle 
ghci> comparing aug [1,2] [1,2,3]
GT

ghci> comparing aug [1,3] [1,3,4]
GT