## Learning a Language (59/365)

I want to start another aside because this is something I spend a significant amount of time thinking about: how do you learn a new language when you are approaching 30? Living in India, I think it’s a shame that people in the tech industry communicate only in English at work, because India is one of the few countries that still speaks a vast number of tongues (though that number is dwindling at an alarming rate).

Auto drivers in Bangalore speak at least four languages with ease: Hindi, Kannada, Tamil, and Telugu. And English is a given. Everyone ought to speak as many languages as possible without inhibition and should be encouraged to do so, even if all you manage in that language is a “good morning”. There is just so much to gain: new friends, perspectives, and traditions. And in the most beautiful way – through all these differences – it makes us aware of just how similar (not identical) we all are.

That brings me back to how one can learn a language quickly. I’ve always been fascinated by individual words in different languages. Part of that has to do with simply wanting to know the origin of a word: you end up tracing it back to different countries, and sometimes end up recounting the history of two countries along the way. Ponder the words “algebra” or “philosophy”.

I studied Hindi in school fifteen years ago and I hated it. It was boring and tiresome. There was no joy in the learning process at all. Now, I wish I could speak every language in the world. At this point I direct you to the wonderful book by Anthony Burgess, “Language Made Plain”. I want to start by making some random notes on the easy and difficult things I am facing as I learn Hindi once again.

1. Learning a new alphabet is a problem. I really wish all languages used a common alphabet. For me this means that if I have to properly learn Kannada, there is little point in starting with the script. That will take time.

2. Find as many audio/video resources as possible. Once I had refreshed the basics, I scoured the web for children’s stories and now try to listen to them while commuting to and from work.

3. I find listening the easiest, and I have improved hugely at it in a short time. Speaking is harder, mainly because I don’t get the chance to practice it, and it takes time to construct a grammatically correct sentence. Forget writing for now.

4. Picking up vocabulary is the easiest part. I used to think it was the main problem, but it’s negligible compared to sentence structure, gender modifications, and, more importantly, learning the phrases in the current zeitgeist.

5. I need a way to measure my progress easily and to make sure improvement has not stalled. I haven’t yet found a good one. What I do now is write down words I don’t recognize while listening to some audio and look them up later. But my main issue right now is figuring out how to improve my spoken form, given that I am unable to practice out in the open regularly.

6. I like writing. My idea right now is to find a good and simple way to exploit this as an alternative to having to find people to practice with daily. What I want is a way to quickly exercise my ability to create short, grammatically correct sentences in various tenses.

Anyway, I’ll make a few posts now and then about the approaches that do and do not work for me when learning a language.

## DSL for Generative Models (58/365)

I’ve now cleaned up (in befb0f3cca0c212e368497e86f030aa96355be18) the Reader and Writer interfaces and added them to Statistics.GModeling.Gibbs. I’ve removed references to Support and simply parameterized over a key type k and a value type v.

> data Reader k v = Reader
>   {
>     -- | Number of available indices
>     size :: Int
>     -- | Read the value at the given key
>   , readn :: Int -> k -> v
>     -- | Create a copy for writing only
>   , copy :: IO (Writer k v)
>   }
>
> data Writer k v = Writer
>   {
>     -- | Write the value at the given key
>     writen :: Int -> k -> v -> IO ()
>   }
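To make the interfaces concrete, here is a toy instantiation of my own (not from the repository) over a flat association list, with `copy` backed by an `IORef`; the `listReader` name and the storage choice are mine, purely for illustration:

```haskell
import Data.IORef

-- The interfaces as defined above.
data Reader k v = Reader
  { size  :: Int
  , readn :: Int -> k -> v
  , copy  :: IO (Writer k v)
  }

data Writer k v = Writer
  { writen :: Int -> k -> v -> IO () }

-- Toy instance over an association list keyed by (index, key);
-- writes prepend to an IORef so the newest value shadows older ones.
listReader :: Eq k => [((Int, k), v)] -> Reader k v
listReader xs = Reader
  { size  = length xs
  , readn = \i k -> head [v | ((i', k'), v) <- xs, i' == i, k' == k]
  , copy  = do
      ref <- newIORef xs
      return (Writer (\i k v -> modifyIORef ref (((i, k), v) :)))
  }
```

This is only a sketch; the point of the design is precisely that the user can swap in a more efficient backing store than a linear scan over a list.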


I’ve also simplified the type of Indexed and added an implementation of Reader and Writer for HMM in Statistics.GModeling.Models.HMM.

## DSL for Generative Models (57/365)

I am putting together what I have so far in a repository here. So far (133e22dc979d988706aafe52a346cee004f70ca5), it contains

1. Statistics.GModeling.DSL
2. Statistics.GModeling.Models.HMM
3. Statistics.GModeling.Models.LDA
4. Statistics.GModeling.Models.FixedTreeLDA

Will continue building the pieces in upcoming posts.

## Sharing a Birthday (56/365)

I think most people have heard that you need surprisingly few people in a room before two of them end up sharing a birthday. But I never bothered to work it out. Let me do that.

First, forget leap years. Let a year have $365$ days. Note that if you have 366 people in the room you are guaranteed that two people will share a birthday. On the other end of the spectrum, if there are only two people in the room, then the probability that they share a birthday is one minus the number of ways they cannot share a birthday divided by the number of ways we can assign them birthdays: $1 - \frac{(365)(364)}{(365)(365)}$.

In general, let’s say there are $k$ people in the room. The following gives the number of ways to assign different birthdays to each of the $k$ people.

$\displaystyle D(k) = (365)(365-1)(365-2)\dots(365-k+1) = (365)_k$

Note that when $k > 365$ we get $D(k) = 0$.

The number of ways to assign any birthday to each of the $k$ people.

$\displaystyle N(k) = 365^k$

So, the probability that at least one pair out of $k$ people will share a birthday is given by

$\displaystyle 1 - \frac{D(k)}{N(k)} = 1 - \frac{(365)_k}{365^k}$

Graphing it below. By $k=41$ the probability is already at $0.9$.
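Since the graph itself isn’t reproduced here, a few lines of Haskell (my own sketch) compute the same probability numerically:

```haskell
-- Probability that at least two of k people share a birthday,
-- assuming 365 equally likely birthdays: 1 - (365)_k / 365^k.
-- For k > 365 one factor is zero, so the probability is exactly 1.
shareProb :: Int -> Double
shareProb k = 1 - product [fromIntegral (365 - i) / 365 | i <- [0 .. k - 1]]

-- shareProb 23 ≈ 0.507 (the classic point where the odds pass one half)
-- shareProb 41 ≈ 0.903
```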

## Response Variable (55/365)

A quick aside. I was thinking about how response variables are attached to generative models. For instance, if we want, say, binary classification on documents, we would normally 1) take the dot product of the topic vector with a global vector of coefficients and then 2) pass it through a logistic function. This is fine, but the problem is that if we are sampling the topic model with a Gibbs sampler, we have to interleave some form of gradient descent to fit the coefficients of the dot product and of the logistic function.

This is painful. I’d rather have everything be probabilistic. I was pondering ways to do this. Let’s say that the topic vector has dimension $n$. Define two multinomial distributions $\{t_i\}$ and $\{f_i\}$ over those $n$ dimensions. Now define the response through this procedure

1. Given the topic distribution $z$
2. Sample $i \sim t$
3. Sample $j \sim f$
4. Sample response $\sim \text{Bernoulli}\left( \frac{z_i}{z_i + z_j} \right)$

Essentially, the idea is $t$ and $f$ end up finding mutually exclusive dimensions such that either one or the other has a high value in order to produce the correct class label with high probability. I’d like to try it after I get done with the Gibbs code.
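The steps above can be sketched in a few lines of Haskell. The helper names are mine, and to keep the sketch pure and testable the three uniform draws in $[0,1)$ are passed in by the caller rather than sampled in IO:

```haskell
-- Draw an index from a discrete distribution given by (unnormalized) weights.
sampleIndex :: [Double] -> Double -> Int
sampleIndex ws u = go 0 (u * sum ws) ws
  where
    go i r (w : rest)
      | r < w     = i
      | otherwise = go (i + 1) (r - w) rest
    go i _ [] = i - 1  -- guard against rounding at the upper edge

-- Steps 1-4: pick i ~ t, j ~ f, then flip Bernoulli(z_i / (z_i + z_j)).
response :: [Double] -> [Double] -> [Double] -> (Double, Double, Double) -> Bool
response z t f (u1, u2, u3) =
  let i = sampleIndex t u1
      j = sampleIndex f u2
      p = (z !! i) / (z !! i + z !! j)
  in  u3 < p
```

With $t$ concentrated on a dimension where $z$ is large and $f$ on one where it is small, the Bernoulli parameter approaches one, which is exactly the mutual-exclusion behavior described above.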

## DSL for Generative Models – Interpreter (54/365)

The next step is for the library to have access to the latent variable data. I also don’t want the library to decide how to store the data because the user will have a much better idea of what is the most efficient way to store it. In terms of the library, I will need both read and write access to this data.

There are two kinds of data I need access to. In the case of the HMM, for example, the values of Topic and Symbol form the Support of some distribution, while the values of Transition and Symbols identify which of their respective indexed distributions is currently active. So, consider the following signature

> type Support = Int
> -- Int -> a -> Either (a,Int) Support


For now, I’ll restrict Support to only integers. The second type takes an integer index and a label, and returns either a (label, index) pair or the support, depending on what is being asked. I’ll make this clear at the end with the HMM example.

> data Reader a = Reader
>   {
>     size :: Int
>   , read :: Int -> a -> Either (a,Int) Support
>   , copy :: IO (Writer a)
>   }


The size field tells us how many indices there are ([0..size-1]); read is the function we just saw; and copy creates a writable copy of the data.

> data Writer a = Writer
>   {
>     write :: Int -> a -> Support -> IO ()
>   }


Here write allows us to write a new value at some Int index for the label a. Let me take the HMM as an example again. For simplicity, let’s say we store the sequences as a list of lists.

> data HMMLabels = Alpha | Beta | Transition
>                | Initial | Topic | Symbols | Symbol
>
>
> type Sequences = [[(Int,Int)]]


> reader :: Sequences -> Reader HMMLabels
> reader ss = Reader
>   {
>     size = length (concat ss)
>   , read = \idx -> let (i,j) = indices !! idx
>                        (topic,symbol) = ss !! i !! j
>                        (prev_topic,_) = ss !! i !! (j-1)
>                    in \name -> case name of
>                                  Topic -> Right topic
>                                  Symbol -> Right symbol
>                                  Symbols -> Left (Topic,topic)
>                                  Transition -> if j==0
>                                                then Left (Initial,0)
>                                                else Left (Topic,prev_topic)
>   , copy = error "undefined"
>   }
>   where indices = concat $ map (\(i,s) -> [(i,j) | j <- [0..length s-1]]) (zip [0..] ss)


Note how the signature of read encourages caching; that is, the library can first supply only the index and then repeatedly query the resulting partial function for various names. This seems to be alright so far but I’ll only know if this holds up when I look at managing the distributions in the next post.

## DSL for Generative Models – Interpreter (53/365)

In this post, I write some functions to interpret the DSL. Specifically, I present some functions to figure out the children and parents of a node and discover what the prior, observed, and latent variables are.

> import Control.Monad (msum)
> import Data.Maybe (mapMaybe)
> import Data.List (nub,(\\))


Recapping the DSL.

> data Indexed a = Only a | a :@ [a]
> data Edge a = Indexed a :-> a
> type Network a = [Edge a]


Enumerating the names.

> names :: Eq a => Network a -> [a]
> names = nub . concatMap f
>   where f (Only a :-> b) = [a,b]
>         f ((p :@ _) :-> a) = [p,a]


Enumerating the children.

> children :: Eq a => Network a -> a -> [a]
> children xs a = concatMap f xs
>   where f (Only p :-> c) | p == a = [c]
>         f ((p :@ is) :-> c) | p == a || elem a is = [c]
>         f _ = []


Enumerating the parents.

> parents :: Eq a => Network a -> a -> [a]
> parents xs a = concatMap f xs
>   where f (Only p :-> c) | c == a = [p]
>         f ((p :@ _) :-> c) | c == a = [p]
>         f _ = []


Enumerating the observed variables.

> observed :: Eq a => Network a -> [a]
> observed n = filter (null . children n) . names $ n


Enumerating the priors.

> prior :: Eq a => Network a -> [a]
> prior n = filter (null . parents n) . names $ n


Enumerating the latent variables.

> latent :: Eq a => Network a -> [a]
> latent xs = names xs \\ (prior xs ++ observed xs)


Index of a random variable

> indexOf :: Eq a => Network a -> a -> Maybe [a]
> indexOf xs a = msum (map f xs)
>   where f ((p :@ is) :-> _) | p == a = Just is
>         f _ = Nothing


Running on the hmm example.

> data HMMLabels = Alpha | Beta | Transition
>                | Initial | Topic | Symbols | Symbol
>     deriving (Show,Eq)
>
> hmm :: Network HMMLabels
> hmm =
>   [
>     Only Alpha                      :-> Transition
>   , Only Beta                       :-> Symbols
>   , (Transition :@ [Initial,Topic]) :-> Topic
>   , (Symbols :@ [Topic])            :-> Symbol
>   ]

ghci> observed hmm
[Symbol]

ghci> prior hmm
[Alpha,Beta]

ghci> latent hmm
[Transition,Symbols,Topic]

ghci> indexOf hmm Alpha
Nothing

ghci> indexOf hmm Transition
Just [Initial,Topic]

ghci> indexOf hmm Symbols
Just [Topic]

ghci> children hmm Alpha
[Transition]

ghci> parents hmm Alpha
[]

ghci> children hmm Topic
[Topic,Symbol]

ghci> parents hmm Topic
[Transition]


Next time, I want to look at how to specify the distributions.

## DSL for Generative Models – Examples (52/365)

In the previous post I attempted to introduce a DSL for probabilistic models inspired by the plate notation. Let’s try to see if we can define LDA with it.

> data LDALabels = Alpha | Beta | Topics | Topic
>                | Doc | Symbols | Symbol
>
> lda :: Network LDALabels
> lda =
>   [
>     Only Alpha :-> Topics
>   , Only Beta :-> Symbols
>   , (Topics :@ [Doc]) :-> Topic
>   , (Symbols :@ [Topic]) :-> Symbol
>   ]


Here is a topic model where topics are arranged in nodes of a fixed binary tree for each document. Let’s say the tree has depth $d$, then the distribution is parameterized by a TopicPath distribution (to select a leaf) and a TopicDepth distribution (to select a node along the path).

> data LDATreeLabels =
>     Alpha1 | Alpha2 | Beta
>   | TopicDepth | TopicPath | Topic | Doc
>   | Symbols | Symbol
>
> ldaTree :: Network LDATreeLabels
> ldaTree =
>   [
>     Only Alpha1         :-> TopicDepth
>   , Only Alpha2         :-> TopicPath
>   , Only Beta           :-> Symbols
>   , (TopicPath :@ [Doc])  :-> Topic
>   , (TopicDepth :@ [Doc]) :-> Topic
>   , (Symbols :@ [Topic])  :-> Symbol
>   ]


I think it looks pretty good so far. Let’s see how it holds up once I start interpreting the DSL.

## DSL for Generative Models (51/365)

The backlog grows longer. I changed jobs two weeks ago and it has upset my routine. No matter. Here we go again. I want to deviate from my problem-solving mode for a while and use a few posts to develop a little library for inference in and use of probabilistic models. Why?

I’ve spent a lot of time coding generative models from scratch, and it’s repetitive, painful, and error-prone. My current job has put me back in the thick of machine learning research, and I hope I’ll get to use this library there. The problem with coding models from scratch is keeping track of the distributions and carefully constructing the conditional probabilities for each latent variable that needs to be sampled. For some reason I wasn’t keen on using existing libraries, and I wanted to have a go at making one myself.

After many false starts, I decided that I’d first write a DSL that can be used to describe a generative model the way it’s usually represented by plate notation. The key to plate notation is that it makes it easy to represent indexed distributions on top of the underlying Bayesian network constructed by drawing nodes and edges.

I’ll keep the Hidden Markov Model as a running example. First, the user defines his own type that provides names for the various random variables.

> data HMMLabels = Alpha | Beta | Transition
>                | Initial | Topic | Symbols | Symbol


The library now needs to provide a way to define the generative model on top of this. As a first step, we need to be able to define the plates; that is, to tell when a name is indexed by another name. In the case of the HMM, the symbol distributions are indexed by a topic, and the topic distribution is either the initial one or is indexed by the previous topic.

Suppose the library provides the following

> data Indexed a = Only a | a :@ [a]


Then we can write

> -- Symbols :@ [Topic]
> -- Transition :@ [Initial,Topic]


And we can also define variables that stand on their own

> -- Only Alpha
> -- Only Beta


Next, is to allow the edges to be defined. Suppose we provide

> data Edge a = Indexed a :-> a
> type Network a = [Edge a]


The whole network can now be defined

> hmm :: Network HMMLabels
> hmm =
>   [
>     Only Alpha                      :-> Transition
>   , Only Beta                       :-> Symbols
>   , (Transition :@ [Initial,Topic]) :-> Topic
>   , (Symbols :@ [Topic])            :-> Symbol
>   ]


Next time, I’ll try to define a couple more models with this language to see if I am on the right track and then start writing an interpreter.


## Random Variables – Problem (50/365)

Moving on to the next chapter, “Random Variables – I”, take a look at the following problem: show that a random variable $\theta$ is continuous if and only if $P(\theta = x) = 0$ for all $x \in \mathbb{R}$.

(Forward direction) Suppose $\theta$ is a continuous random variable; then its distribution function $F$ is also continuous by definition. Hence, $P(\theta = x) = F(x) - F(x-) = F(x) - \lim_{y \uparrow x} F(y) = 0$ by continuity of $F$.

(Reverse direction) Suppose $P(\theta = x) = 0$ for all $x \in \mathbb{R}$ and let $F$ be the corresponding distribution function. Fix $x$ and let $A_n = (x - \frac{1}{n}, x]$, so that $A_n \supseteq A_{n+1}$ and $\cap_{n=1}^\infty A_n = \{x\}$. Then $P(\cap_{n=1}^\infty A_n) = \lim_n P(A_n)$ because $P$ is countably additive (continuity from above), so $\lim_n P(A_n) = P(\theta = x) = 0$. Since $P(a,b] = F(b) - F(a)$, this gives $\lim_n \left( F(x) - F(x - \frac{1}{n}) \right) = 0$, i.e. $F(x-) = F(x)$. As a distribution function is always right-continuous, $F$ is continuous at every $x$, which is the definition of a continuous random variable.