∑_{1 ≤ i1 < i2 < … < ik ≤ n}x_{i1}x_{i2}…x_{ik}
The above expression can be more succinctly written as the following product.
(1 + x_{1})(1 + x_{2})…(1 + x_{n})
The sign alternating version of this is given below where subsets of even size have + 1 as their coefficient and subsets of odd size have − 1 as their coefficient.
(1 − x_{1})(1 − x_{2})…(1 − x_{n})
Surprisingly, this can be computed as a determinant! Recall that the determinant of an n × n matrix A is
det(A)_{1 ≤ i, j ≤ n} = ∑_{w ∈ Sn}sgn(w)A_{1w(1)}…A_{nw(n)}
Now, construct a matrix A as follows 1) A_{ij} = x_{j} if i ≤ j, 2) A_{ij} = 1 if i = j + 1, and 3) A_{ij} = 0 otherwise.
Convince yourself that the determinant of A is (1 − x_{1})(1 − x_{2})…(1 − x_{n}).
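Both expansions are easy to check numerically. A small Haskell sketch (my own code, not part of the argument above) that expands the sums over subsets directly:

```haskell
import Data.List (subsequences)

-- Sum over all subsets of the product of their elements;
-- the empty subset contributes the leading 1.
subsetSum :: Num a => [a] -> a
subsetSum xs = sum [product s | s <- subsequences xs]

-- The product form (1 + x1)(1 + x2)...(1 + xn).
productForm :: Num a => [a] -> a
productForm = product . map (1 +)

-- Sign-alternating version: even-size subsets get +1, odd-size get -1.
altSubsetSum :: Num a => [a] -> a
altSubsetSum xs = sum [(-1) ^ length s * product s | s <- subsequences xs]

altProductForm :: Num a => [a] -> a
altProductForm = product . map (1 -)
```

For instance, `subsetSum [2,3,4]` and `productForm [2,3,4]` both give 60, and the alternating versions both give -6.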
The meaning representation in this paper takes the form of a tree whose nodes have both natural language words and meaning representation tokens. They say that the meaning representation is variable-free but I have to trace another reference to see what that precisely means (for later). An example meaning representation is shown below
Every node in the tree takes on the following form, where the arguments are also semantic categories. Some examples are ‘River: largest(River)’ and ‘Num: count(State)’.
A hybrid tree is an extension of this tree that captures both the sentence and the meaning representation. The only difference is that every node can also emit NL tokens. The leaves of a hybrid tree are always NL tokens. Generation of a tree is viewed as a Markov process where 1) we start with a root production, 2) we recursively expand its parameters, and 3) at each node we can emit NL tokens.
Unlike in a PCFG parsing task where the correspondence between NL words and syntactic structures is available, the current model does not have access to this data. Thus, we need to compute the expected parameters from all possible tree derivations. The authors adapt the inside-outside algorithm for this purpose. I won’t go into the details at this point. Again, I’ll wait to link it with other papers.
I kind of lost interest towards the end because there is an extra re-ranking phase after finding the most likely hybrid tree for a given sentence. Anyway, as this paper is pretty old (2008), let’s see what more recent papers do.
The paper goes a little further than this. The goal is to extract a paraphrase rule as follows
where the left-hand side is a non-terminal and the two right-hand sides are a mix of terminal and non-terminal symbols, with both sides sharing the same set of non-terminal symbols (given by the correspondence), along with a feature vector.
Such a rule is constructed from a syntactic machine translation system, where two applied translation rules having the same syntactic construct and foreign phrase are matched,
where once again both sides share the same non-terminals, so that the above rule pairing them can be constructed. The paper also defines a way to combine the feature vectors but I will skip that here.
Of course, that paper did not consider the translation of the natural text into the program. Today’s paper by Wang et al. considers building a semantic parser: translating natural language text into a precise internal program that can be executed against an environment (a database) to produce an answer.
This paper encodes the semantics in a language called lambda DCS, based on lambda calculus. Its compositional form is quite interesting. Every logical form is either a set of entities (a unary) or a set of entity pairs (a binary). A unary and a binary can be composed: the binary is restricted to the pairs whose second entities are present in the unary. The other operators are the set operations on the unary terms.
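To make the composition concrete, here is a toy Haskell model of unaries and binaries (my own simplification, not the paper's lambda DCS implementation):

```haskell
-- Toy model of lambda DCS values: a unary is a set of entities,
-- a binary is a set of entity pairs.
type Unary  e = [e]
type Binary e = [(e, e)]

-- Restrict a binary to the pairs whose second entity lies in the unary.
restrict :: Eq e => Binary e -> Unary e -> Binary e
restrict b u = [ (x, y) | (x, y) <- b, y `elem` u ]

-- Projecting the restriction to first components gives the usual join,
-- e.g. "students whose university is Brown".
joinB :: Eq e => Binary e -> Unary e -> Unary e
joinB b u = [ x | (x, _) <- restrict b u ]
```

For example, `joinB [("alice","brown"), ("bob","mit")] ["brown"]` picks out `["alice"]`.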
They have also defined a canonical translation of programs into natural language.
"start date of student alice whose university is brown university"
R(date).(student.Alice ∩ university.Brown)
Since there are only so many ways to compose these programs, the canonical phrases can be generated easily. What they then do is use Amazon’s Mechanical Turk to convert these canonical phrases to something more familiar. The above example is converted to “When did alice start attending brown university?”. These pairs are then used to train a paraphrasing model
where the model is parameterized by a feature vector and a parameter vector, and is defined over the logical form (i.e. a program), the canonical phrase, the human phrase for the canonical phrase, and the database (represented as a set of triples) that can be queried by the logical forms. I’ll come back to the details of the implementation as I find similar papers.
So, the first claim of this paper is that “concepts have a language-like compositionality and encode probabilistic knowledge. These features allow them to be extended productively to new situations and support flexible reasoning and learning by probabilistic inference.” This is fairly uncontroversial.
What the authors then consider is a formal system capable of expressing the same. Their suggestion is a probabilistic programming language called Church. They develop various helper constructs in this language that help with capturing ‘randomness’, ‘conditionality’, and ‘queries’. There isn’t much to show about these programs, as they are just regular programs. Compositionality is obvious in that functions can be used and extended. Uncertainty is encoded by impure functions which return sampled values. But these functions are memoized so that when called with the same input the same answer is returned.
The paper doesn’t introduce anything computational but I’ll bring this back when I find papers that develop this concept computationally.
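As a sketch of the memoization idea (my own toy Haskell code, not Church and not the paper's), one can wrap an impure sampling function so that repeated calls with the same argument return the same draw. The tiny generator below is a stand-in so the sketch needs no extra packages:

```haskell
import           Data.IORef
import qualified Data.Map as M

-- A tiny linear congruential generator (a stand-in for real randomness).
nextSeed :: Int -> Int
nextSeed s = (1103515245 * s + 12345) `mod` 2147483648

-- An impure "sampling" function: an eye colour drawn from the current seed.
eyeColour :: IORef Int -> String -> IO String
eyeColour seedRef _name = do
  s <- readIORef seedRef
  writeIORef seedRef (nextSeed s)
  return (["brown", "blue", "green"] !! (s `mod` 3))

-- Church-style stochastic memoization: cache the first sampled value per
-- argument, so the same question always gets the same answer.
memoizeSample :: Ord a => (a -> IO b) -> IO (a -> IO b)
memoizeSample sample = do
  cache <- newIORef M.empty
  return $ \x -> do
    m <- readIORef cache
    case M.lookup x m of
      Just v  -> return v
      Nothing -> do
        v <- sample x
        modifyIORef cache (M.insert x v)
        return v
```

Calling the memoized function twice on "alice" returns the same colour both times, while a fresh argument triggers a fresh sample.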
This paper “Question Asking as Program Generation” is from NIPS 2017. The authors collected 605 questions that humans asked in the context of several partially revealed battleship game boards, with a view to gaining information about the positions of the hidden ships. So each of these questions comes with a context (the game board). The authors aim to create a system that predicts the question given a context.
In order to generate questions we need something compositional. The authors turn to programming, in this case lisp programs. Every question is written as a lisp program. For example, “How many tiles have the same colour as tile 2F?” is written as `(setSize (colouredTiles (colour 2F)))`. These are converted by hand.
They have a limited set of allowed predicates and constructions as the domain is limited to this battleship board. As a result, they define a simple context-free grammar that covers all possible valid programs.
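As an illustration of the idea (with a made-up mini-grammar of my own, not the authors' actual one), a typed grammar over a tiny board can be enumerated exhaustively:

```haskell
-- Hypothetical mini-grammar of battleship questions, plus a naive enumerator.
data Q = IsHit String            -- yes/no question about a tile
       | ColourOf String         -- colour question about a tile
       | HowMany Set             -- number question, like setSize
  deriving Show

data Set = ColouredTiles Colour  -- all tiles of a given colour
  deriving Show

data Colour = Blue | Red | Purple | ColourAt String
  deriving Show

-- A tiny 2x2 board.
tiles :: [String]
tiles = [ [c, r] | c <- "AB", r <- "12" ]

-- Every program this grammar licenses over the board.
questions :: [Q]
questions = map IsHit tiles
         ++ map ColourOf tiles
         ++ [ HowMany (ColouredTiles c)
            | c <- [Blue, Red, Purple] ++ map ColourAt tiles ]
```

Even this toy grammar licenses 15 programs over a 2x2 board, which shows why the features below are needed to separate useful questions from merely valid ones.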
In the context of the battleship game, the human subjects were asked to come up with ‘useful’ questions. They were given plenty of training. It’s great having a grammar to generate valid programs but how many of them are actually useful? To that end, they define several features. I’ll describe these and skip over the probability model that is optimized.
The first is the Expected Information Gain value of a question. That is, how much uncertainty is reduced in knowing the actual board state after the answers (averaged over multiple contexts) to that question. So, questions which reveal more (on average) about the actual state get a higher score.
The second is complexity. Favoring the above feature only can lead to very long questions. So, this feature is thrown in to favor shorter questions. This measures the complexity of a question based on the grammar.
Another feature is the answer type. That is whether it is a yes/no question or a location or a colour.
What’s interesting is that they were able to generate questions (i.e. programs) given a context. It’s an interesting kind of problem to tackle; it is similar to, say, a robot learning language grounded in its senses. Learning can be very quick because the language can be executed, unlike in unsupervised language models.
The next step is to specify what distribution each node, such as `Transition` or `Symbols`, is. I’ll start with the ability to pick two distributions: Dirichlet and multinomial. We can cover many models with just these two. When I provide a way to specify what type of distribution each node is, I should be able to change the distributions at will without affecting the network; for instance, using a continuous response as opposed to a discrete response in an HMM.

After this, I will want to create a function in the `Gibbs` module that can take in a `Reader` and sample the distributions. In the case of the HMM, this would mean sampling the `Transition` distributions and the `Symbols` distributions by reading the network to figure out their priors and support.

Finally, with the sampled distributions and a `Reader`, I will write a sampler that produces a new `Reader`. In the case of the HMM, this means sampling the new `Topic` variables. These steps cover the (uncollapsed) Gibbs sampling technique.

Looking ahead even further, I intend to write a method to compute the density of the observed variables (having marginalized out the latent variables). I will do this using the annealed importance sampling method as described in the paper “Evaluation Methods for Topic Models” by Hanna M. Wallach et al. In the case of the HMM, this amounts to computing the probability of `Symbol` given `Symbols` and `Transition` while marginalizing out the `Topic` variables.
> import Data.Ord (comparing)
This is completely random but I thought it was a neat use of laziness. You are familiar with lexicographical ordering? Haskell’s `compare` on lists implements it.
ghci> [1,2] < [1,3]
True
ghci> [1,3,4] < [1,3,4,5]
True
ghci> [2,4] < [2,3]
False
Note that this favors shorter strings.
ghci> [1,2] < [1,2,3]
True
ghci> [1,3] < [1,3,4]
True
For whatever reason, I wanted to favor longer strings. How can we do this? First, note that the above is equivalent to doing the comparison after appending an infinite list of zeros to each operand (assuming we are using only positive numbers).
ghci> let aug xs = xs ++ cycle [0]
ghci> comparing aug [1,2] [1,3]
LT
ghci> comparing aug [1,3,4] [1,3,4,5]
LT
ghci> comparing aug [2,4] [2,3]
GT
If I, instead, append an infinite list of a large number I can get what I want.
ghci> let aug xs = xs ++ cycle [9999999]
ghci> comparing aug [1,2] [1,2,3]
GT
ghci> comparing aug [1,3] [1,3,4]
GT
Auto drivers in Bangalore speak at least four languages with ease: Hindi, Kannada, Tamil, and Telugu. And, English is a given. Everyone ought to speak as many languages as possible without inhibitions and should be encouraged to do so even if all you manage to speak is a “good morning” in that language. There is just so much to gain including new friends and perspectives and traditions and in the most beautiful way – through all these differences – it makes us aware of just how similar (not identical) we all are.
That brings me back to how one can learn a language quickly. I’ve always been fascinated with individual words in different languages. Part of that has to do with simply wanting to know the origin of a word and you end up tracing it back to different countries and sometimes end up recounting the history of two countries along the way. Ponder on the words “algebra” or “philosophy”.
I studied Hindi in school fifteen years ago and I hated it. It was boring and tiresome. There was no joy in the learning process at all. Now, I wish I could speak every language in the world. At this point I’ll point you to the wonderful book by Anthony Burgess, “Language Made Plain”. I want to start by making some random notes on the easy and difficult things I am facing as I learn Hindi once again.
Learning a new alphabet is a problem. I really wish all languages used a common alphabet. For me, this means that if I am to properly learn Kannada, there is little point in trying to read the script at first. That will take time.
Find as many audio/video resources as possible. Once I had refreshed the basics, I scoured the web for children’s stories and now try to listen to them on my way to and from work.
I find listening the easiest, and I have made huge improvements here in a short time. Speaking is harder, mainly because I don’t get the chance to practice it. It takes time to construct a grammatically correct sentence. Forget writing for now.
Picking up vocabulary is the easiest part. I used to think it was the main problem, but it’s negligible compared to the problems of sentence structure, gender modifications, and, more importantly, learning the phrases in the zeitgeist.
I need a way to measure my progress easily and to make sure improvement is not stalled. I haven’t yet found a good way. What I do now is while listening to some audio I write down words I don’t recognize and then look them up later. But again, right now my main issue is with figuring out how to improve my spoken form given that I am unable to practice out in the open regularly.
I like writing. My idea right now is to find a good and simple way to exploit this as an alternative to having to find people daily to practice on. What I want is a way to quickly exercise my ability to create short and grammatically correct sentences in various tenses.
Anyway, I’ll make a few posts now and then about approaches that do and do not work for me when learning a language.
In the latest commit (befb0f3cca0c212e368497e86f030aa96355be18) I’ve reworked the `Reader` and `Writer` interfaces and added them to `Statistics.GModeling.Gibbs`. I’ve removed references to `Support` and simply parameterized them by a key type `k` and a value type `v`.
> data Reader k v = Reader
> {
> -- | Number of available indices
> size :: Int
> -- | Read the value at the given key
> , readn :: Int -> k -> v
> -- | Create a copy for writing only
> , copy :: IO (Writer k v)
> }
>
> data Writer k v = Writer
> {
> -- | Write the value at the given key
> writen :: Int -> k -> v -> IO ()
> -- | Create a read-only copy
> , readOnly :: IO (Reader k v)
> }
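As a sketch of how these two records might be wired together (a toy map-backed implementation of my own, not the repo's HMM instance; the types are repeated so the snippet stands alone):

```haskell
import           Data.IORef
import qualified Data.Map as M

-- Reader/Writer as defined above, repeated for self-containment.
data Reader k v = Reader
  { size  :: Int
  , readn :: Int -> k -> v
  , copy  :: IO (Writer k v)
  }

data Writer k v = Writer
  { writen   :: Int -> k -> v -> IO ()
  , readOnly :: IO (Reader k v)
  }

-- A pure-map-backed Reader over n indices.
fromMap :: Ord k => Int -> M.Map (Int, k) v -> Reader k v
fromMap n m = Reader
  { size  = n
  , readn = \i k -> m M.! (i, k)
  , copy  = do ref <- newIORef m
               return (toWriter n ref)
  }

-- Its mutable counterpart; readOnly snapshots the current contents.
toWriter :: Ord k => Int -> IORef (M.Map (Int, k) v) -> Writer k v
toWriter n ref = Writer
  { writen   = \i k v -> modifyIORef ref (M.insert (i, k) v)
  , readOnly = fmap (fromMap n) (readIORef ref)
  }
```

Note that `copy` snapshots the map, so writes through the `Writer` never disturb the original `Reader`.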
I’ve also simplified the type of `Indexed` and added an implementation of `Reader` and `Writer` for the HMM in `Statistics.GModeling.Models.HMM`.
As of the latest commit (133e22dc979d988706aafe52a346cee004f70ca5) the package contains
Statistics.GModeling.DSL
Statistics.GModeling.Models.HMM
Statistics.GModeling.Models.LDA
Statistics.GModeling.Models.FixedTreeLDA
Will continue building the pieces in upcoming posts.
First, forget leap years: let a year have 365 days. Note that if you have 366 people in the room you are guaranteed that some pair will share a birthday. On the other end of the spectrum, if there are only two people in the room then the probability that the two of them share a birthday is given by one minus the number of ways they cannot share a birthday divided by the number of ways we can assign them birthdays: 1 − (365 · 364)/365² = 1/365.
In general, let’s say there are k people in the room. The number of ways to assign different birthdays to each of the k people is 365 · 364 ⋯ (365 − k + 1).
The number of ways to assign any birthday to each of the k people is 365^k.
So, the probability that at least one pair out of the k will share a birthday is given by p(k) = 1 − (365 · 364 ⋯ (365 − k + 1)) / 365^k.
Graphing it below: by k = 23 the probability is already above 0.5.
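The probability is easy to compute directly; a small Haskell sketch:

```haskell
-- Probability that at least one pair among k people shares a birthday,
-- assuming 365 equally likely birthdays and no leap years.
birthday :: Int -> Double
birthday k = 1 - product [ fromIntegral (365 - i) / 365 | i <- [0 .. k - 1] ]
```

`birthday 22` is about 0.476 while `birthday 23` is about 0.507, which is where the graph crosses one half.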
This is painful. I’d rather have everything be probabilistic. I was pondering ways to do this. Let’s say that the topic vector has some fixed dimension. Define two multinomial distributions over the dimensions and define the response through this procedure.
Essentially, the idea is that the two distributions end up finding mutually exclusive dimensions, such that either one or the other has a high value, in order to produce the correct class label with high probability. I’d like to try it after I get done with the Gibbs code.
There are two kinds of data I need access to. In the case of the HMM, for example, the values of `Topic` and `Symbol` form the `Support` of some distribution, while the values of `Transition` and `Symbols` are the indices of their respective distributions that are currently active. So, consider the following signature
> type Support = Int
> -- Int -> a -> Either (a,Int) Support
For now, I’ll restrict `Support` to only integers. The second type takes some integer index and a label, and returns either the index or the support, depending on what is being asked. I’ll make this clear at the end with the HMM example.
For the library to have read access to the data I will provide this data type.
> data Reader a = Reader
> {
> size :: Int
> , read :: Int -> a -> Either (a,Int) Support
> , copy :: IO (Writer a)
> }
Field `size` tells us how many indices there are (`[0..size-1]`); `read` is the function we just saw; and `copy` creates a writable copy of the data.
> data Writer a = Writer
> {
> write :: Int -> a -> Support -> IO ()
> , readOnly :: IO (Reader a)
> }
Here `write` allows us to write a new value at some `Int` index for the label `a`. Let me take the HMM as an example again. For simplicity, let’s say we store the sequences as a list of lists.
> data HMMLabels = Alpha | Beta | Transition
> | Initial | Topic | Symbols | Symbol
>
>
> type Sequences = [[(Int,Int)]]
We can provide a reader.
> reader :: Sequences -> Reader HMMLabels
> reader ss = Reader
> {
> size = length (concat ss)
> , read = \idx -> let (i,j) = indices !! idx
>                      (topic,symbol) = ss !! i !! j
>                      (prev_topic,_) = ss !! i !! (j-1)
>                  in \name -> case name of
>                       Topic -> Right topic
>                       Symbol -> Right symbol
>                       Symbols -> Left (Topic,topic)
>                       Transition -> if j==0
>                                     then Left (Initial,0)
>                                     else Left (Topic,prev_topic)
>                       _ -> error "no data stored for this label"
> , copy = error "undefined"
> }
> where indices = concat $
>         map (\(i,s) -> [(i,j) | j <- [0..length s-1]]) (zip [0..] ss)
Note how the signature of `read` encourages caching; that is, the library can first supply only the index and then repeatedly query the resulting partial function for various names. This seems to be alright so far, but I’ll only know if this holds up when I look at managing the distributions in the next post.
> import Control.Monad (msum)
> import Data.Maybe (mapMaybe)
> import Data.List (nub,(\\))
Recapping the DSL.
> data Indexed a = Only a | a :@ [a]
> data Edge a = Indexed a :-> a
> type Network a = [Edge a]
Enumerating the names.
> names :: Eq a => Network a -> [a]
> names = nub . concatMap f
> where f (Only a :-> b) = [a,b]
> f ((p :@ _) :-> a) = [p, a]
Enumerating the children.
> children :: Eq a => Network a -> a -> [a]
> children xs a = concatMap f xs
> where f (Only p :-> c) | p == a = [c]
> f ((p :@ is) :-> c) | p == a || elem a is = [c]
> f _ = []
Enumerating the parents.
> parents :: Eq a => Network a -> a -> [a]
> parents xs a = concatMap f xs
> where f (Only p :-> c) | c == a = [p]
> f ((p :@ _) :-> c) | c == a = [p]
> f _ = []
Enumerating the observed variables.
> observed :: Eq a => Network a -> [a]
> observed n = filter (null . children n) . names $ n
Enumerating the priors.
> prior :: Eq a => Network a -> [a]
> prior n = filter (null . parents n) . names $ n
Enumerating the latent variables.
> latent :: Eq a => Network a -> [a]
> latent xs = names xs \\ (prior xs ++ observed xs)
Enumerating the index of a random variable.
> indexOf :: Eq a => Network a -> a -> Maybe [a]
> indexOf xs a = msum (map f xs)
> where f ((p :@ is) :-> _) | p == a = Just is
> f _ = Nothing
Running on the hmm example.
> data HMMLabels = Alpha | Beta | Transition
> | Initial | Topic | Symbols | Symbol
> deriving (Show,Eq)
>
> hmm :: Network HMMLabels
> hmm =
> [
> Only Alpha :-> Transition
> , Only Beta :-> Symbols
> , (Transition :@ [Initial,Topic]) :-> Topic
> , (Symbols :@ [Topic]) :-> Symbol
> ]
ghci> observed hmm
[Symbol]
ghci> prior hmm
[Alpha,Beta]
ghci> latent hmm
[Transition,Symbols,Topic]
ghci> indexOf hmm Alpha
Nothing
ghci> indexOf hmm Transition
Just [Initial,Topic]
ghci> indexOf hmm Symbols
Just [Topic]
ghci> children hmm Alpha
[Transition]
ghci> parents hmm Alpha
[]
ghci> children hmm Topic
[Topic,Symbol]
ghci> parents hmm Topic
[Transition]
Next time, I want to look at how to specify the distributions.
> data LDALabels = Alpha | Beta | Topics | Topic
>                | Doc | Symbols | Symbol
> lda :: Network LDALabels
> lda =
> [
> Only Alpha :-> Topics
> , Only Beta :-> Symbols
> , (Topics :@ [Doc]) :-> Topic
> , (Symbols :@ [Topic]) :-> Symbol
> ]
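Pulling together the DSL and the enumeration helpers from the interpreter post, we can sanity-check the LDA network (a self-contained copy so it runs on its own; the network here follows the `a :@ [a]` form of the DSL):

```haskell
import Data.List (nub, (\\))

data Indexed a = Only a | a :@ [a]
data Edge a    = Indexed a :-> a
type Network a = [Edge a]

names :: Eq a => Network a -> [a]
names = nub . concatMap f
  where f (Only a :-> b)   = [a, b]
        f ((p :@ _) :-> a) = [p, a]

children :: Eq a => Network a -> a -> [a]
children xs a = concatMap f xs
  where f (Only p :-> c)    | p == a             = [c]
        f ((p :@ is) :-> c) | p == a || elem a is = [c]
        f _ = []

parents :: Eq a => Network a -> a -> [a]
parents xs a = concatMap f xs
  where f (Only p :-> c)   | c == a = [p]
        f ((p :@ _) :-> c) | c == a = [p]
        f _ = []

observed, prior, latent :: Eq a => Network a -> [a]
observed n = filter (null . children n) (names n)
prior n    = filter (null . parents n) (names n)
latent xs  = names xs \\ (prior xs ++ observed xs)

data LDALabels = Alpha | Beta | Topics | Topic | Doc | Symbols | Symbol
  deriving (Show, Eq)

lda :: Network LDALabels
lda =
  [ Only Alpha :-> Topics
  , Only Beta :-> Symbols
  , (Topics :@ [Doc]) :-> Topic
  , (Symbols :@ [Topic]) :-> Symbol
  ]
```

Running the helpers gives `prior lda = [Alpha,Beta]`, `latent lda = [Topics,Symbols,Topic]`, and `observed lda = [Symbol]`, which is what the plate diagram for LDA says.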
Here is a topic model where, for each document, topics are arranged in the nodes of a fixed binary tree. Let’s say the tree has a fixed depth; then the distribution is parameterized by a `TopicPath` distribution (to select a leaf) and a `TopicDepth` distribution (to select a node along the path).
> data LDATreeLabels =
> Alpha1 | Alpha2 | Beta
> | TopicDepth | TopicPath | Topic | Doc
> | Symbols | Symbol
>
> ldaTree :: Network LDATreeLabels
> ldaTree =
> [
> Only Alpha1 :-> TopicDepth
> , Only Alpha2 :-> TopicPath
> , Only Beta :-> Symbols
> , (TopicPath :@ [Doc]) :-> Topic
> , (TopicDepth :@ [Doc]) :-> Topic
> , (Symbols :@ [Topic]) :-> Symbol
> ]
I think it looks pretty good so far. Let’s see how it holds up once I start interpreting the DSL.
I’ve spent a lot of time coding generative models from scratch, and it’s repetitive, painful, and error-prone. My current job has put me back in the thick of machine learning research, and I hope I’ll get to use this. The problem with coding models from scratch is keeping track of the distributions and carefully constructing the conditional probabilities for each latent variable that needs to be sampled. For some reason, I wasn’t keen on using existing libraries and I wanted to have a go at making one myself.
After many false starts, I decided that I’d first write up a DSL that can be used to describe the generative model in the way it’s usually represented by plate notation. The key to the plate notation is that it makes it easy to represent indexed distributions on top of the underlying Bayesian network constructed by drawing nodes and edges.
I’ll keep the Hidden Markov Model as a running example. First, the user defines his own type that provides names for the various random variables.
> data HMMLabels = Alpha | Beta | Transition
> | Initial | Topic | Symbols | Symbol
The library now needs to provide a way to define the generative model on top of this. As a first step, we need to be able to define the plates; that is, to tell when a name is indexed by another name. In the case of the HMM, the symbol distributions are indexed by a topic, and the topic distribution is either the initial one or is indexed by the previous topic.
Suppose the library provides the following
> data Indexed a = Only a | a :@ [a]
Then we can write
> -- Symbols :@ [Topic]
> -- Transition :@ [Initial,Topic]
And we can also define variables that stand on their own
> -- Only Alpha
> -- Only Beta
Next, is to allow the edges to be defined. Suppose we provide
> data Edge a = Indexed a :-> a
> type Network a = [Edge a]
The whole network can now be defined
> hmm :: Network HMMLabels
> hmm =
> [
> Only Alpha :-> Transition
> , Only Beta :-> Symbols
> , (Transition :@ [Initial,Topic]) :-> Topic
> , (Symbols :@ [Topic]) :-> Symbol
> ]
Next time, I’ll try to define a couple more models with this language to see if I am on the right track and then start writing an interpreter.
(Forward direction) Suppose X is a continuous random variable; then its distribution function F is continuous by definition. Hence P(X = x) = F(x) − F(x−) = 0 by the definition of continuity.
(Reverse direction) Suppose P(X = x) = 0 for all x, and let F be the corresponding distribution function. Let A_n = (x − 1/n, x] be a sequence of sets such that A_n ↓ {x}. Then P(A_n) → P({x}) = 0 because P is countably additive. Thus F(x) − F(x − 1/n) → 0, and since F is already continuous on the right we see that F is continuous at every x, which is the definition of continuity.
The exercise: show that each of the given functions is continuous on the right but is not a distribution function.
Take the first function. To show that it is continuous on the right, let and let . We need to show that there exists such that for all and within a distance of the following holds: . If , then let be the distance to the nearest point where . We see that in this case, picking a point within of will take on a value of and the difference will be less than . If , then and we can easily pick such that , meaning it will also take on a value of and satisfy . Thus, the first function is continuous on the right.
However, it is not a distribution function because it does not satisfy the requirement described in the last post that because if and the difference function evaluates to .
The second function is not a distribution function because . But it is continuous on the right because if we pick a point the function is constant on the interval and it is open on the right, meaning we can always find a delta on the right to satisfy any .
And a difference function
Then show that
Just to make clear the notation above (which confused me for a while), take the example where , then
So this is true for . I won’t do the general case.
The way we do this is to start with a distribution function F from which we derive a unique probability measure P where P((a, b]) = F(b) − F(a). Here is a problem.
Let , then verify that .
Verify that where .
The proof for the following are similar: , , and .
If is a Borel set, then so is its complement . We know that all sets in must have this form for some and (proved in book and is simple). The function belongs to and therefore . Consider
Then, since the function belongs to . But clearly does not belong to . Hence is not a Borel set and neither is .
This is a Borel set because we can intersect the set of converging sequences and the set of sequences bounded from below.
This is the set of all sequences converging to a finite limit. My initial thought was to use the result from last time where we showed that the set of sequences bounded from above or below by
But we can’t then union all the sets of the above form for each possible limit because there are uncountably many choices. It would seem that we need a way to characterize limits without picking the value of the limit. Luckily, there is such a characterization for converging sequences of real numbers; namely, Cauchy sequences. A sequence x_1, x_2, … is a Cauchy sequence if for all ε > 0, there is a positive integer N such that for all n, m ≥ N, |x_n − x_m| < ε. All sequences of real numbers converging to a finite limit are also Cauchy sequences (and every Cauchy sequence of reals converges).
The Cauchy condition is true if and only if it holds for every ε of the form 1/k with k ≥ 1, which is a countable family. We can write the set of all converging sequences as ∩_{k ≥ 1} ∪_{N ≥ 1} ∩_{n, m ≥ N} {x : |x_n − x_m| < 1/k}.
As a result, the set of all sequences converging to a finite limit is measurable.
The book asks to show that certain sets are members of . Show that the following are Borel sets.
Take the first case. Note that is not satisfied if for every , . This can only happen if there are an infinite number of coordinates whose value is . Let
The set is a Borel set since we have constructed it as a countable union of Borel sets. Therefore, (this is also a countable union) is a Borel set. A similar argument is made for the other.
The answer is no. We can show this by showing that the natural numbers form a strict subset of . Every natural number can be written as where and only a finite number of (because if an infinite number of the sum is ).
Note that since is a decomposition we can write every set in as a countable (because this is a σ-algebra) union of a subset of . This means we can encode every set in as where . First, since is countable there is a bijection between the natural numbers and , however, is not countable since we can have a countable number of .
Given a set Ω and a system 𝒜 of its subsets, we say that 𝒜 is an algebra if Ω ∈ 𝒜 and 𝒜 is closed under (finite) unions and complementation. A σ-algebra adds the requirement that it also be closed under countable unions. The pair (Ω, 𝒜) is called a measurable space.
Let 𝒜₁ and 𝒜₂ be σ-algebras of subsets of Ω. Are the following systems of sets σ-algebras: the intersection 𝒜₁ ∩ 𝒜₂ and the union 𝒜₁ ∪ 𝒜₂?
The intersection of σ-algebras is also a σ-algebra because Ω belongs to both, and a countable union of sets in the intersection is contained in both 𝒜₁ and 𝒜₂.
However, the union of σ-algebras is not always a σ-algebra. For instance, let Ω = {1, 2, 3}, 𝒜₁ = {∅, Ω, {1}, {2, 3}}, and 𝒜₂ = {∅, Ω, {2}, {1, 3}}; then their union does not contain {1} ∪ {2} = {1, 2}.
Since ,
To see that it is finitely additive, let be disjoint.
To show that it is not countably additive, consider the case where is the set of natural numbers. Then
This chapter introduces us to how we can extend the probability framework we had for finite sample spaces. The key problem we face is that in the finite case we were simply able to assign a probability p(ω) to each sample point ω and therefore get P(A) = Σ_{ω ∈ A} p(ω). But we can no longer follow this approach for an infinite sample space.
Anyway, the problem asks the following. Let Ω be the set of rational numbers in [0, 1]. Let 𝒜 be the algebra of sets where each set takes on one of these forms: (a, b), [a, b], (a, b], and [a, b), with P assigning such a set the probability b − a. Show that P is a finitely additive set function but not countably additive.
Let A and B be disjoint sets in 𝒜; then P(A ∪ B) = P(A) + P(B), and we see that P is finitely additive.
To show that P is not countably additive we need to come up with an infinite sequence of disjoint sets whose sum of probabilities is not equal to the probability of their union. This should bring back memories of converging sequences. Consider the singleton sets {r} for each rational r ∈ Ω. It is clear that the union of these sets is Ω, with P(Ω) = 1. But each singleton is contained in intervals of arbitrarily small length, so P({r}) = 0 and the probabilities sum to 0, not 1.
What it clarifies for me is the step in the EM algorithm where one introduces auxiliary variables (one for each value that the hidden variable can take on) which somehow turn out to be the conditional probability of the hidden variable given everything else. Why this turns out to be the case has always been a little fuzzy to me, and Dan’s post clarifies it greatly. The step that determines the auxiliary variables comes from equating the derivative of the log-likelihood with the derivative of the simpler function involving the auxiliary variables and solving for them. Please have a read.
Consider the reading of a book. It’s an activity that proceeds in sequence as we read one word after another from left to right. Let w_1, w_2, … be the sequence of words in a book. Let’s say that there are two actions we can take when we encounter a word w_i.
What remains to figure out is what do we mean by ‘know the word’. Let’s get to this slowly. For now, consider the simplest form of memory. Let’s say that memory is a set to which we add an unknown word when learning and then remove a known word when recalling. I’ll end this post with some code and plots.
Let’s start with a typeclass for a memory model (for learning and recalling) that we can use again later. The `learn` method updates the model with an entry, and the `recall` method returns a new model along with `True` if the given `a` was recalled.
> {-# LANGUAGE BangPatterns #-}
>
> import Data.Hashable
> import qualified Data.HashSet as HS
> import Data.Char
>
> class Mem m where
> recall :: (Eq a, Hashable a) => m a -> a -> (m a, Bool)
> learn :: (Eq a, Hashable a) => a -> m a -> m a
We create an instance for the simple model I described above.
> newtype SimpleMem a = SimpleMem (HS.HashSet a)
>
> instance Mem SimpleMem where
> recall (SimpleMem mem) a | HS.member a mem = (SimpleMem (HS.delete a mem), True)
> | otherwise = (SimpleMem mem, False)
> learn a (SimpleMem mem) = SimpleMem (HS.insert a mem)
Given a sequence of words we will now read it left to right and label each position 1 if we need to learn the word and −1 if we are recalling it.
> walk :: (Mem m, Eq a, Hashable a) => m a -> [a] -> [Int]
> walk initial = go initial
> where go _ [] = []
> go !mem (a:as) =
> case recall mem a of
> (mem', False) -> 1 : go (learn a mem') as
> (mem', True) -> (-1) : go mem' as
Finally, let’s have a simple way to read a text file. We won’t bother with stemming and all that.
> readTextFile :: String -> IO [String]
> readTextFile fp = readFile fp >>= return . words . map clean
> where clean c | isLetter c = toLower c
> | isMark c = c
> | otherwise = ' '
ghci> rs <- readTextFile "frankenstein.txt" >>= return . walk (SimpleMem HS.empty)
ghci> take 50 rs
[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,-1,1,1,1,1,1,1,1,-1,1,1,1,-1,1,1,-1,1,-1,1,-1,-1,1,1,-1,-1,-1]
ghci> take 50 $ scanl1 (+) rs
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,23,24,25,26,27,28,29,30,29,30,31,32,31,32,33,32,33,32,33,32,31,32,33,32,31,30]
For now, I leave you with a plot of the walk on frankenstein.txt and pride_and_prejudice.txt. Already note something curious
Number of words in frankenstein.txt is . Value of sum of random variables is .
Number of words in pride_and_prejudice.txt is . Value of sum of random variables is .
Let me give an example. As I mentioned, I am currently working out problems on random walks, martingales, and Markov chains. Take the basic random walk, which we construct as follows. Let ξ_1, ξ_2, … be independent random variables taking the values +1 and −1 with equal probability, and let S_n = ξ_1 + ⋯ + ξ_n.
Now, we consider the sequence of these random variables
Here are some standard ways of interpreting this sequence of random variables
Given a 2-dimensional grid, the walk traces the position after n steps (starting from the origin), each step taken by either going up one or going to the right one.
Consider a gambling game with two players. When ξ_i = +1, let's say player A gains one dollar from player B, and when ξ_i = −1, player B gains one dollar from player A. Suppose players A and B start with a dollars and b dollars respectively. S_n therefore represents the amount of money won by player A after n turns. If S_n = b then player B has lost all his money. So here we can ask questions like: what is the probability that player A or player B will be ruined (i.e., loses all his money)?
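The fair-game ruin probabilities can be checked numerically. The sketch below (step and ruinProb are my names, not from the post) evolves the exact distribution of player A's fortune with absorbing barriers at 0 and a + b; the classic answer for A's ruin probability in the fair case is b / (a + b).

```haskell
import qualified Data.Map.Strict as M

-- Evolve the exact distribution of player A's fortune one fair +-1 step,
-- with absorbing barriers at 0 (A ruined) and n = a + b (B ruined).
step :: Int -> M.Map Int Rational -> M.Map Int Rational
step n dist = M.fromListWith (+) $ concat
  [ if k == 0 || k == n
      then [(k, p)]                        -- absorbed mass stays put
      else [(k - 1, p / 2), (k + 1, p / 2)]
  | (k, p) <- M.toList dist ]

-- Probability that A is ruined within the given number of steps.
ruinProb :: Int -> Int -> Int -> Rational
ruinProb a b steps =
  M.findWithDefault 0 0 (iterate (step (a + b)) (M.singleton a 1) !! steps)
```

With a = 1 and b = 2, ruinProb 1 2 200 is already extremely close to b / (a + b) = 2/3.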
This is all well and good and the examples generalize to more general random walks but I want to consider some completely left-field examples. I can’t guarantee they will lead anywhere and may be utterly rubbish but I think it will be an interesting exercise. See you next time!
]]>Since the transition matrix P is a stochastic matrix, we can apply the ergodic theorem, which tells us that there exist π_1, …, π_N with π_j > 0 and ∑_j π_j = 1 such that p^{(n)}_{ij} → π_j as n → ∞. Thus, we see that every row of Pⁿ converges to the same stationary distribution π.
To show that all eigenvalues of P have magnitude at most 1, note that each entry of Px is a convex combination of the entries of x, so ‖Px‖_∞ ≤ ‖x‖_∞. As a result, if λ is an eigenvalue of P with eigenvector x, then |λ| ‖x‖_∞ = ‖Px‖_∞ ≤ ‖x‖_∞. Therefore, all eigenvalues must satisfy |λ| ≤ 1.
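The convergence promised by the ergodic theorem is easy to watch numerically. Below is a sketch (matMul, powMat, and the 2×2 example matrix are mine): powers of a stochastic matrix approach a matrix whose rows are all equal to the stationary distribution.

```haskell
import Data.List (transpose)

type Mat = [[Double]]

-- Naive matrix product over lists of rows.
matMul :: Mat -> Mat -> Mat
matMul a b = [ [ sum (zipWith (*) row col) | col <- transpose b ] | row <- a ]

-- n-th power of a matrix (n >= 1).
powMat :: Int -> Mat -> Mat
powMat n m = iterate (matMul m) m !! (n - 1)
```

For p = [[0.9,0.1],[0.2,0.8]], both rows of powMat 200 p agree to many decimal places with the stationary distribution (2/3, 1/3).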
]]>Let (ξ_n) be a Markov chain with values in a state space X and f a function on X. Will the sequence (f(ξ_n)) form a Markov chain? Will the reversed sequence (ξ_n, ξ_{n−1}, …, ξ_1) form a Markov chain?
For the first part, the answer is yes because
The reversed sequence will also form a Markov chain
Proceed as follows.
As a quick note as to why ,
where are given functions, is a martingale.
Once again, we let the sequence of decompositions be . Then is -measurable because (1) is -measurable, (2) is -measurable, and (3) is -measurable if .
Next, we show that
is a martingale.
To show this we need to show that (1) is -measurable. This is clear because takes on a single value (the expectation) conditioned on each . Next, we need to show that (2) .
In terms of balanced parentheses this means we have extra open parentheses to make use of. Let’s say and then
There are ways to arrange votes for each candidate
For each of the valid ways we can insert an extra open parenthesis in possible places. After this, we can insert the next extra open parenthesis in possible places. So, we now have arrangements.
The last extra parenthesis, we know, has to be placed at the beginning. But there are choices for the first parenthesis. So we now have possible arrangements.
Finally, since the extra parentheses are also indistinguishable we have to divide by .
Thus the number of ways to arrange a votes for candidate A and b votes for candidate B, where candidate A always has the higher number of votes, is (a − b)/(a + b) · C(a + b, b).
The probability that candidate A always has a higher number of votes than candidate B, given that they end up with a and b votes respectively, is (a − b)/(a + b).
We have in fact arrived at the solution without the use of martingales. Next time, let’s see why this example is used in the chapter on martingales.
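The count can also be sanity-checked by brute force (bruteValid is a helper of my own): enumerate every interleaving of the votes and keep those where A's running tally stays strictly positive.

```haskell
-- Count, by exhaustive enumeration, vote sequences (+1 for A, -1 for B)
-- in which A's running tally is strictly positive at every prefix.
bruteValid :: Int -> Int -> Int
bruteValid a b = length [ vs | vs <- seqs a b, ahead 0 vs ]
  where
    seqs 0 0 = [[]]
    seqs x y = [ 1 : vs | x > 0, vs <- seqs (x - 1) y ]
            ++ [ (-1) : vs | y > 0, vs <- seqs x (y - 1) ]
    ahead _ [] = True
    ahead s (v:vs) = let s' = s + v in s' > 0 && ahead s' vs
```

For example, bruteValid 10 4 gives 429, in agreement with (10 − 4)/(10 + 4) · C(14, 4) = 429.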
]]>There is a reason why I want to do this: when I was thinking of a solution it didn’t strike me at all that the answer had to do with Catalan numbers, even after I realized that counting the number of balanced parentheses is an equivalent problem. Rustiness annoys me. So, this time I want to come up with a proof I’ll remember.
Suppose candidate A ends up with only one vote more than candidate B. We know that the first vote has to be for candidate A. That leaves us with n votes for each candidate that we have to arrange so that A always stays on top. You will note that this is equivalent to arranging n pairs of parentheses so that they remain balanced.
Below is a procedure for generating all sequences of balanced parentheses.
> import Data.List
>
> gen_valid :: Int -> [String]
> gen_valid n = loop n n 0
> where
> loop 0 0 _ = [""]
> loop 0 b _ = [')' : s | s <- loop 0 (b-1) 0]
> loop a b k = ['(' : s | s <- loop (a-1) b (k+1)] ++
> if k > 0 then [')' : s | s <- loop a (b-1) (k-1)] else []
ghci> gen_valid 1
["()"]
ghci> gen_valid 2
["(())","()()"]
ghci> gen_valid 3
["((()))","(()())","(())()","()(())","()()()"]
ghci> gen_valid 4
["(((())))","((()()))","((())())","((()))()","(()(()))","(()()())","(()())()","(())(())","(())()()","()((()))","()(()())","()(())()","()()(())","()()()()"]
We also know that the total number of possible arrangements (valid and invalid) is given by C(2n, n), since an arrangement is just a choice of which n of the 2n positions hold open parentheses. The Catalan number Cₙ = C(2n, n)/(n + 1) gives the number of valid arrangements as a fraction of all possible arrangements.
One way to interpret this fraction is to say that for every valid arrangement there are n corresponding invalid arrangements. Can we come up with a way to transform a valid arrangement into n unique invalid arrangements?
I suspect that it should be possible, given that the invalid arrangements might have to do with inverting each of the parentheses in the sequence. For example, we can transform () to )( and ()() to these two: )(() and ())(. What happens when we have nested parentheses? What does (()) become? Note that flipping the internal parentheses alone is bad because we get ()(), which is valid! It would seem that we should flip the parent before its children: that way we get these two: )()( and ))((. So far so good. I now code the general procedure and check that it generates all arrangements from just the valid ones.
The following function extracts a top level balanced string and returns the rest.
> split :: String -> Maybe (String,String)
> split [] = Nothing
> split ss = Just $ splitAt (len+1) ss
> where len = length . takeWhile (>0) . tail . scanl (\x c -> if c=='(' then x+1 else x-1) 0 $ ss
ghci> split "(())()()"
Just ("(())","()()")
This function returns all top level balanced strings.
> splits :: String -> [String]
> splits = unfoldr split
ghci> splits "(())()()"
["(())","()","()"]
The following takes a valid sequence and generates invalid sequences.
> validToInvalids :: String -> [String]
> validToInvalids str = concat $ map (\i -> modAt i lst) [0..length lst-1]
> where lst = splits str
> change (_:xs) = ')' : init xs ++ "("
> modAt i xs = let (lhs,ss:rhs) = splitAt i xs
> in -- this flips the outer and leaves the inner the same
> concat (lhs ++ [change ss] ++ rhs) :
> -- this recurses into the inner and wraps with a flipped outer
> map (\x -> concat $ lhs ++ [")" ++ x ++ "("] ++ rhs) (validToInvalids (init . tail $ ss))
>
> choose :: Int -> Int -> Int
> choose n k = fact n `div` fact k `div` fact (n-k)
> where fact a = product [2..a]
Let’s check that all the invalid sequences generated are unique and that, together with the valid ones, they sum up to all possible arrangements.
ghci> validToInvalids "()"
[")("]
ghci> validToInvalids "()()"
[")(()","())("]
ghci> validToInvalids "(())"
[")()(","))(("]
ghci> choose 6 3 == (length . nub . concat . map (\x -> x : validToInvalids x) $ gen_valid 3)
True
ghci> choose 8 4 == (length . nub . concat . map (\x -> x : validToInvalids x) $ gen_valid 4)
True
ghci> choose 10 5 == (length . nub . concat . map (\x -> x : validToInvalids x) $ gen_valid 5)
True
]]>Let’s start with this ballot problem. Let ξ_1, ξ_2, … be a sequence of independently and identically distributed Bernoulli random variables taking values ±1. Let’s say ξ_i = +1 represents a vote for candidate A and ξ_i = −1 represents a vote for candidate B. Let S_n = ξ_1 + ξ_2 + ⋯ + ξ_n. Suppose a > b, candidate A receives a total of a votes, and candidate B receives a total of b votes; compute the probability that candidate A was always ahead of candidate B, that is, P(S_1 > 0, …, S_{a+b} > 0 | S_{a+b} = a − b).
Let’s try to attack this combinatorially. The total number of assignments is given by C(a + b, b), because we can place the b votes for B in any b of the a + b positions.
> choose :: Int -> Int -> Double
> choose n k = fact n / fact k / fact (n-k)
> where fact a = product [2..fromIntegral a]
ghci> choose (10+4) 4
1001.0
Now, the number of sequences in which candidate A always has a higher number of votes is given by the following recursion.
> valid :: (Int,Int) -> Double
> valid (_,0) = 1
> valid (a,b) | a-b == 1 = valid (a,b-1)
> | otherwise = valid (a-1,b) + valid (a,b-1)
ghci> valid (10,4)
429.0
ghci> valid (10,4) / choose (10+4) 4 == (10-4) / (10+4)
True
We can easily speed up valid using memoization. A closed form solution should not be hard to come by. But next time let’s see how this chapter approaches this problem and how martingales play a part.
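Here is the closed form alluded to above, as a sketch (validClosed is my name for it): Bertrand's ballot theorem gives (a − b)/(a + b) · C(a + b, b), which we can compute exactly over Integer.

```haskell
-- Closed form for the count computed by valid, via the ballot theorem:
-- (a - b) / (a + b) * C(a+b, b), in exact integer arithmetic.
validClosed :: (Integer, Integer) -> Integer
validClosed (a, b) = (a - b) * binom (a + b) b `div` (a + b)
  where
    binom n k = product [2 .. n] `div` product [2 .. k] `div` product [2 .. n - k]
```

For instance, validClosed (10, 4) reproduces the 429 from the recursion above.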
]]>Let ξ_1, ξ_2, … be independent Bernoulli random variables with P(ξ_i = +1) = p and P(ξ_i = −1) = q = 1 − p, and let S_n = ξ_1 + ⋯ + ξ_n. If p = q = 1/2, show that the sequence (S_n) is a martingale.
Hence, the sequence of partial sums (S_n) is a martingale. Show that the sequence (S_n² − n) is also a martingale.
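As a quick numeric check of the fair case (allWalks and expect are helpers I'm introducing, and p = q = 1/2 is an assumption of mine): averaging over all 2ⁿ sign sequences gives E S_n = 0 and E S_n² = n, consistent with S_n and S_n² − n both being martingales started at 0.

```haskell
-- All 2^n fair coin-flip sign sequences, via the list monad.
allWalks :: Int -> [[Int]]
allWalks n = mapM (const [1, -1]) [1 .. n]

-- Expectation of f(S_n) under the uniform measure on sign sequences.
expect :: (Int -> Rational) -> Int -> Rational
expect f n = sum [ f (sum xs) | xs <- allWalks n ] / 2 ^ n
```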
Suppose that 𝒟_1 and 𝒟_2 are two decompositions of the sample space, where 𝒟_2 is finer than 𝒟_1. Finer means that every set in 𝒟_1 is a union of sets in 𝒟_2.
Let ξ be a random variable. First, recall the expectation of a random variable with respect to a decomposition 𝒟 = {D_1, …, D_k}: E(ξ | 𝒟) = ∑_i E(ξ | D_i) I_{D_i}.
Note the special case E(ξ | 𝒟) = ξ (i.e., when ξ is 𝒟-measurable).
Next, recall the generalized total probability formula E(ξ) = E(E(ξ | 𝒟)).
Suppose we took a conditional expectation instead
This gets simplified if 𝒟_2 is a finer decomposition than 𝒟_1, because each set of 𝒟_1 is now decomposed by sets of 𝒟_2
Therefore if
And in general if
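The discussion above appears to be heading toward the tower property E(E(ξ | 𝒟_2) | 𝒟_1) = E(ξ | 𝒟_1) for a finer 𝒟_2. Here is a tiny numeric check on a four-point uniform sample space (condExp and the particular partitions are mine, a sketch rather than anything from the text).

```haskell
import Data.List (find)
import Data.Maybe (fromJust)

-- E(f | D) on a finite uniform sample space: on each block of the
-- decomposition, replace f by its average over that block.
condExp :: [[Int]] -> (Int -> Rational) -> (Int -> Rational)
condExp blocks f w = sum (map f b) / fromIntegral (length b)
  where b = fromJust (find (w `elem`) blocks)

coarse, fine :: [[Int]]
coarse = [[1, 2, 3], [4]]   -- the coarser decomposition
fine   = [[1], [2, 3], [4]] -- finer: each coarse block is a union of fine blocks

g :: Int -> Rational
g w = fromIntegral (w * w)
```

Conditioning first on fine and then on coarse agrees with conditioning on coarse directly, at every sample point.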
]]>> add1 :: Integral a => a -> a -> a -> a
> add1 m x y = (x+y) `mod` m
>
> mult1 :: Integral a => a -> a -> a -> a
> mult1 m x y = (x*y) `mod` m
But this is error prone and cumbersome, because someone using these two functions might accidentally write something like
> example :: Integral a => a -> a -> a -> a
> example m x y = (((x+y) `mod` x) * y) `mod` m
where the first mod is by x when they actually intended to mod by m both times. You could add a newtype wrapper around m to fix this, but that still doesn’t stop the user from using two different moduli in example.
The paper suggests many different ways of solving this issue, and one of them is to use a reader monad and write add as add :: (Integral a, MonadReader a m) => a -> a -> m a, which would certainly thread the same modulus through all operations. But it forces us to write monadic code when it is unnecessary.
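For concreteness, here is a sketch of what that monadic version might look like (addR, multR, and exampleR are my names; I use Control.Monad.Trans.Reader from the transformers package that ships with GHC rather than mtl's MonadReader): the modulus is threaded automatically, but every expression must now be written monadically.

```haskell
import Control.Monad.Trans.Reader (Reader, ask, runReader)

-- The modulus lives in the Reader environment.
addR, multR :: Integral a => a -> a -> Reader a a
addR  x y = do m <- ask; return ((x + y) `mod` m)
multR x y = do m <- ask; return ((x * y) `mod` m)

-- Correct by construction, but noticeably more ceremony than (x + y) * y.
exampleR :: Integral a => a -> a -> Reader a a
exampleR x y = do
  s <- addR x y
  multR s y
```

For example, runReader (exampleR 10 3) 7 evaluates to 4.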
What I realized was that we can still use the reader structure, but without the monad instance, if we roll our own type that carries the reader state.
> newtype M a = M (a -> a)
>
> withModulus :: Integral a => a -> M a -> a
> withModulus m (M f) = f m
>
> instance Integral a => Num (M a) where
> (M f) + (M g) = M $ \s -> (f s + g s) `mod` s
> (M f) - (M g) = M $ \s -> (f s - g s) `mod` s
> (M f) * (M g) = M $ \s -> (f s * g s) `mod` s
> negate (M f) = M $ \s -> (- f s) `mod` s
> abs _ = error "Modular numbers are not signed"
> signum _ = error "Modular numbers are not signed"
> fromInteger n = M $ \s -> fromIntegral n `mod` s
This is very convenient.
ghci> withModulus 7 $ (10+3)*(3-4) + 8
2
This solution seems as convenient and safe as what the paper does for this particular example. Of course, the paper is solving a much more general problem but this struck me as a good solution for modular arithmetic.
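One pleasant payoff (the exponentiation example is mine; the instance is restated only so this block runs standalone): because M a is a Num, Prelude’s (^) works unchanged, and every intermediate product is reduced modulo s, so large powers stay cheap.

```haskell
newtype M a = M (a -> a)

withModulus :: Integral a => a -> M a -> a
withModulus m (M f) = f m

instance Integral a => Num (M a) where
  M f + M g     = M $ \s -> (f s + g s) `mod` s
  M f - M g     = M $ \s -> (f s - g s) `mod` s
  M f * M g     = M $ \s -> (f s * g s) `mod` s
  negate (M f)  = M $ \s -> negate (f s) `mod` s
  abs _         = error "Modular numbers are not signed"
  signum _      = error "Modular numbers are not signed"
  fromInteger n = M $ \s -> fromInteger n `mod` s

-- (^) is built from (*) and fromInteger, so each squaring step stays reduced.
powDemo :: Integer
powDemo = withModulus 1000000007 (2 ^ 50)
```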
]]>