Network: Blog | Home



Machine Learning Summer School Video’s Online

For the people who missed the Machine Learning Summer School here in Cambridge this year, the video’s are now available at videolectures.net! My personal favorites: David MacKay’s Information Theory, David Blei’s Topic Models, Iain Murray’s Markov Chain Monte Carlo and Tom Minka’s Approximate Inference. Enjoy …

Posted by Jurgen Van Gael 13:24 0 comments  



Statistics Fail

From (probably) the dumbest man on earth [link].

Posted by Jurgen Van Gael 09:23 1 comments  



The Elements of Statistical Learning

The Elements of Statistical Learning is an absolute classic for anyone wanting to do statistics/machine learning/data mining. I read that the second edition was out and debating whether I should spend the money on this new edition. Via John Cook I learned that the book is out on pdf (from their website). DOUBLE WIN: a) I’ve already paid once and get the upgrade for free, b) I know have a way to electronically search the book.

I also found out today that Koller and Friedman have just released their much anticipated book Probabilistic Graphical Models from MIT press. At a lengthy 1208 pages, this should provide enough reading for a few nights!

Posted by Jurgen Van Gael 22:09 0 comments  



Is Bayesian Statistics Easier?

Many people have debated the merits of Bayesian statistics versus Frequentist statistics: very recently there was this discussion in Bayesian Analysis by Gelman. It’s a fascinating discussion and I’m just learning to ins and outs of different arguments myself.

First, I want to point people to this (short) but very insightful article by Tony O’Hagan. There are many ways of looking at the difference between Bayesian and Frequentist methods and this article makes one particular (a bit philosophical) viewpoint very clear.

I’m certainly not knowledgeable enough to take a stand in this debate but one thing I do have a strong opinion about: Bayesian methods are much easier to understand! In most statistics 101 classes we are taught a bit of point estimation (e.g. estimating a mean, a variance) which I think is understandable, but then people get into hypothesis testing and suddenly have to reason about following statements: (quoting Wikipedia here): “Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?”. I certainly find this a really complicated notion to reason about. The Bayesian alternative would say something like: “What is the probability of generating the observed value for this particular model”. It is the same kind of statements as “What is the probability of generating a five from a fair dice”. I honestly think statistics would be a lot more understandable (and hence I think enjoyable) if we were teaching Bayesian statistics before getting into more advanced frequentist methods. I really wonder whether there are schools out there where the undergrad stats curriculum is based on the Bayesian approach?

EDIT - I fixed the link to O'Hagan's article.

Posted by Jurgen Van Gael 17:26 6 comments  



Dirichlet Distributions and Entropy

In between all the Netflix excitement I managed to read a paper that was mentioned by David Blei during his machine learning summer school talk: Entropy and Inference, Revisited by Ilya Nemenman, Fariel Shafee and William Bialek from NIPS 2002. This paper discusses some issues that arise when learning Dirichlet distributions. Since Dirichlet distributions are so common, I think these results should be more widely known.

[Given that I’ve been reading a lot of natural language processing papers where Dirichlet distributions are quite common, I’m surprised I haven’t run into this work before.]

First a little bit of introduction on Dirichlet distributions. The Dirichlet distribution is a “distribution over distribution”; in other words: a draw from a Dirichlet distribution is a vector of positive real numbers that sum up to one. The Dirichlet distribution is parameterized by a vector of positive real numbers which captures the mean and variance of the Dirichlet distribution. It is often very natural to work with a slightly constrained Dirichlet distribution called the symmetric Dirichlet distribution: in this case the vector of parameters to the Dirichlet are all the same number. This implies that the mean of the Dirichlet is a uniform distribution and the variance is captured by the magnitude for the vector of parameters. Let us denote with beta the parameter of the symmetric Dirichlet. Then, when \beta is small, samples from the Dirichlet will have high variance while with beta large, samples from the Dirichlet will have small variance. The plot below illustrates this idea for a Dirichlet with 1000 dimensions: the top plot has very small beta and hence a draw from this Dirichlet has only a few nonzero entries (hence high variance) while the Dirichlet with beta = 1, all entries of the sample have roughly the same magnitude (about 0.001).

image

Another way to approach the effect of beta is to look at the entropy of a sample from the Dirichlet distribution, denoted by S in the images below. The entropy of a Dirichlet draw is high when beta is large. More in particular, it is upper bounded by ln(D) where D is the dimensionality of the Dirichlet when beta approaches infinity and the Dirichlet distribution will approach a singular distribution at completely uniform discrete distribution. When beta approaches 0, a draw from a Dirichlet distribution approaches a delta peak on a random entry which is a distribution with entropy 1. The key problem the authors want to address is that when learning an unknown distribution, if we use a Dirichlet prior, beta pretty much fixes the allowed shapes while we might not have a good reason a priori to believe that what we want to learn is going to look like either one of these distributions.

The way the authors try to give insight into this problem is by computing the entropy of a random draw of a Dirichlet distribution. In equations, if we denote with S the entropy, it will be a random variable with distribution

image

Computing the full distribution is hard but the authors give a method to compute its mean and variance. The following picture shows the mean and variance of the entropy for draws of a Dirichlet distributions. A bit of notation: K is the dimensionality of the Dirichlet distribution, Xi is the mean entropy (as a function of beta) and sigma is the variance of the entropy as a function of beta.

image

As you can see from this plot, the entropy of Dirichlet draws is extremely peaked for even moderately large K. The authors give a detailed analysis of what this implies but the main take-away message is this: as you change beta, the entropy of the implied Dirichlet draws varies smoothly, however, because the variance of the entropy is very peaked, the a priori choice of beta almost completely fixes the entropy.

This is problematic as it means that unless our distribution is sampled almost completely, the estimate of the entropy is dominated by the choice of our prior beta. So how can we fix this? The authors suggest a scheme which I don’t completely understand but boils down to a mixture of Dirichlet distributions by specifying a prior distribution on beta as well.

This mixture idea ties in with something we did in our EMNLP 09 paper: when we were training our part-of-speech tagger we had to choose a prior for the distribution which specifies what words are generated for a particular part-of-speech tag. We know that we have part-of-speech tag classes that generate very few words (e.g. determiners, attributes, …) and a few classes that generate a lot of words (e.g. nouns, adjectives, …). At first, we chose a simple Dirichlet distribution (with fixed beta) as our prior and although the results were reasonable, we did run into the effect explained above: if we set beta to be large, we got very few states in the iHMM where each state outputs a lot of words. This is good to capture nouns and verbs but not good for other classes. Conversely, when we chose a small beta we got a lot of states in the iHMM each generating only a few words. Our next idea was to put a Gamma prior on beta; this helped a bit, but still assumed that there is only one value of beta (or one type of entropy distribution) which we want to learn. This is again unreasonable in the context of natural language. Finally, we chose to put a Dirichlet process prior on beta (with a Gamma base measure). This essentially allows different states in the iHMM to have different beta’s (but we only expect to see a few discrete beta’s).

“Entropy and Inference, Revisited” is one of those papers with a lot of intuitive explanations; hopefully it helps you make the right choice for priors in your next paper as well.

Posted by Jurgen Van Gael 21:20 4 comments  



Stats Clinic

We are starting a new machine learning related initiative in Cambridge: the walk-in Stats Clinic. The goal of our consulting service is to help applied researchers with their questions in statistics or machine learning. Starting October 21 from 16:30-18:00, we will be in meeting room 5 at the Centre for Mathematical Sciences every other week.

Posted by Jurgen Van Gael 11:36 0 comments  



May All Your Wishes Come True

I am trying to catch up with a stack of 109 papers which are on my “To Read” list. Today I read a very enjoyable paper called May all your wishes come true: A study of wishes and how to recognize them by some of my former UW-Madison friends and colleagues Andrew B. Goldberg, Nathanael Fillmore, David Andrzejewski, Zhiting Xu, Bryan Gibson and Xiaojin Zhu.

The story is as follows: in 2007, the Times Square Alliance asked people to send their wished to a Virtual Wishing Wall website with the promise to print these wishes on confetti and drop it from the sky at midnight December 31, 2007. The UW-Madison people got a hold of this corpus of wishes and did some statistical analysis.

A few highlights:

  • wishes follow Zipf’s law
  • topic wise, US people wish more for religion
  • topic wise, non-US people wish more for love, peace and travel
  • scope wise, US people wish to their country and family more frequently
  • scope wise, non-US people wish to the world more frequently

After the statistical analysis, the authors wanted to come up with a classifier that could detect wishes. Their approach consists of extracting wish templates in an unsupervised fashion. The idea is quite clever (and I believe more widely applicable): people often wished for “world peace” and “I wish for world peace”, or “health and happiness” and “I wish for health and happiness”. In other words, there is quite a bit of redundancy in how the wish is expressed. By matching similar wishes, the authors extract the “redundant” part and assume it is a template.

Matching wishes can be done as follows: you create a bipartite graph and loop over all wishes, if a wish contains another wish (“I wish for world peace” contains “world peace”), you create a content node on the left with the content part (e.g. “world peace”), a template part on the right with the redundancy (e.g. “I wish for”) and you connect these with a directed edge from the content to template node. When you’re done you loop over all template nodes and see whether any of them contain content words. If so, you introduce an edge from the template node to the content node. Finally, you score template nodes by their indegree – outdegree. I will refer you to table 4 to see that the results by this relatively simple algorithm are quite good.

The reason I believe this is more widely applicable is that we can imagine using this technique for extracting templates from web queries. People often query for facts: e.g. kids of Michael Jackson. Using the algorithm from this paper, we might be able to extract the most common templates which in turn would give us a good indication of what kind of answers search engines should be trying to answer. Just an idea …

Posted by Jurgen Van Gael 17:32 1 comments  



AI Bugs

I thought this was a very funny set of AI bloopers …

Posted by Jurgen Van Gael 22:33 0 comments  



ACL 09 & EMNLP 09

ACL-IJCNLP 2009 and EMNLP 2009 have just finished here in Singapore. As an outsider to the field I had a hard time following many talks but nonetheless enjoyed the conference. The highlight for me was the talk by Richard Sproat who wondered whether there exists a statistical test to check if a series of symbol sequences is actually a language? If this test would exist, we could use it to decide whether the set of symbols known as the Indus Valey Script is actually a language. Very fascinating stuff: I immediately bought “Lost Languages” by Andrew Robinson to learn more about the history of deciphering dead languages.

The paper had some very cool papers; the first one I really liked was Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling by Daichi Mochihashi et al. They build on the work of Yee Whye Teh and Sharon Goldwater who showed that Kneser-Ney language modelling is really an approximate version of a hierarchical Pitman-Yor based language model (HPYLM). The HPYLM starts from a unigram model over a fixed dictionary and hence doesn’t accommodate for out of vocabulary words. Daichi et al extended the HPYLM so that the base distribution is now an character infinity-gram that is itself an HPYLM (over characters). They call this model the nested HPYLM or NPYLM. There is no need for a vocabulary of words in the NPYLM, rather, the HPYLM base distribution is a distribution over arbitrary long strings. In addition the model will perform automatic word segmentation. The results are really promising: from their paper, consider the following unsegmented English text

lastly,shepicturedtoherselfhowthissamelittlesisterofhersw
ould,intheafter-time,beherselfagrownwoman;andhowshe
wouldkeep,throughallherriperyears,thesimpleandlovingh
eartofherchildhood:andhowshewouldgatheraboutherothe
rlittlechildren,andmaketheireyesbrightandeagerwithmany
astrangetale,perhapsevenwiththedreamofwonderlandoo
ngago:andhowshewouldfeelwithalltheirsimplesorrows,an
dndapleasureinalltheirsimplejoys,rememberingherownc
hild-life,andthehappysummerdays. […]

When the NPYLM is trained on this data, the following is found

last ly , she pictured to herself how this same little sis-
ter of her s would , inthe after - time , be herself agrown woman ; and how she would keep , through allher ripery ears , the simple and loving heart of her child hood : and how she would gather about her other little children ,and make theireyes bright and eager with many a strange tale , perhaps even with the dream of wonderland of longago : and how she would feel with all their simple sorrow s , and a pleasure in all their simple joys , remember ing her own child - life , and thehappy summerday s .

 
A note on the implementation of Hierarchical Dirichlet Processes by Phil Blunsom et al. In this paper the authors discuss how previous approximate inference schemes to the HDP collapsed Gibbs sampler can turn out to be quite bad. In this paper they propose a more efficient and exact algorithm for a collapsed Gibbs sampler for the HDP.
 
A few other papers I really enjoyed:

Posted by Jurgen Van Gael 23:06 2 comments  



To R or not to R

At the moment 90% of the research in our lab is done using Matlab. Some of us have tried Python but were dissapointed by how hard it was to get decent (read: native) numerical routines on par with Matlab. I've been developing on the .NET platform with a little bit of Java development (for Amazon's Elastic Map-Reduce) on the side. We've had a few discussions in our lab recently about switching to R for our machine learning research. A decision hasn't been made and we're wondering what the larger machine learning community thinks about this issue. Here's a list of pros and cons that I've come up with

Pros:

  • great support for statistical functions
  • great support from the statistical community
  • good (if not better) plotting that Matlab
  • reasonable fast (check out Dave's benchmarking)
  • free (very important!) if we want to run a job on 100 machines (e.g. in the cloud), I believe currently you need a matlab licence for each one
Cons:
  • pre-historic interface, doesn't compare to modern IDE's
  • debugging doesn't seem up to Matlab standards
  • not much support in machine learning community (?)
  • learning curve
The jury is still out on this one ... I really wonder how many machine learners out there already use R?

Posted by Jurgen Van Gael 04:14 20 comments