Tuesday, August 31, 2010

Finding Phrases - Two Statistical Approaches

In my previous post, I described how I used the Binomial Distribution to compute the expected probability of a word n-gram, and compared it against the actual probability to figure out whether the n-gram is a phrase. I first learned about using a Binomial Distribution for phrase detection from Dr Konchady's Building Search Applications (BSA) book. The book describes two approaches, but the explanation is a bit sketchy, and I could not understand the methods fully at first, so I developed my own.

Over the past couple of weeks, I have been trying to figure out these two methods, and I think I now understand them well enough to be comfortable using them. For the benefit of others who are as perplexed as I was by the mental leaps in the book's explanation, I have included my notes in this post. Hopefully they will help in understanding the code.

As I mentioned in my update to last week's post, I now have a pluggable interface for filtering phrases, called IPhraseFilter. There are currently four implementations available in the repository, two from last week and two from this week. The LikelyPhraseMapper calls the desired implementation. To keep the post down to a reasonable length, I'll just describe the phrase filter implementations - if you want to see how they fit in, you can refer to my previous post for the Mapper and Reducer classes, or download the whole thing from the jtmt repository.

One thing I liked about both the approaches described below is that there is no need to choose an arbitrary cutoff to ensure that the results are good. In both cases the cutoff is 0, ie if the determinant is negative the bigram is dropped, and if it is positive it is retained.

Likelihood Ratio Approach

This approach sets up two hypotheses and compares their likelihoods. The hypotheses are as follows:

For any word bigram {w1,w2} in the document corpus:

  H0 = w2 and w1 are independent
  H1 = w2 is dependent on w1

We first calculate the actual probabilities for the independent and dependent cases from the occurrence counts of the words in the corpus, like so:

    | P(w2|w1)       | P(w2|¬w1)
----+----------------+-----------------------------
 H0 | p00 = n2 / N   | p01 = n2 / N
 H1 | p10 = n12 / n1 | p11 = (n2 - n12) / (N - n1)

where:
  n1  = number of occurrences of w1
  n2  = number of occurrences of w2
  n12 = number of occurrences of {w1,w2}
  N   = number of words in the corpus

The derivations for the pij values in the table above are shown below.

Under the null hypothesis H0:

  p00 = P(w2|w1)
      = P(w2 ∩ w1) / P(w1)       ... by definition of conditional probability
      = P(w2) * P(w1) / P(w1)    ... since w1 and w2 are independent under H0
      = P(w2)
      = n2 / N

  p01 = P(w2|¬w1)
      = P(w2|w1)                 ... under H0, the probability of w2 following w1
                                     is the same as that of w2 following ¬w1
      = n2 / N

Under the alternative hypothesis H1:

  p10 = P(w2|w1)
      = P(w2 ∩ w1) / P(w1)       ... by definition of conditional probability
      = (n12 / N) / (n1 / N)
      = n12 / n1

  p11 = P(w2|¬w1)
      = P(w2 ∩ ¬w1) / P(¬w1)     ... by definition of conditional probability
      = ((n2 - n12) / N) / ((N - n1) / N)
      = (n2 - n12) / (N - n1)

We then use the Binomial distribution to calculate the expected probabilities of our observed counts under each hypothesis:

    | E(w2|w1)        | E(w2|¬w1)
----+-----------------+--------------------------
 H0 | b(n12; n2, p00) | b((n12-n2); (N-n1), p01)
 H1 | b(n12; n2, p10) | b((n12-n2); (N-n1), p11)
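
Here b(k; n, p) denotes the Binomial probability of observing k successes in n trials, each with success probability p; in LaTeX notation:

b(k;\,n,\,p) = \binom{n}{k}\, p^{k}\, (1-p)^{n-k}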

We are now able to calculate the likelihoods of hypotheses H0 and H1. The likelihood of a hypothesis is defined here as the ratio of the probability of a true positive to the probability of a false positive under that hypothesis.

L(H0) = E(w2|w1) / E(w2|¬w1)              ... by definition of likelihood

             b(n12; n2, p00)
      = --------------------------
         b((n12-n2); (N-n1), p01)

              C(n2,n12) * p00^n12 * (1-p00)^(n2-n12)
      = --------------------------------------------------------
         C(N-n1,n12-n2) * p01^(n12-n2) * (1-p01)^(N-n1-n12+n2)

L(H1) = E(w2|w1) / E(w2|¬w1)              ... by definition of likelihood

             b(n12; n2, p10)
      = --------------------------
         b((n12-n2); (N-n1), p11)

              C(n2,n12) * p10^n12 * (1-p10)^(n2-n12)
      = --------------------------------------------------------
         C(N-n1,n12-n2) * p11^(n12-n2) * (1-p11)^(N-n1-n12+n2)

The ratio of the two likelihoods tells us how much more likely the dependence assumption is than the independence assumption. Since the Binomial coefficients are identical in the numerator and denominator, they cancel out.

            p10^n12 * (1-p10)^(n2-n12) * p01^(n12-n2) * (1-p01)^(N-n1-n12+n2)
R(H1/H0) = -------------------------------------------------------------------
            p00^n12 * (1-p00)^(n2-n12) * p11^(n12-n2) * (1-p11)^(N-n1-n12+n2)
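
Taking logarithms turns the products into sums, which is the form the filter below actually computes; in LaTeX notation:

\log R(H_1/H_0) = n_{12}(\log p_{10} - \log p_{00})
                + (n_2 - n_{12})(\log(1-p_{10}) - \log(1-p_{00}))
                + (n_{12} - n_2)(\log p_{01} - \log p_{11})
                + (N - n_1 - n_{12} + n_2)(\log(1-p_{01}) - \log(1-p_{11}))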

LikelihoodRatioPhraseFilter

The code for the LikelihoodRatioPhraseFilter is shown below. The calculations are based on the formula for R(H1/H0) developed in the last section. Rather than return a yes/no answer, the filter returns the logarithm of R, so the client should check whether the returned value is greater than 0, ie whether the dependence hypothesis H1 (that the words are dependent on each other) is more likely than the independence hypothesis H0 (that the words are independent).

// Source: src/main/java/net/sf/jtmt/phrase/filters/LikelihoodRatioPhraseFilter.java
package net.sf.jtmt.phrase.filters;

/**
 * Uses the ratio of the likelihood of a dependence hypothesis to the
 * likelihood of the independence hypothesis for phrase bigrams to
 * filter out unlikely phrases. The log of the ratio is returned as
 * the determinant.
 */
public class LikelihoodRatioPhraseFilter implements IPhraseFilter {

  private long n;    // number of words in the corpus
  private long n1;   // number of occurrences of leading word in corpus
  private long n2;   // number of occurrences of trailing word in corpus
  private long n12;  // number of occurrences of phrase in corpus

  public LikelihoodRatioPhraseFilter(long n, long n1, long n2, long n12) {
    this.n = n;
    this.n1 = n1;
    this.n2 = n2;
    this.n12 = n12;
  }

  /**
   * Returns the log of the likelihood ratio between the dependence hypothesis
   * and the independence hypothesis. If the value is positive, then the
   * dependence hypothesis is more likely than the independence hypothesis.
   */
  @Override
  public double getPhraseDeterminant() {
    double p00 = (double) n2 / (double) n;
    double p01 = p00;
    double p10 = (double) n12 / (double) n1;
    double p11 = (double) (n2 - n12) / (double) (n - n1);
    double llr = n12 * (Math.log(p10) - Math.log(p00)) +
      (n2 - n12) * (Math.log(1 - p10) - Math.log(1 - p00)) +
      (n12 - n2) * (Math.log(p01) - Math.log(p11)) +
      (n - n1 - n12 + n2) * (Math.log(1 - p01) - Math.log(1 - p11));
    return llr;
  }
}
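
To show how a client would use the filter, here is a minimal, hypothetical usage sketch; the class name and the counts are invented for illustration, and in the actual pipeline the LikelyPhraseMapper supplies counts computed over the corpus.

// Hypothetical example, not part of the jtmt repository; counts are invented.
package net.sf.jtmt.phrase.filters;

public class LikelihoodRatioPhraseFilterExample {
  public static void main(String[] args) {
    long n = 210000L;   // total number of words in the corpus (invented)
    long n1 = 500L;     // occurrences of the leading word (invented)
    long n2 = 1200L;    // occurrences of the trailing word (invented)
    long n12 = 409L;    // occurrences of the bigram (invented)
    IPhraseFilter filter = new LikelihoodRatioPhraseFilter(n, n1, n2, n12);
    double determinant = filter.getPhraseDeterminant();
    System.out.println("log likelihood ratio = " + determinant);
    // by the convention described above, the bigram is kept as a phrase
    // only if the determinant is positive
    System.out.println("accepted as phrase? " + (determinant > 0.0D));
  }
}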

Results are similar to those from the previous two approaches: I get 91 phrases with my corpus of three electronic books, and a visual inspection shows that the phrases are quite nice. The top 10 examples are shown below:

sperm whale     409.39943002166274
white whale     205.1643105456127
march hare      200.58942539551933
mr holmes       191.6443128051443
baker street    151.41014908579413
captain ahab    124.85466211360614
lord st         108.17580790802768
aye aye         91.86846403214781
cape horn       74.58676644925315
ivory leg       71.68738725289398
...

Chi-Square Approach

With the Chi-Square approach, we assume the independence hypothesis H0, and then attempt to prove or disprove it. This is apparently a popular technique applicable in other fields as well. A high-level description of this approach is available in the BSA book, and a more detailed explanation is available online. Observed frequencies are first recorded in a contingency table.

       | w2       | ¬w2           | Total
-------+----------+---------------+---------------------
 w1    | n12      | n1 - n12      | n1
 ¬w1   | n2 - n12 | N - n12       | N + n2 - 2n12
-------+----------+---------------+---------------------
 Total | n2       | N + n1 - 2n12 | N + n1 + n2 - 2n12

where:
  n1  = number of occurrences of w1
  n2  = number of occurrences of w2
  n12 = number of occurrences of the phrase {w1,w2}
  N   = number of 2 word phrases in our corpus

We then compute a table of expected frequencies from the marginal totals of the observed frequencies.

       | w2       | ¬w2
-------+----------+----------
 w1    | E00      | E01
 ¬w1   | E10      | E11

Values of Eij are calculated thus:

Eij = N * Pm(i) * Pm(j)          ... by the multiplicative law of probability,
                                     assuming independence
          Nm(i)   Nm(j)
    = N * ----- * -----
            N       N

    = Nm(i) * Nm(j) / N

where:
  Pm(i) = marginal probability of event i
        = P(wi ∩ wj) + P(wi ∩ ¬wj)
  Nm(i) = marginal frequency (row or column total) corresponding to Pm(i)

We then calculate the Chi-Square statistic Χ2 and the degrees of freedom ν from our sample. In our code we just use the commons-math ChiSquareTest class and pass in the arrays of observed and expected frequencies, but the formula below would be used if we were doing this by hand.

            (Oij - Eij)^2
Χ2 = Σi Σj ---------------
                 Eij

ν = (r - 1) * (c - 1) = 1        ... where r = number of rows in the table (2)
                                       c = number of columns in the table (2)
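
To make the computation concrete, here is a small standalone sketch; the class name and counts are invented for illustration. It fills in the 2x2 contingency table laid out as above, derives the expected frequencies from the marginal totals, and computes Χ2 by hand, without commons-math.

// Hypothetical example, not part of the jtmt repository; counts are invented.
public class ChiSquareByHandExample {
  public static void main(String[] args) {
    long n = 50000L, n1 = 400L, n2 = 900L, n12 = 120L;  // invented counts
    // observed contingency table, laid out as in the text:
    //        w2         ¬w2
    //  w1    n12        n1 - n12
    // ¬w1    n2 - n12   N - n12
    double[][] observed = {
      { n12, n1 - n12 },
      { n2 - n12, n - n12 }
    };
    // marginal (row and column) totals and the grand total
    double[] rowTotals = new double[2];
    double[] colTotals = new double[2];
    double total = 0.0D;
    for (int i = 0; i < 2; i++) {
      for (int j = 0; j < 2; j++) {
        rowTotals[i] += observed[i][j];
        colTotals[j] += observed[i][j];
        total += observed[i][j];
      }
    }
    // Eij = Nm(i) * Nm(j) / N, and chi-square = sum of (Oij - Eij)^2 / Eij
    double chiSquare = 0.0D;
    for (int i = 0; i < 2; i++) {
      for (int j = 0; j < 2; j++) {
        double expected = rowTotals[i] * colTotals[j] / total;
        double diff = observed[i][j] - expected;
        chiSquare += diff * diff / expected;
      }
    }
    System.out.println("chi-square statistic = " + chiSquare);
  }
}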

For a given significance level α = 0.05, we find the critical value of the Χ2 distribution with ν = 1 degree of freedom for our hypothetical "fair" (independence) distribution, ie the point c where P(Χ2 < c) = 1 - α, so that 95% of the distribution lies below the critical value; for 1 degree of freedom this is approximately 3.84. We use the inverseCumulativeProbability() method of commons-math for this. If our computed value of Χ2 exceeds the critical value, then we conclude that the (hypothetical) independence hypothesis is incorrect, and that the words {w1,w2} are indeed related as a phrase.
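
The critical value lookup itself is a one-liner with the same commons-math 2.x classes that the filter below uses; a minimal sketch (the class name is invented):

// Hypothetical example, not part of the jtmt repository.
import org.apache.commons.math.MathException;
import org.apache.commons.math.distribution.ChiSquaredDistribution;
import org.apache.commons.math.distribution.ChiSquaredDistributionImpl;

public class ChiSquareCriticalValueExample {
  public static void main(String[] args) throws MathException {
    double alpha = 0.05D;
    // chi-square distribution with 1 degree of freedom (2x2 contingency table)
    ChiSquaredDistribution dist = new ChiSquaredDistributionImpl(1.0D);
    // critical value c such that P(X < c) = 1 - alpha; approximately 3.84
    double criticalValue = dist.inverseCumulativeProbability(1.0D - alpha);
    System.out.println("critical value = " + criticalValue);
  }
}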

The ChiSquare approach is used by LingPipe for phrase extraction, according to the BSA book.

ChiSquarePhraseFilter

The code for the ChiSquarePhraseFilter is shown below; the theory leading up to it is explained above. Essentially, we compute the Χ2 statistic from the observed and expected frequencies (the expected frequencies are based on the marginal totals of the observations), and compare it with the critical value at a 0.05 level of significance. If the result is positive, the observed frequencies lie outside the 95% range expected under a "fair" distribution, ie one where the words are independent, and the bigram can be considered a phrase.

// Source: src/main/java/net/sf/jtmt/phrase/filters/ChiSquarePhraseFilter.java
package net.sf.jtmt.phrase.filters;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.math.MathException;
import org.apache.commons.math.distribution.ChiSquaredDistribution;
import org.apache.commons.math.distribution.ChiSquaredDistributionImpl;
import org.apache.commons.math.linear.Array2DRowRealMatrix;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.commons.math.linear.RealVector;
import org.apache.commons.math.stat.inference.ChiSquareTest;
import org.apache.commons.math.stat.inference.ChiSquareTestImpl;

/**
 * Calculates the Chi-Squared statistic for the expected vs observed
 * frequencies, and finds if the statistic is higher than the critical
 * value computed at a 0.05 significance level (ie 95% coverage).
 */
public class ChiSquarePhraseFilter implements IPhraseFilter {

  private final Log log = LogFactory.getLog(getClass());

  private static final double ALPHA = 0.05D;

  private long n;    // number of 2 word phrases in corpus
  private long n1;   // number of occurrences of leading word in corpus
  private long n2;   // number of occurrences of trailing word in corpus
  private long n12;  // number of occurrences of phrase in corpus

  public ChiSquarePhraseFilter(long n, long n1, long n2, long n12) {
    this.n = n;
    this.n1 = n1;
    this.n2 = n2;
    this.n12 = n12;
  }

  @Override
  public double getPhraseDeterminant() {
    // set up contingency table of observed frequencies
    RealMatrix obs = new Array2DRowRealMatrix(2, 2);
    obs.setEntry(0, 0, n12);
    obs.setEntry(0, 1, (n1 - n12));
    obs.setEntry(1, 0, (n2 - n12));
    obs.setEntry(1, 1, (n - n12));
    // compute marginal frequencies: summing the two rows gives the column
    // totals, summing the two columns gives the row totals
    RealVector colTotals = obs.getRowVector(0).add(obs.getRowVector(1));
    RealVector rowTotals = obs.getColumnVector(0).add(obs.getColumnVector(1));
    double total = rowTotals.getL1Norm();
    // flatten contingency table of observed frequencies
    long[] observed = new long[4];
    int k = 0;
    for (int i = 0; i < obs.getRowDimension(); i++) {
      for (int j = 0; j < obs.getColumnDimension(); j++) {
        observed[k++] = (long) obs.getEntry(i, j);
      }
    }
    // compute expected frequencies from the marginal frequencies:
    // E[i][j] = rowTotal(i) * colTotal(j) / total
    double[] expected = new double[4];
    k = 0;
    for (int i = 0; i < obs.getRowDimension(); i++) {
      for (int j = 0; j < obs.getColumnDimension(); j++) {
        expected[k++] =
          rowTotals.getEntry(i) * colTotals.getEntry(j) / total;
      }
    }
    // find the test statistic
    ChiSquareTest test = new ChiSquareTestImpl();
    double chiSquare = test.chiSquare(expected, observed);
    // find the critical value at significance level ALPHA for a chi-square
    // distribution with 1 degree of freedom (2x2 contingency table)
    ChiSquaredDistribution dist = new ChiSquaredDistributionImpl(1.0D);
    double criticalValue = 0.0D;
    try {
      criticalValue = dist.inverseCumulativeProbability(1.0D - ALPHA);
    } catch (MathException e) {
      log.warn(e);
    }
    // return the difference between the test statistic and critical value
    return (chiSquare - criticalValue);
  }
}

As before, the results look quite believable. 103 phrases are recognized by this algorithm, and the top results are shown below. In the full output there is considerable overlap with the results from the likelihood ratio approach, even though it does not show up in the top entries here.

irene adler       1811.5261599783912
march hare        1530.794875612239
um um             1194.2290051976697
project gutenberg 1136.9018182378147
copper beeches    862.8478133287713
don sebastian     789.5537133726627
st simon          788.6790165231548
jabez wilson      727.0858527238903
boscombe valley   708.5084636022881
...

Resources

While figuring out this stuff, I came across a couple of very good online resources, which I would like to share here.

  • An Intuitive Explanation of Bayes' Theorem by Eliezer S. Yudkowsky - As the title suggests, this is a very easy-to-understand, commonsense explanation of Bayes' theorem. Read the rest of the articles on his site if you have some time; I found them quite interesting, and maybe you will too.

  • AP* Statistics Tutorial - The last time I used any statistics other than mean and standard deviation was in college, and that was geared to solving toy problems. This time round, I read through the entire tutorial (takes about 2-3 hours) with a view to applying it to real-life (data mining) situations. If you are in a similar situation, I highly recommend reading through it. If nothing else, it will help you understand the more statistically oriented data mining algorithms available out there.


