Tuesday, August 31, 2010

Finding Phrases - Two Statistical Approaches

In my previous post, I described how I used the Binomial Distribution to compute the expected probability of a word n-gram, and compared it against the actual probability to figure out whether the n-gram is a phrase. I first learned about using a Binomial Distribution for phrase detection from Dr Konchady's Building Search Applications (BSA) book. The book describes two approaches, but the explanation is a bit sketchy, and I could not understand the methods fully at first, so I developed my own.

Over the past couple of weeks, I have been trying to figure out these two methods, and I think I now understand them well enough to be comfortable using them. For the benefit of others who are as perplexed as I was by the mental leaps in the book's explanation, I have included my notes in this post. Hopefully they will help in understanding the code.

As I mentioned in my update to last week's post, I now have a pluggable interface for filtering phrases, called IPhraseFilter. There are currently four implementations available in the repository, two from last week and two from this week. The LikelyPhraseMapper calls the desired implementation. To keep the post down to a reasonable length, I'll just describe the phrase filter implementations - if you want to see how they fit in, you can refer to my previous post for the Mapper and Reducer classes, or download the whole thing from the jtmt repository.

One thing I liked about both the approaches described below is that there is no need to choose an arbitrary cutoff to ensure that the results are good. In both cases the cutoff is 0, ie if the determinant is negative the bigram is dropped, and if it is positive it is retained.

Likelihood Ratio Approach

This approach sets up two hypotheses and compares their likelihoods. The hypotheses are as follows:

For any word bigram {w1,w2} in the document corpus:

  H0 = w2 and w1 are independent
  H1 = w2 is dependent on w1

We first calculate the actual probabilities for the independent and dependent cases from the occurrence counts of the words in the corpus, like so:

    | P(w2|w1)       | P(w2|¬w1)
----+----------------+-----------------------------
 H0 | p00 = n2 / N   | p01 = n2 / N
 H1 | p10 = n12 / n1 | p11 = (n2 - n12) / (N - n1)

where:
  n1  = number of occurrences of w1
  n2  = number of occurrences of w2
  n12 = number of occurrences of {w1,w2}
  N   = number of words in the corpus

The derivations for the pij values in the table above are shown below.

Under the null hypothesis H0:

  p00 = P(w2|w1)
      = P(w2 ∩ w1) / P(w1)       ... by definition of conditional probability
      = P(w2) * P(w1) / P(w1)    ... since w1 and w2 are independent under H0
      = P(w2)
      = n2 / N

  p01 = P(w2|¬w1)
      = P(w2|w1)                 ... under H0, the probability of w2 following w1
                                     is the same as that of w2 following ¬w1
      = n2 / N

Under the alternative hypothesis H1:

  p10 = P(w2|w1)
      = P(w2 ∩ w1) / P(w1)       ... by definition of conditional probability
      = (n12 / N) / (n1 / N)
      = n12 / n1

  p11 = P(w2|¬w1)
      = P(w2 ∩ ¬w1) / P(¬w1)     ... by definition of conditional probability
      = ((n2 - n12) / N) / ((N - n1) / N)
      = (n2 - n12) / (N - n1)

We then use the Binomial distribution to calculate the expected probabilities of our observed counts under each hypothesis:

    | E(w2|w1)        | E(w2|¬w1)
----+-----------------+--------------------------
 H0 | b(n12; n2, p00) | b((n12-n2); (N-n1), p01)
 H1 | b(n12; n2, p10) | b((n12-n2); (N-n1), p11)
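
Here b(k; n, p) denotes the Binomial probability of observing k successes in n trials, each with success probability p; in LaTeX notation:

b(k;\,n,\,p) = \binom{n}{k}\, p^{k}\, (1-p)^{n-k}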

We are now able to calculate the likelihoods of hypotheses H0 and H1. The likelihood of a hypothesis is defined here as the ratio of the probability of a true positive to the probability of a false positive under that hypothesis.

L(H0) = E(w2|w1) / E(w2|¬w1)              ... by definition of likelihood

             b(n12; n2, p00)
      = --------------------------
         b((n12-n2); (N-n1), p01)

              C(n2,n12) * p00^n12 * (1-p00)^(n2-n12)
      = --------------------------------------------------------
         C(N-n1,n12-n2) * p01^(n12-n2) * (1-p01)^(N-n1-n12+n2)

L(H1) = E(w2|w1) / E(w2|¬w1)              ... by definition of likelihood

             b(n12; n2, p10)
      = --------------------------
         b((n12-n2); (N-n1), p11)

              C(n2,n12) * p10^n12 * (1-p10)^(n2-n12)
      = --------------------------------------------------------
         C(N-n1,n12-n2) * p11^(n12-n2) * (1-p11)^(N-n1-n12+n2)

The ratio of the two likelihoods tells us how much more likely the dependence assumption is than the independence assumption. Since the Binomial coefficients are identical in the numerator and denominator, they cancel out.

            p10^n12 * (1-p10)^(n2-n12) * p01^(n12-n2) * (1-p01)^(N-n1-n12+n2)
R(H1/H0) = -------------------------------------------------------------------
            p00^n12 * (1-p00)^(n2-n12) * p11^(n12-n2) * (1-p11)^(N-n1-n12+n2)
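
Taking logarithms turns the products into sums, which is the form the filter below actually computes; in LaTeX notation:

\log R(H_1/H_0) = n_{12}(\log p_{10} - \log p_{00})
                + (n_2 - n_{12})(\log(1-p_{10}) - \log(1-p_{00}))
                + (n_{12} - n_2)(\log p_{01} - \log p_{11})
                + (N - n_1 - n_{12} + n_2)(\log(1-p_{01}) - \log(1-p_{11}))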

LikelihoodRatioPhraseFilter

The code for the LikelihoodRatioPhraseFilter is shown below. The calculations are based on the formula for R(H1/H0) developed in the last section. Rather than return a yes/no answer, the filter returns the logarithm of R, so the client should check whether the returned value is greater than 0, ie whether the dependence hypothesis H1 (that the words are dependent on each other) is more likely than the independence hypothesis H0 (that the words are independent).

// Source: src/main/java/net/sf/jtmt/phrase/filters/LikelihoodRatioPhraseFilter.java
package net.sf.jtmt.phrase.filters;

/**
 * Uses the ratio of the likelihood of a dependence hypothesis to the
 * likelihood of the independence hypothesis for phrase bigrams to
 * filter out unlikely phrases. The log of the ratio is returned as
 * the determinant.
 */
public class LikelihoodRatioPhraseFilter implements IPhraseFilter {

  private long n;    // number of words in the corpus
  private long n1;   // number of occurrences of leading word in corpus
  private long n2;   // number of occurrences of trailing word in corpus
  private long n12;  // number of occurrences of phrase in corpus

  public LikelihoodRatioPhraseFilter(long n, long n1, long n2, long n12) {
    this.n = n;
    this.n1 = n1;
    this.n2 = n2;
    this.n12 = n12;
  }

  /**
   * Returns the log of the likelihood ratio between the dependence hypothesis
   * and the independence hypothesis. If the value is positive, then the
   * dependence hypothesis is more likely than the independence hypothesis.
   */
  @Override
  public double getPhraseDeterminant() {
    double p00 = (double) n2 / (double) n;
    double p01 = p00;
    double p10 = (double) n12 / (double) n1;
    double p11 = (double) (n2 - n12) / (double) (n - n1);
    double llr = n12 * (Math.log(p10) - Math.log(p00)) +
      (n2 - n12) * (Math.log(1 - p10) - Math.log(1 - p00)) +
      (n12 - n2) * (Math.log(p01) - Math.log(p11)) +
      (n - n1 - n12 + n2) * (Math.log(1 - p01) - Math.log(1 - p11));
    return llr;
  }
}
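
To show how a client would use the filter, here is a minimal, hypothetical usage sketch; the class name and the counts are invented for illustration, and in the actual pipeline the LikelyPhraseMapper supplies counts computed over the corpus.

// Hypothetical example, not part of the jtmt repository; counts are invented.
package net.sf.jtmt.phrase.filters;

public class LikelihoodRatioPhraseFilterExample {
  public static void main(String[] args) {
    long n = 210000L;   // total number of words in the corpus (invented)
    long n1 = 500L;     // occurrences of the leading word (invented)
    long n2 = 1200L;    // occurrences of the trailing word (invented)
    long n12 = 409L;    // occurrences of the bigram (invented)
    IPhraseFilter filter = new LikelihoodRatioPhraseFilter(n, n1, n2, n12);
    double determinant = filter.getPhraseDeterminant();
    System.out.println("log likelihood ratio = " + determinant);
    // by the convention described above, the bigram is kept as a phrase
    // only if the determinant is positive
    System.out.println("accepted as phrase? " + (determinant > 0.0D));
  }
}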

Results are similar to those from the previous two approaches: I get 91 phrases with my corpus of three electronic books, and a visual inspection shows that the phrases are quite nice. The top 10 examples are shown below:

sperm whale     409.39943002166274
white whale     205.1643105456127
march hare      200.58942539551933
mr holmes       191.6443128051443
baker street    151.41014908579413
captain ahab    124.85466211360614
lord st         108.17580790802768
aye aye         91.86846403214781
cape horn       74.58676644925315
ivory leg       71.68738725289398
...

Chi-Square Approach

With the Chi-Square approach, we assume the independence hypothesis H0, and then attempt to prove or disprove it. This is apparently a popular technique applicable in other fields as well. A high-level description of this approach is available in the BSA book, and a more detailed explanation is available online. Observed frequencies are first recorded in a contingency table.

       | w2       | ¬w2           | Total
-------+----------+---------------+---------------------
 w1    | n12      | n1 - n12      | n1
 ¬w1   | n2 - n12 | N - n12       | N + n2 - 2n12
-------+----------+---------------+---------------------
 Total | n2       | N + n1 - 2n12 | N + n1 + n2 - 2n12

where:
  n1  = number of occurrences of w1
  n2  = number of occurrences of w2
  n12 = number of occurrences of the phrase {w1,w2}
  N   = number of 2 word phrases in our corpus

We then compute a table of expected frequencies from the marginal totals of the observed frequencies.

       | w2       | ¬w2
-------+----------+----------
 w1    | E00      | E01
 ¬w1   | E10      | E11

Values of Eij are calculated thus:

Eij = N * Pm(i) * Pm(j)          ... by the multiplicative law of probability,
                                     assuming independence
          Nm(i)   Nm(j)
    = N * ----- * -----
            N       N

    = Nm(i) * Nm(j) / N

where:
  Pm(i) = marginal probability of event i
        = P(wi ∩ wj) + P(wi ∩ ¬wj)
  Nm(i) = marginal frequency (row or column total) corresponding to Pm(i)

We then calculate the Chi-Square statistic Χ2 and the degrees of freedom ν from our sample. In our code we just use the commons-math ChiSquareTest class and pass in the arrays of observed and expected frequencies, but the formula below would be used if we were doing this by hand.

            (Oij - Eij)^2
Χ2 = Σi Σj ---------------
                 Eij

ν = (r - 1) * (c - 1) = 1        ... where r = number of rows in the table (2)
                                       c = number of columns in the table (2)
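
To make the computation concrete, here is a small standalone sketch; the class name and counts are invented for illustration. It fills in the 2x2 contingency table laid out as above, derives the expected frequencies from the marginal totals, and computes Χ2 by hand, without commons-math.

// Hypothetical example, not part of the jtmt repository; counts are invented.
public class ChiSquareByHandExample {
  public static void main(String[] args) {
    long n = 50000L, n1 = 400L, n2 = 900L, n12 = 120L;  // invented counts
    // observed contingency table, laid out as in the text:
    //        w2         ¬w2
    //  w1    n12        n1 - n12
    // ¬w1    n2 - n12   N - n12
    double[][] observed = {
      { n12, n1 - n12 },
      { n2 - n12, n - n12 }
    };
    // marginal (row and column) totals and the grand total
    double[] rowTotals = new double[2];
    double[] colTotals = new double[2];
    double total = 0.0D;
    for (int i = 0; i < 2; i++) {
      for (int j = 0; j < 2; j++) {
        rowTotals[i] += observed[i][j];
        colTotals[j] += observed[i][j];
        total += observed[i][j];
      }
    }
    // Eij = Nm(i) * Nm(j) / N, and chi-square = sum of (Oij - Eij)^2 / Eij
    double chiSquare = 0.0D;
    for (int i = 0; i < 2; i++) {
      for (int j = 0; j < 2; j++) {
        double expected = rowTotals[i] * colTotals[j] / total;
        double diff = observed[i][j] - expected;
        chiSquare += diff * diff / expected;
      }
    }
    System.out.println("chi-square statistic = " + chiSquare);
  }
}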

For a given significance level α = 0.05, we find the critical value of the Χ2 distribution with ν = 1 degree of freedom for our hypothetical "fair" (independence) distribution, ie the point c where P(Χ2 < c) = 1 - α, so that 95% of the distribution lies below the critical value; for 1 degree of freedom this is approximately 3.84. We use the inverseCumulativeProbability() method of commons-math for this. If our computed value of Χ2 exceeds the critical value, then we conclude that the (hypothetical) independence hypothesis is incorrect, and that the words {w1,w2} are indeed related as a phrase.
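
The critical value lookup itself is a one-liner with the same commons-math 2.x classes that the filter below uses; a minimal sketch (the class name is invented):

// Hypothetical example, not part of the jtmt repository.
import org.apache.commons.math.MathException;
import org.apache.commons.math.distribution.ChiSquaredDistribution;
import org.apache.commons.math.distribution.ChiSquaredDistributionImpl;

public class ChiSquareCriticalValueExample {
  public static void main(String[] args) throws MathException {
    double alpha = 0.05D;
    // chi-square distribution with 1 degree of freedom (2x2 contingency table)
    ChiSquaredDistribution dist = new ChiSquaredDistributionImpl(1.0D);
    // critical value c such that P(X < c) = 1 - alpha; approximately 3.84
    double criticalValue = dist.inverseCumulativeProbability(1.0D - alpha);
    System.out.println("critical value = " + criticalValue);
  }
}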

The ChiSquare approach is used by LingPipe for phrase extraction, according to the BSA book.

ChiSquarePhraseFilter

The code for the ChiSquarePhraseFilter is shown below; the theory leading up to it is explained above. Essentially, we compute the Χ2 statistic from the observed and expected frequencies (the expected frequencies are based on the marginal totals of the observations), and compare it with the critical value at a 0.05 level of significance. If the result is positive, the observed frequencies lie outside the 95% range expected under a "fair" distribution, ie one where the words are independent, and the bigram can be considered a phrase.

// Source: src/main/java/net/sf/jtmt/phrase/filters/ChiSquarePhraseFilter.java
package net.sf.jtmt.phrase.filters;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.math.MathException;
import org.apache.commons.math.distribution.ChiSquaredDistribution;
import org.apache.commons.math.distribution.ChiSquaredDistributionImpl;
import org.apache.commons.math.linear.Array2DRowRealMatrix;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.commons.math.linear.RealVector;
import org.apache.commons.math.stat.inference.ChiSquareTest;
import org.apache.commons.math.stat.inference.ChiSquareTestImpl;

/**
 * Calculates the Chi-Squared statistic for the expected vs observed
 * frequencies, and finds if the statistic is higher than the critical
 * value computed at a 0.05 significance level (ie 95% coverage).
 */
public class ChiSquarePhraseFilter implements IPhraseFilter {

  private final Log log = LogFactory.getLog(getClass());

  private static final double ALPHA = 0.05D;

  private long n;    // number of 2 word phrases in corpus
  private long n1;   // number of occurrences of leading word in corpus
  private long n2;   // number of occurrences of trailing word in corpus
  private long n12;  // number of occurrences of phrase in corpus

  public ChiSquarePhraseFilter(long n, long n1, long n2, long n12) {
    this.n = n;
    this.n1 = n1;
    this.n2 = n2;
    this.n12 = n12;
  }

  @Override
  public double getPhraseDeterminant() {
    // set up contingency table of observed frequencies
    RealMatrix obs = new Array2DRowRealMatrix(2, 2);
    obs.setEntry(0, 0, n12);
    obs.setEntry(0, 1, (n1 - n12));
    obs.setEntry(1, 0, (n2 - n12));
    obs.setEntry(1, 1, (n - n12));
    // compute marginal frequencies: summing the two rows gives the column
    // totals, summing the two columns gives the row totals
    RealVector colTotals = obs.getRowVector(0).add(obs.getRowVector(1));
    RealVector rowTotals = obs.getColumnVector(0).add(obs.getColumnVector(1));
    double total = rowTotals.getL1Norm();
    // flatten contingency table of observed frequencies
    long[] observed = new long[4];
    int k = 0;
    for (int i = 0; i < obs.getRowDimension(); i++) {
      for (int j = 0; j < obs.getColumnDimension(); j++) {
        observed[k++] = (long) obs.getEntry(i, j);
      }
    }
    // compute expected frequencies from the marginal frequencies:
    // E[i][j] = rowTotal(i) * colTotal(j) / total
    double[] expected = new double[4];
    k = 0;
    for (int i = 0; i < obs.getRowDimension(); i++) {
      for (int j = 0; j < obs.getColumnDimension(); j++) {
        expected[k++] =
          rowTotals.getEntry(i) * colTotals.getEntry(j) / total;
      }
    }
    // find the test statistic
    ChiSquareTest test = new ChiSquareTestImpl();
    double chiSquare = test.chiSquare(expected, observed);
    // find the critical value at significance level ALPHA for a chi-square
    // distribution with 1 degree of freedom (2x2 contingency table)
    ChiSquaredDistribution dist = new ChiSquaredDistributionImpl(1.0D);
    double criticalValue = 0.0D;
    try {
      criticalValue = dist.inverseCumulativeProbability(1.0D - ALPHA);
    } catch (MathException e) {
      log.warn(e);
    }
    // return the difference between the test statistic and critical value
    return (chiSquare - criticalValue);
  }
}

As before, the results look quite believable. 103 phrases are recognized by this algorithm, and the top results are shown below. In the full output there is considerable overlap with the results from the likelihood ratio approach, even though it does not show up in the top entries here.

irene adler       1811.5261599783912
march hare        1530.794875612239
um um             1194.2290051976697
project gutenberg 1136.9018182378147
copper beeches    862.8478133287713
don sebastian     789.5537133726627
st simon          788.6790165231548
jabez wilson      727.0858527238903
boscombe valley   708.5084636022881
...

Resources

While figuring out this stuff, I came across a couple of very good online resources, which I would like to share here.

  • An Intuitive Explanation of Bayes' Theorem by Eliezer S. Yudkowsky - As the title suggests, this is a very easy-to-understand, commonsense explanation of Bayes' theorem. Read the rest of the articles on his site if you have some time; I found them quite interesting, and maybe you will too.

  • AP* Statistics Tutorial - The last time I used any statistics other than mean and standard deviation was in college, and that was geared to solving toy problems. This time round, I read through the entire tutorial (takes about 2-3 hours) with a view to applying it to real-life (data mining) situations. If you are in a similar situation, I highly recommend reading through it. If nothing else, it will help you understand the more statistically oriented data mining algorithms available out there.


