The Experiment

We use all the words of the Tanach as our lexicon. The text from which we obtain the maximal ELS phrases is the Torah (the five books of Moses). We examine three ranges of ELS skips: 2-100, 2-1000, and 1002-2000. For each maximal ELS phrase we measure two features: its difficulty class and the average conditional entropy of each letter of the phrase given the previous four letters. For our monkey text we randomly permute the letters of the Torah text.
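As a minimal sketch of the two text operations described above, the following Python reads an equidistant letter sequence out of a text and builds a monkey text by permuting letters. The function names `els` and `monkey_text`, the 0-based start index, and the `seed` parameter are illustrative assumptions, not the authors' code. The sketch does not cover the lexicon lookup or the assembly of maximal ELS phrases.

```python
import random


def els(text, start, skip, length):
    """Return the letters at start, start+skip, start+2*skip, ...

    `start` is a 0-based index (an assumption of this sketch).
    Returns None if the sequence would run past the end of the text.
    """
    end = start + skip * (length - 1)
    if end >= len(text):
        return None
    return text[start:end + 1:skip]


def monkey_text(text, seed=None):
    """Return a random permutation of the letters of `text`.

    Letter frequencies are preserved; letter order is destroyed.
    """
    rng = random.Random(seed)
    letters = list(text)
    rng.shuffle(letters)
    return "".join(letters)
```

For example, `els("abcdefghij", 1, 3, 3)` picks indices 1, 4, and 7 and returns `"beh"`. A monkey text produced by `monkey_text` always has exactly the same letter counts as its input.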

We want to determine whether there is any difference between the maximal ELS phrases generated by the Torah text and those generated by the letter-permuted Torah text. We test the Null hypothesis that there is no difference between the maximal ELS phrases generated by the Torah text and those generated by the monkey text against the Alternative hypothesis that there is some difference. The natural approach is to design a classifier that labels each maximal ELS phrase as coming either from the Torah text or from the letter-permuted Torah text. With equal prior probabilities, we design a quadratic Gaussian classifier.

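The quadratic Gaussian discriminant function, in its standard equal-prior form, assigns a feature vector x to class 1 when

\[ (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + \ln|\Sigma_1| < (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) + \ln|\Sigma_2| \]

where \mu_i and \Sigma_i are the mean vector and covariance matrix of class i; otherwise it assigns x to class 2.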

We estimate the mean vectors and covariance matrices by the sample means and sample covariance matrices of the two-dimensional feature vectors coming from the Torah text (class 1) and from the monkey text (class 2). The classifier result is summarized by the fraction of maximal ELS phrases that are correctly assigned.
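The classifier and its accuracy summary can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the rule is the standard equal-prior quadratic Gaussian rule (squared Mahalanobis distance plus log determinant of the covariance, smaller score wins), and accuracy is measured on the same vectors used to estimate the parameters, which is how the text above reads.

```python
import numpy as np


def fit_gaussian(X):
    """Sample mean vector and covariance matrix of the rows of X."""
    return X.mean(axis=0), np.cov(X, rowvar=False)


def quadratic_score(X, mean, cov):
    """Squared Mahalanobis distance plus log|cov| for each row of X."""
    diff = X - mean
    inv = np.linalg.inv(cov)
    maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
    _, logdet = np.linalg.slogdet(cov)
    return maha + logdet


def classification_accuracy(X1, X2):
    """Fraction of feature vectors assigned to their own class.

    X1 holds class 1 vectors (one per row), X2 holds class 2 vectors.
    A vector goes to class 1 when its class 1 score is strictly smaller,
    otherwise to class 2.
    """
    m1, c1 = fit_gaussian(X1)
    m2, c2 = fit_gaussian(X2)
    correct1 = np.sum(quadratic_score(X1, m1, c1) < quadratic_score(X1, m2, c2))
    correct2 = np.sum(quadratic_score(X2, m2, c2) <= quadratic_score(X2, m1, c1))
    return (correct1 + correct2) / (len(X1) + len(X2))
```

Here each row would be a two-dimensional feature vector (difficulty class, average conditional entropy) for one maximal ELS phrase.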

Even though each ELS in an ELS phrase is, by construction, a word in the lexicon, almost all maximal ELS phrases are nonsense. So we expect the classification accuracy to be not much better than 50%. If more of the maximal ELS phrases coming from the Torah text are proper Hebrew than of those generated from a monkey text, and if our features are the right ones to detect this, we would expect the classification accuracy to be somewhat more than 50%. But how much more does it have to be to be statistically significant?

To determine whether the fraction is statistically significant, we use a permutation test. We take the feature vectors coming from the maximal ELS phrases of the Torah text and those coming from the maximal ELS phrases of the letter-permuted Torah text and randomly shuffle them together. Then we divide them into two groups with the same sizes as had been produced by the Torah text and the monkey text. Using this data set, we design a quadratic Gaussian classifier and, as before, determine the fraction of correct assignments. We do this for N trials and record the fraction of trials whose classification accuracy is better than that obtained for the Torah text against the monkey text. Should this fraction be small, say smaller than 1/100, we would reject the Null hypothesis.
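The permutation test just described might be sketched as below. The names `qda_accuracy` and `permutation_p_value`, the default of 1000 trials, and the fixed seed are assumptions made for illustration. The classifier inside is a compact equal-prior quadratic Gaussian rule, evaluated on the same vectors it was fit to.

```python
import numpy as np


def qda_accuracy(X1, X2):
    """Fraction correctly assigned by an equal-prior quadratic Gaussian
    classifier whose parameters are estimated from X1 and X2."""
    def score(X, group):
        mean = group.mean(axis=0)
        cov = np.cov(group, rowvar=False)
        diff = X - mean
        maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        return maha + np.linalg.slogdet(cov)[1]

    correct1 = np.sum(score(X1, X1) < score(X1, X2))
    correct2 = np.sum(score(X2, X2) <= score(X2, X1))
    return (correct1 + correct2) / (len(X1) + len(X2))


def permutation_p_value(X1, X2, n_trials=1000, seed=0):
    """Return (observed accuracy, fraction of shuffled trials that beat it)."""
    rng = np.random.default_rng(seed)
    observed = qda_accuracy(X1, X2)
    pooled = np.vstack([X1, X2])
    n1 = len(X1)
    better = 0
    for _ in range(n_trials):
        shuffled = rng.permutation(pooled)  # shuffles the rows together
        # Split into two groups with the original group sizes.
        if qda_accuracy(shuffled[:n1], shuffled[n1:]) > observed:
            better += 1
    return observed, better / n_trials
```

With the threshold suggested in the text, the Null hypothesis would be rejected when the returned fraction is below 1/100. Resolving a fraction that small requires N to be well above 100.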
