The Experiment
Our lexicon consists of all the words of the Tanach. The text from which we obtain the maximal
ELS phrases is the text of the Torah (the 5 books of Moses). We examine three different ranges
for ELS skips: 2-100, 2-1000, and 1002-2000. For each maximal ELS phrase we measure two features:
its difficulty class, and the average conditional entropy of each letter of the phrase given the
previous four letters. For our monkey text we randomly permute the letters of the Torah text.
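
As a rough illustration of the two basic operations mentioned above, the following Python sketch builds a letter-permuted monkey text and reads off an equidistant letter sequence. The function names `monkey_text` and `els`, the `seed` parameter, and the restriction to positive skips are assumptions made for this example; they are not taken from the software used in the experiment.

```python
import random

def monkey_text(text, seed=None):
    """Return a random permutation of the letters of `text`.

    Letter frequencies are preserved exactly; only the order changes.
    """
    letters = list(text)
    random.Random(seed).shuffle(letters)
    return "".join(letters)

def els(text, start, skip, length):
    """Return the letters at positions start, start+skip, start+2*skip, ...

    Assumes skip > 0. Returns None if the sequence runs past the end of the text.
    """
    end = start + skip * (length - 1)
    if end >= len(text):
        return None
    return text[start:end + 1:skip]

# Example on a toy Latin-alphabet string (the real text would be the Torah):
sample = "abcdefghij"
print(els(sample, 1, 3, 3))          # positions 1, 4, 7 -> "beh"
print(monkey_text(sample, seed=1))   # same letters, shuffled order
```

Because the monkey text keeps the letter frequencies of the original, any difference between the two sets of maximal ELS phrases cannot be attributed to letter frequencies alone.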
We want to determine whether there is any difference between the maximal ELS phrases generated by the
Torah text and those generated by the letter-permuted Torah text. We test the null hypothesis that there
is no difference between the maximal ELS phrases generated by the Torah text and those generated
by the monkey text against the alternative hypothesis that there is some difference. The natural
approach is to design a classifier that labels each maximal ELS phrase as coming either from the Torah text
or from the letter-permuted Torah text. We design a quadratic Gaussian classifier with equal prior
probabilities for the two classes.
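The quadratic Gaussian discriminant function assigns a feature vector x to class 1 when

$$(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) + \ln|\Sigma_1| < (x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) + \ln|\Sigma_2|$$

otherwise it assigns x to class 2. Here \mu_i and \Sigma_i are the mean vector and covariance
matrix of class i; with equal priors, the prior terms cancel.
We estimate the means and covariance matrices by the sample means and sample covariance
matrices of the two-dimensional feature vectors coming from the Torah text (class 1)
and from the monkey text (class 2).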
The classifier result is summarized by the fraction of ELS maximal phrases that are correctly
assigned.
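
A minimal sketch of such a classifier for two-dimensional feature vectors is given below, using only the Python standard library. The names `mean_cov`, `score`, and `accuracy` are hypothetical. The sketch also assumes that the fraction correctly assigned is computed on the same feature vectors used to estimate the means and covariances, which the text does not state explicitly.

```python
import math

def mean_cov(points):
    """Sample mean and 2x2 sample covariance of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def score(x, mean, cov):
    """Squared Mahalanobis distance plus log-determinant; smaller means a better fit.

    Assumes the covariance matrix is non-singular (determinant > 0).
    """
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form with the inverse of [[sxx, sxy], [sxy, syy]].
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return maha + math.log(det)

def accuracy(class1, class2):
    """Fraction of feature vectors assigned to their own class (equal priors)."""
    m1, c1 = mean_cov(class1)
    m2, c2 = mean_cov(class2)
    correct = sum(score(x, m1, c1) < score(x, m2, c2) for x in class1)
    correct += sum(score(x, m1, c1) >= score(x, m2, c2) for x in class2)
    return correct / (len(class1) + len(class2))
```

In the experiment, `class1` would hold the (difficulty class, average conditional entropy) pairs from the Torah text and `class2` those from the monkey text.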
Even though each ELS in an ELS phrase is, by construction, a word in the lexicon, almost all maximal
ELS phrases are nonsense. So we expect the classification accuracy to be not much better than 50%.
If more of the maximal ELS phrases coming from the Torah text are proper Hebrew than of those
generated from a monkey text, and if our features are the right features to
detect this, we would expect the classification accuracy to be somewhat more than 50%.
But how much more must it be to be statistically significant?
To determine whether the fraction is statistically significant, we use a permutation test.
We take the feature vectors coming from the maximal ELS phrases of the Torah text and those coming
from the maximal ELS phrases of the letter-permuted Torah text and randomly shuffle them together.
Then we divide them into two groups with the same sizes as the groups produced by the Torah text and the
monkey text. Using this data set, we design a quadratic Gaussian classifier and, as before,
determine the fraction of correct assignments. We repeat this for N trials and record the fraction of
trials in which the fraction of correct assignments is better than that obtained for the Torah text
against the monkey text. Should this fraction of trials be small, say smaller than 1/100, we would
reject the null hypothesis.
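
The permutation test just described can be sketched as follows. The function name `permutation_p_value` and its parameters are assumptions for illustration. The `statistic` argument stands for any function of the two groups; in the experiment it would be the fraction correctly assigned by the quadratic Gaussian classifier, while the `mean_gap` function here is only a simple stand-in so that the example runs on its own.

```python
import random

def permutation_p_value(group1, group2, statistic, n_trials=1000, seed=None):
    """Fraction of random relabelings whose statistic is strictly better
    (larger) than the statistic of the original grouping."""
    rng = random.Random(seed)
    observed = statistic(group1, group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)  # keep the original group sizes
    better = 0
    for _ in range(n_trials):
        rng.shuffle(pooled)
        if statistic(pooled[:n1], pooled[n1:]) > observed:
            better += 1
    return better / n_trials

def mean_gap(a, b):
    """Stand-in statistic (NOT the classifier accuracy): absolute difference
    between the group means of the first feature."""
    return abs(sum(p[0] for p in a) / len(a) - sum(p[0] for p in b) / len(b))

# Toy data: two clearly separated groups of two-dimensional feature vectors.
torah_like = [(0.01 * i, 0.0) for i in range(20)]
monkey_like = [(5.0 + 0.01 * i, 0.0) for i in range(20)]
p = permutation_p_value(torah_like, monkey_like, mean_gap, n_trials=500, seed=0)
print(p)  # a value below 1/100 would lead to rejecting the null hypothesis
```

Note that the smallest nonzero value this estimate can take is 1/N, so N should be comfortably larger than 100 if the rejection threshold is 1/100.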