Several over-represented eight-letter words had known biological functions described in the eukaryotic transcription factor database TRANSFAC; however, many did not.

Besides calculating a P-value with the standard normal approximation associated with z-scores, we used two extra statistical controls to evaluate the significance of over-represented words.

These controls have important implications for evaluating over- and under-represented words with z-scores.
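As a small illustration of the standard normal approximation mentioned above, the following Python sketch converts a z-score into a one-sided (upper-tail) P-value. The function name `normal_upper_tail_p` is hypothetical, and the one-sided convention is an assumption made for illustration; the source does not state which tail convention was used.

```python
from math import erfc, sqrt


def normal_upper_tail_p(z):
    """Upper-tail P-value P(Z >= z) for a standard normal variable Z.

    Assumption: a one-sided test for over-representation; for
    under-representation the lower tail, P(Z <= z), would be used instead.
    """
    # For a standard normal, P(Z >= z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * erfc(z / sqrt(2.0))


if __name__ == "__main__":
    for z in (0.0, 1.96, 5.0):
        print(z, normal_upper_tail_p(z))
```

The approximation is only as good as the assumption that the z-score is close to normally distributed, which is what the two extra controls are meant to check.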

These elements can be identified by enumerative methods, which count all possible DNA words of a certain length in promoter sequences and then use statistics such as z-scores to evaluate over-represented words (13–16).
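The enumerative approach can be sketched in Python as follows. This is a minimal illustration, not the method used in the study: it assumes an independent-bases background model estimated from the input sequences and a binomial variance that ignores word self-overlap, whereas published z-score methods typically use a Markov background and an overlap-corrected variance. The function names `count_words` and `z_scores` are hypothetical.

```python
from collections import Counter
from math import sqrt


def count_words(seqs, k):
    """Count every k-letter word at every position of every sequence."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            w = s[i:i + k]
            if set(w) <= set("ACGT"):  # skip words containing N or other symbols
                counts[w] += 1
    return counts


def z_scores(seqs, k):
    """z-score of each observed word under an independent-bases background.

    Simplifying assumptions (not from the source): base frequencies are
    estimated from the sequences themselves, and the variance is the
    binomial n*p*(1-p), which ignores overlaps between word occurrences.
    """
    counts = count_words(seqs, k)
    bases = Counter(c for s in seqs for c in s.upper() if c in "ACGT")
    total = sum(bases.values())
    freq = {b: bases[b] / total for b in "ACGT"}
    n = sum(counts.values())  # number of word positions examined
    scores = {}
    for w, observed in counts.items():
        p = 1.0
        for b in w:
            p *= freq[b]
        expected = n * p
        variance = n * p * (1.0 - p)
        scores[w] = (observed - expected) / sqrt(variance) if variance > 0 else 0.0
    return scores


if __name__ == "__main__":
    toy = ["ACGTACGTTTGACGTA", "GGACGTACCATGCA"]  # toy data, not real promoters
    top = sorted(z_scores(toy, 4).items(), key=lambda kv: -kv[1])[:3]
    print(top)
```

For eight-letter words there are 4^8 = 65,536 possible words, so a dictionary keyed by the observed words is sufficient for promoter-sized data sets.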

Here we describe the results of applying enumerative methods to eight-letter words in the human PPRs where the statistical significance of over-represented words was determined using three different methods: (i) analytically derived z-scores (the standard method of assigning statistical significances to exact word matches in DNA); (ii) computer simulation of the Markov chain underlying the z-scores, to compare the z-scores with the actual extreme value distribution that they are supposed to approximate; and (iii) computer simulation of 1000 mock data sets, composed of matched, uniform random DNA sequences from the human genome (which produced P-values that were much more conservative than the z-scores).
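The third control, based on mock data sets, amounts to an empirical P-value: the fraction of mock data sets in which a statistic is at least as extreme as the one observed in the real data. The Python sketch below shows one common way to compute it. The add-one estimator and the function name `empirical_p_value` are assumptions made for illustration; the source does not specify the exact estimator, and the mock values here are made-up toy numbers rather than results from the study.

```python
def empirical_p_value(observed, mock_values):
    """Empirical upper-tail P-value from simulated (mock) statistics.

    Uses the add-one estimator (1 + #{mock >= observed}) / (1 + #mock),
    an assumption chosen so that the estimate is never exactly zero.
    """
    mock_values = list(mock_values)
    at_least_as_extreme = sum(1 for m in mock_values if m >= observed)
    return (1 + at_least_as_extreme) / (1 + len(mock_values))


if __name__ == "__main__":
    # Toy example: word counts from five hypothetical mock data sets.
    mock_counts = [12, 9, 15, 11, 10]
    print(empirical_p_value(14, mock_counts))
```

With 1000 mock data sets, the smallest P-value this estimator can report is 1/1001, which is one reason simulation-based P-values tend to be more conservative than those obtained from a normal approximation.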

Statistical analysis of over-represented words in human promoter sequences

Leonardo Mariño-Ramírez et al.

Nucleic Acids Research, 2004

