Extendable words in nucleotide sequences

Michael S. Gelfand, Constantine G. Kozhukhin, Pavel A. Pevzner

Research output: Contribution to journal › Article › peer-review

12 Citations (Scopus)


Previous statistical analyses revealed several peculiarities of nucleotide sequences that preclude their description by existing models and thus allow one to distinguish DNA and RNA sequences from random A, T, G, C-texts. This is a consequence of the unusual distribution of certain words in nucleotide sequences: while the distribution of (most) words is consistent with Markov models of small orders, the distribution of certain words cannot be described by any previous model (anomalies in distribution of homonucleotide/homopurine/homopyrimidine runs, complementary and mirror palindromes, and non-stationary words). In this work we introduce a probabilistic approach that is partly motivated by analogy with linguistics. We also describe another important feature of DNA/RNA sequences: anomalies in distribution of words of poor nucleotide composition. We show that some classes of these words are the major obstacle for the simple Markov description of nucleotide sequences.
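As an illustration of what "consistent with Markov models of small orders" can mean in practice, the sketch below compares the observed count of a word with the count expected under a first-order Markov model. It uses a common textbook estimator (the product of the word's dinucleotide counts divided by the counts of its interior nucleotides). This is an assumption made for illustration, not necessarily the procedure used in the paper, and the function names are hypothetical.

```python
from collections import Counter


def count_words(seq, k):
    """Count all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count_markov1(seq, word):
    """Expected count of `word` under a first-order Markov model.

    Illustrative estimator (an assumption, not taken from the paper):
    E(w1..wk) = N(w1w2) * N(w2w3) * ... * N(w(k-1)wk) / (N(w2) * ... * N(w(k-1)))
    """
    if len(word) < 2:
        raise ValueError("word must contain at least two nucleotides")
    mono = count_words(seq, 1)
    di = count_words(seq, 2)
    expected = 1.0
    # Multiply the counts of every overlapping dinucleotide in the word.
    for i in range(len(word) - 1):
        expected *= di[word[i:i + 2]]
    # Divide by the counts of the interior nucleotides.
    for ch in word[1:-1]:
        if mono[ch] == 0:
            return 0.0
        expected /= mono[ch]
    return expected


def observed_count(seq, word):
    """Observed number of overlapping occurrences of `word` in seq."""
    return count_words(seq, len(word))[word]
```

A word whose observed count departs strongly from this expectation (for example a long homonucleotide run such as `AAAAAA`) would be a candidate for the kind of anomaly the abstract describes. Deciding whether a departure is significant requires a variance estimate, which this sketch omits.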

Original language: English
Pages (from-to): 129-135
Number of pages: 7
Issue number: 2
Publication status: Published - Apr 1992
Externally published: Yes

