This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Statistical distributions and entropy considerations in gene codes

Krystyna Lukierska-Walasek k.lukierska@uksw.edu.pl Faculty of Mathematics and Natural Sciences, Cardinal Stefan Wyszyński University, Wóycickiego 1/3, 01-938 Warsaw, Poland    Krzysztof Topolski topolski@math.uni.wroc.pl Institute of Mathematics, Wroclaw University, Pl. Grunwaldzki 2/4, 50-384 Wrocław, Poland    Krzysztof Trojanowski trojanow@ipipan.waw.pl Institute of Computer Science, Polish Academy of Science, ul.Jana Kazmierza 5,01-248 Warsaw, Poland
(August 2, 2025)
Abstract

In our paper selected linguistic features of genomes to study the statistics of the gene codes are considered. We present the information theory from which it follows that if the system is described by distributions of hyperbolic type it leads to the possibility of entropy loss and stability. We show that the histograms of gene lengths are similar to that of language words. We show the correspondence between presented theory and results for the number of replicated genes and replicated fragments of genes in genomes for Borelia burgdorferi, Escherichia coli and Saccharomyces cerevisiae S288c.

genome, gene code, gene length, statistical distributions, entropy
pacs:
87.85.mg, 87.10.Vg, 87.10.Mn

I Introduction

In last years there appeared a possibility to provide some knowledge of the genome sequence data in many organisms. The genome was studied intensively by number of different methods Holste et al. (2003); Li and Kaneko (1992); Mantegna et al. (1994, 1995); Messer et al. (2005); Peng et al. (1992, 1994); de Sousa Vieira (1999); Stanley et al. (1999); Voss (1992). The statistical analysis of DNA is complicated because of its complex structure; it consists of coding and non coding regions, repetitive elements, etc., which have an influence on the local sequence decomposition. Long range correlations in sequence compositions of DNA are much more complex than simple power law; moreover effective exponents of scale regions are varying considerably between different species Holste et al. (2003), Messer et al. (2005). In papers Mantegna et al. (1994, 1995) the Zipf approach to analyzing linguistic texts has been extended to statistical study of DNA base pair sequences.

In our paper we take into account some linguistic features of genomes to study the statistics of gene codes. One of the fundamental observations based on the information theory says that if the system is described by special distribution of hyperbolic type it implies a possibility of entropy loss. Distributions of hyperbolic type describe also property of stability and flexibility which explain that the languages can develop without changing its basic structure (see Harremoës and Topsøe (2001)). In the experimental part of our research we show that a similar situation can occur in a genome sequence data which carries the genetic information.

To demonstrate analogies between coding of genes and representation of words in human languages we generated histograms of the sizes of groups of words of the same length in selected different European languages as well as histograms of the sizes of groups of genes of the same length in a genome. Three genomes were selected for evaluation, that is, two bacteria: Borrelia burgdorferi and Escherischia coli, and a yeast: Saccharomyces cerevisiae S288c. The gene data were obtained from NCBI GeneBank Gen . Each of the histograms was successfully approximated by Asymmetric Normal Inverse Gaussian Distribution.

Then, for each of the three genomes histograms of the sizes of groups of identical genes were generated. We showed that each of them can be modelled by distributions of hyperbolic type defined in the paper.

Additionally, the histograms for sizes of groups of identical fragments of genes of different sizes were calculated. In this case any two genes are regarded as similar when there exist a gene fragment, that is, an ordered sequence of symbols of size nn which can be found in both genes. Obtained results confirmed our hypothesis that the basic structure of genome remain stabile and entropy loss is possible.

The paper is organized as follows. In Sect. II we begin with a brief presentation of some common features of linguistics and genetics. Then, in Sect. III we describe Code Length Game, Maximal Entropy Principle and the theorem about hyperbolic type distributions and entropy loss. Sect. IV contains the experimental research concerned analysis of statistical properties of gene strings in three selected genomes and consists of three parts. Histograms of gene lengths are presented in the first part. Evaluation of the frequency of genes in a genome is the subject of the second part, that is, the results of searching for replicated genes in three genomes are presented. In the third, last part we observe frequency of appearance of gene fragments in a genome for different lengths of the fragments. Sect. V concludes the paper.

II Some common features of linguistics and genetics

A language is characterized by an alphabet with letters, like for the case of latin languages: a, b, c, etc. The words are denoted as sequences of nn letters. In quantum physics the analogy to letters can be attached to pure states, and texts correspond to mixed general states. Languages can have very different alphabets, for example: computers — 0,1 (two bits), English language — 27 letters with space, or DNA — four nitric bases: G(guanine), A(adenine), C(cytosine), T(thymine). The collection of letters can be ordered or disordered. To quantify the disorder of different collections of the letters we use an entropy

H=ipilog2pi,H=-\sum\limits_{i}p_{i}\log_{2}p_{i}, (1)

where pip_{i} denotes probability of occurrence of the ii-th letter. If we take the base of logarithm 22 this will lead to the entropy measured in bits. When all letters have the same probability in all states obviously the entropy has maximum value HmaxH_{max}.

A real entropy has lower value HeffH_{eff}, because in a real languages the letters have not the same probability of appearance. Redundancy RR of language is defined Shannon (1948) as follows:

R=HmaxHeffHmax.R=\frac{H_{max}-H_{eff}}{H_{max}}. (2)

The quantity HmaxH_{max}  - HeffH_{eff} is called an information. Information depends on difference between the maximum entropy and the actual entropy. The bigger actual entropy means the smaller redundancy. Redundancy can be measured by values of the frequencies with which different letters occur in one or more texts. Redundancy RR denotes the number that if we remove the part RR of the letters determined by redundancy, the meaning of the text will be still understood.

In English some letters occur more frequently than other and similarly in DNA of vertebrates the frequency of nitric bases C and G pairs is usually less than the frequency of A and T pairs. The low value of redundancy allows in easier way to fight transcription errors in a gene code. In the papers Mantegna et al. (1994, 1995) it is demonstrated that non coding regions of eukaryotes display a smaller entropy and larger redundancy than coding regions.

III Maximal Entropy Principle and Code Length Game and Entropy Loss

In this section we provide some mathematics from the information theory which will be helpful in the quantitative formulation of our approach, for details see Harremoës and Topsøe (2001).

Let AA be the alphabetalphabet which is a discrete finite or countable infinite set. Let M+1M_{+}^{1} and M+1(A){}^{\sim}M_{+}^{1}(A) are respectively, the set of probability measures on AA and the set of non-negative measures PP, such that P(A)1P(A)\leq 1. The elements in AA can be thought as letters. By K(A)K(A) we denote the set of mappings, compact codes, k:A[0,]k\,:A\,\rightarrow[0,\infty], which satisfy Kraft’s equality Cover and Thomas (1991):

iAexp(ki)=1.\sum\limits_{i\in A}\exp(-k_{i})=1. (3)

By K(A){}^{\sim}K(A) we denote the set of all mappings, general codes, k:A[0,]k\,:\,A\rightarrow[0,\infty], which satisfy Kraft’s inequality Cover and Thomas (1991):

iAexp(ki)1.\sum\limits_{i\in A}\exp(-k_{i})\leq 1. (4)

For kK(A)k\in^{\sim}K(A) and iAi\in A, kik_{i} is the code length, for example the length of the word. For kk\in K(A){}^{\sim}K(A) and PM+1(A)P\in M_{+}^{1}(A) the average code length is defined as

<k,P>=iAkipi.<k,P>=\sum\limits_{i\in A}k_{i}p_{i}. (5)

There is bijective correspondence between pip_{i} and kik_{i}

ki=lnpiandpi=exp(ki).k_{i}=-\ln p_{i}\quad\mbox{and}\quad p_{i}=\exp(-k_{i}).

For PM+1(A)P\in M_{+}^{1}(A) we also introduce the entropy

H(p)=iApilnpi.H(p)=-\sum\limits_{i\in A}p_{i}\ln p_{i}. (6)

The entropy can be represented as minimal average code length, (see Harremoës and Topsøe (2001)):

H(P)=minkK(A)<k,P>.H(P)=\min_{k\in^{\sim}K(A)}<k,P>.

Let 𝒫+1(A){\cal P}\subseteq{\cal M}_{+}^{1}(A) than

Hmax(𝒫)\displaystyle H_{\max}(\cal P) =\displaystyle= supP𝒫infkK(A)<k,P>\displaystyle\sup_{P\in\cal P}\inf_{k\in^{\sim}K(A)}<k,P> (7)
=\displaystyle= supP𝒫H(P)\displaystyle\sup_{P\subset\cal P}H(P)
\displaystyle\leq infkK(A)supP𝒫<k,P>\displaystyle\inf_{k\in^{\sim}K(A)}\sup_{P\subset\cal P}<k,P>
=\displaystyle= Rmin(𝒫).\displaystyle R_{min}(\cal P).

The formula (7) presents the Nash optimal strategies. min{\cal R}_{min} denotes minimumminimum riskrisk, kk denotes the Nash equilibrium code, PP denotes probability. For example, in a linguistics the listener is a minimizer, speaker is a maximizer. We have words with distributions pip_{i} and their codes kik_{i}, i=1,2,i=1,2,.... The listener chooses codes kik_{i},the speaker chooses probability distributions pip_{i}.
We can notice that Zipf argued Zipf (1949) that in the development of a language vocabulary balance is reached as a result of two opposing forces: unificationunification which tends to reduce the vocabulary and corresponds to a principle of least effort, seen from point of view of speaker and diversification connected with the listeners wish to know meaning of speech.

The principle (7) is so basic as Maximum Entropy Principle has a sense that search for one type of optimal strategy called as Code Length Game translates directly into a search for distributions with maximum entropy. It is a given a code kK(A)k\in\,^{\sim}K(A) and distribution PM+1(A)P\in M^{1}_{+}(A). Optimal strategy according H(P)=infkK<k,P>H(P)=inf_{k\in\,^{\sim}K}<k,P> is represented by entropy H(P)H(P), where actual strategy is represented by <k,P><k,P>.

Zipf’s law is an empirical observation which relates rank and frequency of words in natural language Zipf (1949). This law suggests modeling by distributions of hyperbolic type Harremoës and Topsøe (2001), because no distributions over NN have probabilities proportional to 1/i1/i, due to the lack of normalization condition.
We consider a class of distributions P=(p1,p2,)P=(p_{1},p_{2},...) over NN. If p1,p2p_{1}\geq,p_{2}\geq..., PP is said to be hyperbolic if for any given α>1\alpha>1, piiαp_{i}\geq i^{-\alpha} for infinitely many indexes ii. As an example we can choose pii1(log(i))cp_{i}\sim i^{-1}(\log(i))^{-c} for some constant c>2c>2.

The Code Length Game for model 𝒫M+1(N){\cal P}\in M_{+}^{1}(N) with codes k:A[0,]k:A\rightarrow[0,\infty] for which iAexp(ki)=1\sum\limits_{i\in A}\exp(-k_{i})=1, is in equilibrium if and only if Hmax(co(𝒫)=Hmax(𝒫)H_{max}(co({\cal P})=H_{max}({\cal P}). In such a case a distribution PP^{*} is the HmaxH_{max} attractor such that PnPP_{n}\rightarrow P^{*}, if for every sequence (Pn)n>1𝒫(P_{n})_{n>1}\subseteq\cal P for which H(Pn)Hmax(𝒫)H(P_{n})\rightarrow H_{max}(\cal P). One expects that H(P)=Hmax(𝒫)H(P^{*})=H_{max}(\cal P) but in the case with entropy loss we have H(P)<Hmax(𝒫)H(P^{*})<H_{max}(\cal P). Such possibility appears when distributions are hyperbolic. It follows from theorem in Harremoës and Topsøe (2001).

Theorem. Assume that PM+1(N)P^{*}\in M_{+}^{1}(N) is of finite entropy and it has ordered point probabilities. The necessary and sufficient condition that PP^{*} can occur as HmaxH_{max}–attractor in a model with entropy loss, is that PP^{*} is hyperbolic. If this condition is fulfilled than for every hh with H(P)<h<H(P^{*})<h<\infty, there exists a model 𝒫=𝒫h{\cal P}={\cal P}_{h} with PP^{*} as HmaxH_{max}–attractor and Hmax(𝒫h)=hH_{max}({\cal P}_{h})=h. In fact, 𝒫h=(P|<k,P>h){\cal P}_{h}=(P|<k^{*},P>\leq h) is a largest model. kk^{*} denotes the code adopted to PP^{*}, that is, k=ln(p)k^{*}=-\ln(p^{*}), i>1i>1.

As an example we can consider ”an ideal” language where the frequencies of words are described by hyperbolic distribution PP^{*} with a finite entropy. At a certain stage of life one is able to communicate at reasonably height rate about H(P)H(P^{*}) and improve language by introduction of specialized words, which occur seldom in a language as a whole. This process can be continued during the rest of life. One is able to express complicated ideas develop language without changing a basic structure of the language. We can expect similar situation in gene strings, which also carry an information.

IV Application to genetics

IV.1 Part I– Histograms for gene lengths and word lengths

In this part we study the histograms of gene lengths. One can observe, that the histograms are of the same type as histograms for the word lengths. We model histograms of gene lengths by Asymmetric Inverse Gaussian Distributions as in the case of histograms of word lengths.

Refer to caption
Refer to caption
Refer to caption
Figure 1: Histograms of the gene length in the genome of Borrelia burgdorferi (the left graph), Escherichia coli (the middle graph) and Saccharomyces cerevisiae S288c (the right graph) with the fits of AIGD

Figure 1 presents histograms of the gene length in three genome and their approximations by probability the Asymmetric Inverse Gaussian Distribution (AIGD). The AIGD has following parameters: for Borrelia burgdorferi (μ=420.9643\mu=420.9643, σ=466.0570\sigma=466.0570 and γ=468.7277\gamma=468.7277); for Escherischia coli (μ=448.0338\mu=448.0338, σ=531.6735\sigma=531.6735 and γ=529.9635\gamma=529.9635); for Saccharomyces cerevisiae S288c (μ=201.5467,\mu=201.5467, σ=257.1359\sigma=257.1359 and γ=322.3161\gamma=322.3161).

Systems where maximum entropy distributions does not exists can be described by distributions of hyperbolic type. In such systems stability and flexibility is present similarly as in natural languages or in description of genes where redundancy is confirmed. It is interesting to show that the histograms for gene lengths resemble other histograms obtained for word lengths in human languages Win . In Figure 2 we present the histograms of the sizes of groups of words of the same length for four European languages: Czech, German, Italian and Polish which also the fit the Asymetric Inverse Gaussian distributions.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2: Graphs of the sizes of groups of words of the same length in four different European languages from the left to the right, Czech, German, Italian and Polish respectively.

In this case the AIGD has following parameters: for Czech (μ=4.0883,\mu=4.0883, σ=2.1168,\sigma=2.1168, and γ=5.4246\gamma=5.4246); for German (μ=0.5941,\mu=-0.5941, σ=2.2183,\sigma=2.2183, and γ=12.2476\gamma=12.2476); for Italian (μ=11.0250,\mu=11.0250, σ=2.4615\sigma=2.4615 and γ=0.5362\gamma=-0.5362); and for Polish (μ=15.1809,\mu=-15.1809, σ=1.3876,\sigma=1.3876, and γ=27.3526\gamma=27.3526). We investigated asymptotic properties of these distributions. The tails of these distributions appear to be of the power type. We show it for Borrelia and Saccharomyces.

Figure 3 shows the fit of the tail of gene length distribution of the Borrelia burgdorferi to function of the form f(x)=Cx1af(x)=C\,x^{-1-a}, where C>0C>0 and a>2a>2 are constants. The least squares best fit are obtained with C=1500,899C=1500,899 and a=2,165a=2,165.

Refer to caption

Figure 3: The comparison of the tail of gene length distribution of the Borrelia burgdorferi to the tail of the power type distribution.

Figure 4 presents plot of the logarithm of gene length versus logarithm of its length rank for Saccharomyces cerevisiae S288c.

Refer to caption
Figure 4: Plot of the logarithm of the gene length LL versus logarithm of its length rank for the Saccharomyces cerevisiae S288c genome

Refer to caption

Figure 5: Plot of the logarithm of the gene length LL versus logarithm of it length rank for the genes with size bigger then 5379 bp in Saccharomyces cerevisiae S288c genome. The straight line represent the linear regression line y=0.2413x+8.4718.y=-0.2413\,x+8.4718.

Figure 5 presents a part of the previous plot, the fit of the tail(for genes with size bigger than 5379 bp) of gene length distribution of the Saccharomyces cerevisiae S288c. The straight line denotes that tail of the gene length distributions is of power type.

IV.2 Part II – Statistics for replicated genes

In this part we evaluated numbers of identical genes in each of the three genomes. Any two genes are identical iff they are of the same length and have the same content ordered in the same way. Otherwise, the genes are regarded as different, that is, for example, a smallest difference in the order of symbols in any two genes of the same size and content makes them different. The main parameters of the gene ”language” for each on the genomes is presented in Table 1. One can see, that gene lengths span within the range from 6381 for Borrelia burgdorferi to 14682 — for Saccharomyces cerevisiae S288c. In every case the range is many times wider that the number of genes in the genome. However, in spite of this, some number of duplicated genes can be found in genomes of each of the analysed bacterium and yeast.

Table 1: Number of genes, maximum and minimum length of genes in the three selected genomes
genome num. of genes min. length max. length
Borrelia burgdorferi 1242 120 6501
Escherischia coli 4920 75 11421
Saccharomyces cerevisiae S288c 5906 51 14733

The lists of identical genes are presented in Tables 2, 3 and 4 (the genes are identified with labels originated from GeneBank files). Each table shows groups of genes sorted from the largest groups to the smallest ones. Respectively to the data in the tables two histograms with sorted group sizes for Escherischia coli and Saccharomyces cerevisiae S288c are presented in Figure 6 (the histogram for Borrelia burgdorferi is omitted due to its simplicity — just two groups which consist of only two genes appear in this case).

Refer to caption Refer to caption

Figure 6: Graphs of the sizes of groups of identical genes in Escherischia coli (on the left) and Saccharomyces cerevisiae S288c (on the right) — a zoom on the first 30 sizes of groups sorted in descending order

Figure 6 shows how often the same sequences are repeated in genomes. For example, in the case of Escherischia coli the two mostly repeated sequence with the rank numbers one and two appear in 5 different genes each, the third sequence appears 4 times, etc.

The series with sizes of the groups of identical genes can be interpreted as probabilities of gene appearance in the genome. To obtain this, for every unique sequence we assign its group size, that is the number of appearances in the genome. Then, the group sizes have to be normalized, because probabilities have to sum up to 1. Such probability of appearance of unique sequences is of hyperbolic type in the sense of definition piiαp_{i}\geq i^{-\alpha}, that is, one can show that the inequality is satisfied for each of the values in the series respectively for: Borrelia burgdorferi — for, e.g., α=9.2785\alpha=9.2785, Escherischia coli — for, e.g., α=9.9426\alpha=9.9426, and Saccharomyces cerevisiae S288c — for, e.g., α=10.528\alpha=10.528.

It is necessary to mention here, that due to the normalization of a series of nonzero values the inequality piiαp_{i}\geq i^{-\alpha} can never be satisfied for i=1i=1 because 1α=11^{-\alpha}=1 regardless of α\alpha (simply, the probability of any option is never equal 1 if there is more that one random option to chose and all of them have a non-zero chance to be selected). Therefore, to show the hyperbolic nature of a single series, we needed to start with comparison of the second element pi=2p_{i=2} and 2α2^{-\alpha}, that is, pi=1p_{i=1} is omitted. For clarity, we did also another calculations where the first element pi=1p_{i=1} is compared with (i+1)α(i+1)^{-\alpha} and so on, and obtained values of α\alpha are the same as in the former method.

Graphs of sizes of gene groups from Figure 6 and the observation that probabilities of appearance of unique sequences are of hyperbolic type suggest that the realization of the theorem and that entropy loss is possible. Similarly, as in the case of languages the basic structure of genomes remains stabile.

Table 2: Groups of identical genes in the genome of Borrelia burgdorferi; every group starts with a group size, the length of a unique sequence which is written in brackets and the list of gene labels from GeneBank
2 (204) gi—387827798—ref—NC_\_017424.1—:15230-15433 B. burgd. N40 plasmid N40_\_cp32-10
gi—387827867—ref—NC_\_017402.1—:15177-15380 B. burgd. N40 plasmid N40_\_cp32-9
2 (237) gi—387827798—ref—NC_\_017424.1—:8732-8968 B. burgd. N40 plasmid N40_\_cp32-10
gi—387827867—ref—NC_\_017402.1—:8730-8966 B. burgd. N40 plasmid N40_\_cp32-9
Table 3: Groups of identical genes in the genome of Escherischia coli; every group starts with a group size, the length of a unique sequence which is written in brackets and the list of gene labels from GeneBank
5 (378) gi—387604868—ref—NC_\_017627.1—:27400-27777 E. coli 042 plasmid pAA
gi—387604868—ref—NC_\_017627.1—:c40646-40269 E. coli 042 plasmid pAA
gi—387605479—ref—NC_\_017626.1—:c1601255-1600878 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c20942-20565 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c2746558-2746181 E. coli 042, complete genome
5 (522) gi—387605479—ref—NC_\_017626.1—:4574451-4574972 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:565707-566228 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c1238313-1237792 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c1633110-1632589 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c7035-6514 E. coli 042, complete genome
4 (723) gi—387605479—ref—NC_\_017626.1—:2969552-2970274 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:4575077-4575799 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c1632484-1631762 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c6409-5687 E. coli 042, complete genome
3 (276) gi—387605479—ref—NC_\_017626.1—:c4352916-4352641 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c4377124-4376849 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c665916-665641 E. coli 042, complete genome
3 (504) gi—387605479—ref—NC_\_017626.1—:c4352722-4352219 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c4376930-4376427 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c665722-665219 E. coli 042, complete genome
2 (276) gi—387604868—ref—NC_\_017627.1—:27080-27355 E. coli 042 plasmid pAA
gi—387605479—ref—NC_\_017626.1—:c4946969-4946694 E. coli 042, complete genome
2 (276) gi—387604868—ref—NC_\_017627.1—:c40966-40691 E. coli 042 plasmid pAA
gi—387605479—ref—NC_\_017626.1—:c2746878-2746603 E. coli 042, complete genome
2 (276) gi—387605479—ref—NC_\_017626.1—:c1601575-1601300 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c21262-20987 E. coli 042, complete genome
2 (858) gi—387604868—ref—NC_\_017627.1—:13-870 E. coli 042 plasmid pAA
gi—387604868—ref—NC_\_017627.1—:c108174-107317 E. coli 042 plasmid pAA
2 (258) gi—387604868—ref—NC_\_017627.1—:112802-113059 E. coli 042 plasmid pAA
gi—387604868—ref—NC_\_017627.1—:c108731-108474 E. coli 042 plasmid pAA
2 (177) gi—387605479—ref—NC_\_017626.1—:1409458-1409634 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c2263053-2262877 E. coli 042, complete genome
2 (492) gi—387605479—ref—NC_\_017626.1—:3438290-3438781 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:5148233-5148724 E. coli 042, complete genome
2 (183) gi—387605479—ref—NC_\_017626.1—:1409283-1409465 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c2263228-2263046 E. coli 042, complete genome
2 (891) gi—387604868—ref—NC_\_017627.1—:33619-34509 E. coli 042 plasmid pAA
gi—387604868—ref—NC_\_017627.1—:c110418-109528 E. coli 042 plasmid pAA
2 (408) gi—387605479—ref—NC_\_017626.1—:c2327742-2327335 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c5136565-5136158 E. coli 042, complete genome
2 (327) gi—387604868—ref—NC_\_017627.1—:33296-33622 E. coli 042 plasmid pAA
gi—387604868—ref—NC_\_017627.1—:c21490-21164 E. coli 042 plasmid pAA
2 (108) gi—387605479—ref—NC_\_017626.1—:c4080133-4080026 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c4081099-4080992 E. coli 042, complete genome
2 (156) gi—387605479—ref—NC_\_017626.1—:1411502-1411657 E. coli 042, complete genome
gi—387605479—ref—NC_\_017626.1—:c2260557-2260402 E. coli 042, complete genome
Table 4: Groups of identical genes in the genome of Saccharomyces cerevisiae S288c; every group starts with a group size, the length of a unique sequence which is written in brackets and the list of gene labels from GeneBank
4 (132) gi—330443681—ref—NC_\_001144.5—:c468958-468827 S. c. S288c chr. XII
gi—330443681—ref—NC_\_001144.5—:c472610-472479 S. c. S288c chr. XII
gi—330443681—ref—NC_\_001144.5—:c482190-482059 S. c. S288c chr. XII
gi—330443681—ref—NC_\_001144.5—:c485842-485711 S. c. S288c chr. XII
4 (1323) gi—330443520—ref—NC_\_001136.10—:1206997-1208319 S. c. S288c chr. IV
gi—330443531—ref—NC_\_001137.3—:c449024-447702 S. c. S288c chr. V
gi—330443681—ref—NC_\_001144.5—:c481601-480279 S. c. S288c chr. XII
gi—330443753—ref—NC_\_001148.4—:56748-58070 S. c. S288c chr. XVI
4 (1089) gi—330443681—ref—NC_\_001144.5—:c470405-469317 S. c. S288c chr. XII
gi—330443681—ref—NC_\_001144.5—:c474057-472969 S. c. S288c chr. XII
gi—330443681—ref—NC_\_001144.5—:c483637-482549 S. c. S288c chr. XII
gi—330443681—ref—NC_\_001144.5—:c487289-486201 S. c. S288c chr. XII
3 (1323) gi—330443638—ref—NC_\_001142.9—:472760-474082 S. c. S288c chr. X
gi—330443681—ref—NC_\_001144.5—:593438-594760 S. c. S288c chr. XII
gi—330443688—ref—NC_\_001145.3—:196628-197950 S. c. S288c chr. XIII
3 (345) gi—330443681—ref—NC_\_001144.5—:472113-472457 S. c. S288c chr. XII
gi—330443681—ref—NC_\_001144.5—:485345-485689 S. c. S288c chr. XII
gi—330443681—ref—NC_\_001144.5—:488997-489341 S. c. S288c chr. XII
3 (5391) gi—330443520—ref—NC_\_001136.10—:1526321-1531711 S. c. S288c chr. IV
gi—330443681—ref—NC_\_001144.5—:1072508-1077898 S. c. S288c chr. XII
gi—330443743—ref—NC_\_001147.6—:1085473-1090863 S. c. S288c chr. XV
2 (528) gi—330443489—ref—NC_\_001135.5—:13282-13809 S. c. S288c chr. III
gi—330443489—ref—NC_\_001135.5—:200442-200969 S. c. S288c chr. III
2 (186) gi—330443590—ref—NC_\_001140.6—:c212720-212535 S. c. S288c chr. VIII
gi—330443590—ref—NC_\_001140.6—:c214718-214533 S. c. S288c chr. VIII
2 (1314) gi—330443743—ref—NC_\_001147.6—:1080276-1081589 S. c. S288c chr. XV
gi—330443753—ref—NC_\_001148.4—:c10870-9557 S. c. S288c chr. XVI
2 (363) gi—330443681—ref—NC_\_001144.5—:c13445-13083 S. c. S288c chr. XII
gi—330443715—ref—NC_\_001146.8—:781918-782280 S. c. S288c chr. XIV
2 (363) gi—330443595—ref—NC_\_001141.2—:c9155-8793 S. c. S288c chr. IX
gi—330443638—ref—NC_\_001142.9—:c9138-8776 S. c. S288c chr. X
2 (1323) gi—330443391—ref—NC_\_001133.9—:c165866-164544 S. c. S288c chr. I
gi—330443753—ref—NC_\_001148.4—:c810269-808947 S. c. S288c chr. XVI
2 (1323) gi—330443531—ref—NC_\_001137.3—:c498124-496802 S. c. S288c chr. V
gi—330443688—ref—NC_\_001145.3—:c378325-377003 S. c. S288c chr. XIII
2 (1323) gi—330443743—ref—NC_\_001147.6—:595112-596434 S. c. S288c chr. XV
gi—330443753—ref—NC_\_001148.4—:844709-846031 S. c. S288c chr. XVI
2 (1323) gi—330443520—ref—NC_\_001136.10—:1096064-1097386 S. c. S288c chr. IV
gi—330443578—ref—NC_\_001139.9—:c823015-821693 S. c. S288c chr. VII
2 (5313) gi—330443543—ref—NC_\_001138.5—:138204-139496,139498-143517 S. c. S288c chr. VI
gi—330443578—ref—NC_\_001139.9—:811738-813030,813032-817051 S. c. S288c chr. VII
2 (1317) gi—330443543—ref—NC_\_001138.5—:138204-139520 S. c. S288c chr. VI
gi—330443578—ref—NC_\_001139.9—:811738-813054 S. c. S288c chr. VII
2 (5580) gi—330443578—ref—NC_\_001139.9—:1084864-1084882,1085031-1090591 S. c. S288c chr. VII
gi—330443753—ref—NC_\_001148.4—:c6007-5989,c5840-280 S. c. S288c chr. XVI
2 (651) gi—330443531—ref—NC_\_001137.3—:c5114-4602,c4322-4185 S. c. S288c chr. V
gi—330443681—ref—NC_\_001144.5—:1066572-1067084,1067364-1067501 S. c. S288c chr. XII
2 (1770) gi—330443595—ref—NC_\_001141.2—:c18553-16784 S. c. S288c chr. IX
gi—330443638—ref—NC_\_001142.9—:c18536-16767 S. c. S288c chr. X
2 (495) gi—330443743—ref—NC_\_001147.6—:1082718-1083212 S. c. S288c chr. XV
gi—330443753—ref—NC_\_001148.4—:c8427-7933 S. c. S288c chr. XVI
2 (633) gi—330443489—ref—NC_\_001135.5—:c13018-12386 S. c. S288c chr. III
gi—330443489—ref—NC_\_001135.5—:c200178-199546 S. c. S288c chr. III
2 (1140) gi—330443482—ref—NC_\_001134.8—:c811479-810340 S. c. S288c chr. II
gi—330443688—ref—NC_\_001145.3—:7244-8383 S. c. S288c chr. XIII

IV.3 Part III – statistics for replicated fragments

One can see that the numbers of identical genes in a genome are rather small mostly due to the fact, that gene sizes differ to each other in many cases. In the third part of the experimental research we evaluate numbers of identical fragments of genes in the way where the ii-th symbol of one gene sequence being compared is not necessarily aligned against the ii-th symbol of the other. Clearly, the size of compared fragment is constant, however, the fragment starting position in the gene may vary. For example, in the case of Borrelia burgdorferi the size of the shortest gene equals 120 symbols whereas the size of the longest one – 6501. Assuming that the fragment size equals 120, there is just one fragment in the shortest gene and 6381 fragments in the longest one: a fragment from the first symbol to the 120-th one, from the second to 121-st, from the third to 122-nd, and so on. This way, for the fragment size equal 120 all of the genes are compared to each other many times, that is, each gene can be compared as many times as the number of fragments can be selected inside it.

In this part we started from searching for identical fragments of size equal the size of the shortest gene in the genome, that is, fragments of size 120 for Borrelia burgdorferi, 75 – for Escherischia coli, and 51 for Saccharomyces cerevisiae S288c. The figures with sizes of groups of identical fragments sorted in descending order are presented in Figure 7.

Refer to caption
Refer to caption
Refer to caption
Figure 7: Graph of sizes of the groups of identical gene fragments in Borrelia burgdorferi (left), Escherischia coli (center) and Saccharomyces cerevisiae S288c (right) — a zoom on the first 1000 sizes of groups sorted in descending order

In spite of the fact that in this comparisons the fragment starting position in compared genes may vary, both graphs in Figure 6 and graphs in Figure 7 present similar ”staircase” type. This confirms the stability and flexibility of gene structures.

Refer to caption
Refer to caption
Refer to caption
Figure 8: Graph of sizes of the groups of identical gene fragments in Borrelia burgdorferi (the left graph — fragment sizes vary from 120 to 14), Escherischia coli (the right graph — fragment sizes vary from 75 to 14) and Saccharomyces cerevisiae S288c (the bottom graph — fragment sizes vary from 51 to 14)

It would be interesting to know how the characteristic of identical fragment numbers in a gene varies respectively to different sizes of the fragment size. Therefore, for each of the three genomes we evaluated numbers of identical fragments equal or smaller that the size of the shortest gene. For Borrelia burgdorferi the fragment size varied from 120 to 14, for Escherischia coli – from 75 to 14, and for Saccharomyces cerevisiae S288c – from 51 to 14. In every case the lower limit for the fragment length was set to 14 because the probability of fragment appearance in the gene grow rapidly for shorter fragments due to the limited number of permutations of four symbols in such a short sequence. Aggregated graphs with histograms of sizes of groups of identical fragments sorted in descending order are depicted in Figure 8. The three graphs show a series of histograms obtained for each of the possible fragment sizes for the three genomes. The histograms all together build a surface over the two-dimensional domain where one axis represents a group number in the list sorted by the group size and the other – a fragment size.

These figures present two dimensional surfaces with very regular curves. The graph shape and its regularity confirm stability and flexibility of gene structures.

V Final remarks

In this paper we follow the statistical description of genes in three selected genomes aimed at finding arguments supporting thesis that the entropy loss in gene ”language” is possible and the basic structure of genomes remain stabile. We show that histograms of gene length and word length are similar and both can be modeled by Asymmetric Inverse Gaussian distributions. Additionally, tails of distributions for gene lengths appear to be of the power type. We test selected genomes and describe the gene code statistics dealing with the number of repetitions and replicated gene fragments in genomes. Considering that distributions of hyperbolic type describe property of stability and flexibility, we show that histograms of identical genes interpreted as probabilities of gene appearance in the genome are of hyperbolic type in the sense of definition piiαp_{i}\geq i^{-\alpha}, and show example values of α\alpha for each of the genomes. We show also regularity and similarity of histograms with numbers of replicated genes and replicated gene fragments. All the obtained results confirm our thesis.

Acknowledgements.
One of the authors (K. L-W) would like to thank prof. Franco Ferrari for discussions.

References

  • Holste et al. (2003) D. Holste, I. Grosse, S. Beirer, P. Schieg,  and H. Herzel, Phys. Rev. E, 67, 061913 (2003).
  • Li and Kaneko (1992) W. Li and K. Kaneko, EPL (Europhysics Letters), 17, 655 (1992).
  • Mantegna et al. (1994) R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons,  and H. E. Stanley, Phys. Rev. Lett., 73, 3169 (1994).
  • Mantegna et al. (1995) R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons,  and H. E. Stanley, Phys. Rev. E, 52, 2939 (1995).
  • Messer et al. (2005) P. W. Messer, P. F. Arndt,  and M. Lässig, Phys. Rev. Lett., 94, 138103 (2005).
  • Peng et al. (1992) C. K. Peng, S. Buldyrev, A. Goldberger, S. Havlin, F. Sciortino, M. Simons,  and H. E. Stanley, Nature, 365, 168 (1992).
  • Peng et al. (1994) C. K. Peng, S. V. Buldyrev, S. Havlin, M. Simons, H. E. Stanley,  and A. L. Goldberger, Phys. Rev. E, 49, 1685 (1994).
  • de Sousa Vieira (1999) M. de Sousa Vieira, Phys. Rev. E, 60, 5932 (1999).
  • Stanley et al. (1999) H. Stanley, S. Buldyrev, A. Goldberger, S. Havlin, C.-K. Peng,  and M. Simons, Physica A: Statistical Mechanics and its Applications, 273, 1 (1999).
  • Voss (1992) R. Voss, Phys Rev Lett., 68, 3805 (1992).
  • Harremoës and Topsøe (2001) P. Harremoës and F. Topsøe, Entropy, 3, 191 (2001).
  • (12) “The national center for biotechnology information, GeneBank,” ftp://ftp.ncbi.nih.gov.
  • Shannon (1948) C. E. Shannon, The Bell System Technical Journal, 27, 379–423, 623–656 (1948).
  • Cover and Thomas (1991) T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley and Sons Inc., New York, 1991).
  • Zipf (1949) K. G. Zipf, Human Behaviour and the Principle of Least Effort (Addison-Wesley, Cambridge, 1949).
  • (16) “WinEdt dictionaries,” http://www.winedt.org/Dict/.