
Algorithms for normalized multiple sequence alignments

Eloi Araujo^1 (corresponding author), Diego P. Rubert^1,
Luiz Rozante^2, Fábio V. Martinez^1
1 Faculdade de Computação
Universidade Federal de Mato Grosso do Sul
Brazil
2 Centro de Matemática, Computação e Cognição
Universidade Federal do ABC
Brazil
Abstract

Sequence alignment supports numerous tasks in bioinformatics, natural language processing, pattern recognition, social sciences, and other fields. While the alignment of two sequences may be performed swiftly in many applications, the simultaneous alignment of multiple sequences is naturally more intricate. Although most multiple sequence alignment (MSA) formulations are NP-hard, several approaches have been developed, as they can outperform pairwise alignment methods or are necessary for some applications. Taking into account not only similarities but also the lengths of the compared sequences (i.e. normalization) can provide better alignment results than both unnormalized and post-normalized approaches. While some normalized methods have been developed for pairwise sequence alignment, none have been proposed for MSA. This work is a first effort towards the development of normalized methods for MSA. We discuss multiple aspects of normalized multiple sequence alignment (NMSA). We define three new criteria for computing normalized scores when aligning multiple sequences, show that NMSA is NP-hard under each of them, and give exact algorithms for solving NMSA with those criteria. In addition, we provide approximation algorithms for MSA and NMSA for some classes of scoring matrices.

Keywords: Multiple sequence alignment (MSA), Normalized multiple sequence alignment (NMSA), Algorithms and complexity

1 Introduction

Sequence alignment lies at the foundation of bioinformatics. Several procedures rely on alignment methods for a range of distinct purposes, such as detection of sequence homology, secondary structure prediction, phylogenetic analysis, identification of conserved motifs or genome assembly. On the other hand, alignment techniques have also been reshaped and found applications in other fields, such as natural language processing, pattern recognition, or social sciences [AT00, AG97, BL02, MV93].

Given its range of applications in bioinformatics, extensive efforts have been made to improve existing methods or develop novel ones for sequence alignment. The simplest ones compare a pair of sequences in time polynomial in their lengths, usually trying to find editing operations (insertions, deletions, and substitutions of symbols) that transform one sequence into another while maximizing or minimizing some objective function called edit distance [HAR09]. This concept can naturally be generalized to align multiple sequences [WLXZ15], adding a new layer of algorithmic complexity, though. In this case, most multiple sequence alignment (MSA) formulations lead to NP-hard problems [Eli06]. Nevertheless, a variety of methods suitable for aligning multiple sequences have been developed, as they can outperform pairwise alignment methods on tasks such as phylogenetic inference [OR06], secondary structure prediction [CB99] or identification of conserved regions [SWD+11].

In order to overcome the cost of exact solutions, a number of MSA heuristics have been developed in recent years, most of them using the so-called progressive or iterative methods [HTHI95, SH14, TPP99, WOHN06]. Experimental data suggest that the robustness and accuracy of heuristics can still be improved, however [WLXZ15].

Most approaches for pairwise sequence alignment define edit distances as absolute values, lacking a normalization that would make edit distances relative to the lengths of the sequences. However, some applications may require sequence lengths to be taken into account. For instance, a difference of one symbol between sequences of length 5 is more significant than between sequences of length 1000. In addition, experiments suggest that normalized edit distances can provide better results than both unnormalized and post-normalized edit distances [MV93]. While normalized edit distances have been developed for pairwise sequence alignment [AE99, MV93], none have been proposed for MSA to the best of our knowledge.

In this work, we propose exact and approximation algorithms for normalized MSA (NMSA). This is a first step towards the development of methods that take into account the lengths of sequences for computing edit distances when multiple sequences are compared. The remainder of this paper is organized as follows. Section 2 introduces concepts related to sequence alignment and presents normalized scores for NMSA, followed by the complexity analysis of NMSA using those scores in Section 3. Next, Sections 4 and 5 describe exact and approximation algorithms, respectively. Section 6 closes the paper with the conclusion and prospects for future work.

2 Preliminaries

An alphabet Σ is a finite non-empty set of symbols. A finite sequence s with n symbols in Σ is written s(1)⋯s(n). We say that the length of s, denoted by |s|, is n. The sequence s(p)⋯s(q), with 1 ≤ p ≤ q ≤ n, is denoted by s(p:q). If p > q, then s(p:q) is the empty sequence, whose length is zero and which is denoted by ε. We denote the sequence resulting from the concatenation of sequences s and t by s·t = st. A sequence of n symbols a is denoted by a^n. A k-tuple S over Σ* is called a k-sequence and we write s_1,…,s_k to refer to S, where s_i is the i-th sequence in S. We write S = ∅ if every sequence of S is the empty sequence.

Let Σ_- := Σ ∪ {-}, where - ∉ Σ and the symbol - is called a space. Let S = s_1,…,s_k be a k-sequence. An alignment of S is a k-tuple A = [s′_1,…,s′_k] over Σ_-*, where

(a) each sequence s′_h is obtained by inserting spaces in s_h;

(b) |s′_h| = |s′_i| for each pair h, i, with 1 ≤ h, i ≤ k; and

(c) there is no j, 1 ≤ j ≤ |s′_1|, such that s′_1(j) = ⋯ = s′_k(j) = -.

Notice that alignments, which are k-tuples over Σ_-*, are written enclosed by square brackets "[ ]". The sequence A(j) = [s′_1(j),…,s′_k(j)] ∈ Σ_-^k is the column j of the alignment A = [s′_1,…,s′_k], and A[j_1:j_2] is the alignment defined by the columns j_1, j_1+1,…,j_2 of A. We say that the pair [s′_h(j), s′_i(j)] aligns in A or, simply, that s′_h(j) and s′_i(j) are aligned in A, and |A| = |s′_i| is the length of the alignment A. It is easy to check that max_i{|s_i|} ≤ |A| ≤ ∑_i |s_i|. We denote by 𝒜_S the set of all alignments of S.

An alignment can be used to represent editing operations of insertions, deletions and substitutions of symbols in sequences, where the symbol - represents insertions or deletions. An alignment can also be represented in the matrix format by placing one sequence above another. Thus, the alignments

[aaa-, ab--, -cac] and [-aaa-, ab---, -ca-c]

of the 3-sequence aaa, ab, cac can be represented, respectively, as

[ a a a -        [ - a a a -
  a b - -   and    a b - - -
  - c a c ]        - c a - c ] .

Let I = {i_1,…,i_m} ⊆ {1,…,k} be a set of indices such that i_1 < ⋯ < i_m and let A = [s′_1,…,s′_k] be an alignment of a k-sequence S = s_1,…,s_k. We write S_I to denote the m-sequence s_{i_1},…,s_{i_m}. The alignment of S_I induced by A is the alignment A_I obtained from A by keeping only the rows corresponding to the sequences in S_I and, from the resulting structure, removing the columns where all symbols are -. In the previous example, with A = [aaa-, ab--, -cac] an alignment of aaa, ab, cac, we have that

[ a a a
  a b - ]

is an alignment of aaa, ab induced by A.
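As a concrete illustration, the following minimal Python sketch (the function name and the list-of-strings representation are our own choices, not notation from the paper) computes an induced alignment by keeping the chosen rows and discarding the columns that contain only spaces.

```python
# Sketch (our own code): the induced alignment A_I keeps the rows indexed by I
# and drops the columns in which every remaining symbol is a space '-'.

def induced_alignment(A, I):
    """A: list of equal-length strings over Sigma plus '-'; I: 0-based row indices."""
    rows = [A[i] for i in I]
    kept = [col for col in zip(*rows) if any(c != '-' for c in col)]
    return [''.join(chars) for chars in zip(*kept)] if kept else [''] * len(I)

A = ["aaa-", "ab--", "-cac"]          # the alignment A of aaa, ab, cac above
print(induced_alignment(A, [0, 1]))   # ['aaa', 'ab-'], matching the example
```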

A k-vector ȷ⃗ = [j_1,…,j_k] is a k-tuple where j_i ∈ ℕ = {0,1,2,…}. We say that j_i is the i-th element of ȷ⃗. The k-vector 0⃗ is such that all its elements are zero. If ȷ⃗ and h⃗ are k-vectors, we write ȷ⃗ ≤ h⃗ if j_i ≤ h_i for each i, and ȷ⃗ < h⃗ if ȷ⃗ ≤ h⃗ and ȷ⃗ ≠ h⃗. A sequence of k-vectors ȷ⃗_1, ȷ⃗_2,… is in topological order if ȷ⃗_i < ȷ⃗_h implies i < h. Given two k-vectors, say a⃗ = [a_1, a_2,…, a_k] and b⃗ = [b_1, b_2,…, b_k], we say that a⃗ precedes b⃗ when there exists l ∈ ℕ such that a_i = b_i for each i > l and a_l < b_l. A sequence of k-vectors ȷ⃗_1, ȷ⃗_2,… is in lexicographical order if ȷ⃗_i precedes ȷ⃗_h for each i < h. Clearly, a lexicographical order is a special case of topological order.

Let S = s_1,…,s_k be a k-sequence with n_i = |s_i| for each i, and call n⃗ = [n_1,…,n_k] the length of S. Let V_n⃗ = {ȷ⃗ : ȷ⃗ ≤ n⃗}. Therefore, |V_n⃗| = ∏_i (|s_i| + 1), which implies that if n_i = n for all i, then |V_n⃗| = (n+1)^k. Define S(ȷ⃗) = s_1(j_1),…,s_k(j_k), the column ȷ⃗ of S, and say that S(1:ȷ⃗) = s_1(1:j_1),…,s_k(1:j_k) is the prefix of S ending in ȷ⃗. Thus, S = S(1:n⃗). Besides that, if A is an alignment and ȷ⃗ = [j, j,…, j], then A[ȷ⃗] = A(j).

Denote by B^k the set of k-vectors [b_1,…,b_k] with b_i ∈ {0,1} for each i. Now, for b⃗ ≤ ȷ⃗, where ȷ⃗ = [j_1,…,j_k], define

b⃗ · S(ȷ⃗) = [x_1,…,x_k] ∈ Σ_-^k,

such that

x_i = s_i(j_i) if b_i = 1, and x_i = - otherwise.

Therefore, given an alignment A of S(1:n⃗), there exists b⃗ ∈ B^k, with b⃗ ≤ n⃗, such that A(|A|) = b⃗ · S(n⃗). In this case, notice that b_i = 1 if and only if s_i(n_i) is in the i-th row of the last column of A, and we say that b⃗ defines the column |A| of the alignment A. For b⃗ ≤ ȷ⃗, we also define the operation

ȷ⃗ − b⃗ = [j_1 − b_1,…, j_k − b_k].

Notice that |B^k| = 2^k. Figure 1 shows an example of an alignment using vectors to define columns and operations.

A = [ a b c - -
      - b c a -
      b - b - a
      a a a a a
      c - - - - ] ,   with A = [ A[1:4] | b⃗_5 · S(n⃗) ] for b⃗_5 = [0,0,1,1,0].
Figure 1: The length of the alignment A of S = abc, bca, bba, aaaaa, c is n⃗ = [|abc|, |bca|, |bba|, |aaaaa|, |c|] = [3,3,3,5,1]. Here, S(n⃗) = [c, a, a, a, c] and, since b⃗_5 = [0,0,1,1,0] defines the last column of A, we have S(n⃗ − b⃗_5) = abc, bca, bb, aaaa, c. The alignment defined by the first four columns of A is A[1:4], which is an alignment of S(1:n⃗ − b⃗_5), and the last column of A is b⃗_5 · S(n⃗). It is interesting to note that A can be completely defined by the 5-vectors b⃗_1 = [1,0,1,1,1], b⃗_2 = [1,1,0,1,0], b⃗_3 = [1,1,1,1,0], b⃗_4 = [0,1,0,1,0], b⃗_5 = [0,0,1,1,0], where b⃗_i represents column i of the alignment A.
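The construction in Figure 1 can be replayed programmatically. The sketch below (identifiers are ours) rebuilds A from S and the column-defining vectors b⃗_1,…,b⃗_5, consuming the next unread symbol of s_i whenever the i-th entry of the current vector is 1.

```python
# Sketch: rebuild the alignment of Figure 1 from S and its column-defining
# vectors; each vector contributes one column of the alignment.

def alignment_from_vectors(S, bs):
    pos = [0] * len(S)                  # next unread position in each s_i
    rows = [''] * len(S)
    for b in bs:                        # each vector b defines one column
        for i, take in enumerate(b):
            if take:
                rows[i] += S[i][pos[i]]
                pos[i] += 1
            else:
                rows[i] += '-'
    assert pos == [len(s) for s in S]   # every symbol of S must be consumed
    return rows

S = ["abc", "bca", "bba", "aaaaa", "c"]
bs = [[1,0,1,1,1], [1,1,0,1,0], [1,1,1,1,0], [0,1,0,1,0], [0,0,1,1,0]]
print(alignment_from_vectors(S, bs))
# ['abc--', '-bca-', 'b-b-a', 'aaaaa', 'c----']  -- the rows of A in Figure 1
```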

For a problem P, we call 𝕀_P the set of instances of P. If P is a decision problem, then P(I) ∈ {Yes, No} is the image of an instance I. If P is an optimization (minimization) problem, there is a set Sol(I) for each instance I, a function v assigning a non-negative rational number to each X ∈ Sol(I), and a function opt_v(I) = min_{X∈Sol(I)} {v(X)}. We write opt instead of opt_v if v is clear from the context. Let A(I) = v(X), for some X ∈ Sol(I), be the value of a solution computed by an algorithm A on input I. We say that A is an α-approximation for P if A(I) ≤ α·opt(I) for each I ∈ 𝕀_P. We say that α is an approximation factor for P.

The alignment problem is a collection of decision and optimization problems whose instances are finite subsets of Σ*, with Sol(S) = 𝒜_S for each instance S. The function v, used for scoring alignments, is called a criterion for P and we call v[A] the cost of the alignment A. A v-optimal alignment A of S is such that v[A] = opt(S). Thus, we state the following general optimization problem using the criterion v and an integer k:

Problem 1 (Alignment with criterion vv).

Given a k-sequence S, find the cost of a v-optimal alignment of S.

We also need the decision version of the alignment problem with criterion v, where we are given a k-sequence S and a number d ∈ ℚ≥0, and we want to decide whether there exists an alignment A of S such that v[A] ≤ d.

Usually the cost v of an alignment is defined from a scoring matrix. A scoring matrix γ is a matrix whose elements are rational numbers, with rows and columns indexed by the symbols in Σ_-. For a, b ∈ Σ_- and a scoring matrix γ, we denote by γ_{a→b} the entry of γ in row a and column b. The value γ_{a→b} defines the score for a substitution if a, b ∈ Σ, for an insertion if a = -, and for a deletion if b = -. The entry γ_{-→-} is not defined.

2.1 vA_γ- and vN_γ-scores for alignments of 2-sequences

Consider a scoring matrix γ. Let S = s, t ∈ Σ* be a 2-sequence whose length is n⃗ = [n = |s|, m = |t|]. A simple criterion for scoring alignments of 2-sequences using the function vA_γ follows. For an alignment [s′, t′] of s, t we define

vA_γ[s′, t′] = ∑_{j=1}^{|[s′,t′]|} γ_{s′(j)→t′(j)}.

We say that vA_γ[s′, t′] is a vA_γ-score of s, t. The optimal function for this criterion is called the unweighted edit distance and denoted by optA_γ, and an optA_γ-optimal alignment is also called an A-optimal alignment. When γ_{a→a} = 0 and γ_{a→b} = 1 for each a ≠ b ∈ Σ_-, optA_γ is also known as the Levenshtein distance [Lev66].
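For concreteness, the sketch below evaluates the vA_γ-score of a given pairwise alignment column by column; the dict-based encoding of γ is our own choice, instantiated here with the Levenshtein matrix.

```python
# Sketch of the vA_gamma-score: the column-wise sum of scoring-matrix entries.
# gamma below is the Levenshtein matrix (0 on the diagonal, 1 elsewhere).

SIGMA = "ab"
gamma = {(x, y): 0 if x == y else 1
         for x in SIGMA + "-" for y in SIGMA + "-" if (x, y) != ("-", "-")}

def vA(gamma, s_al, t_al):
    """Score of the pairwise alignment [s_al, t_al], both strings over Sigma_-."""
    assert len(s_al) == len(t_al)
    return sum(gamma[(x, y)] for x, y in zip(s_al, t_al))

print(vA(gamma, "aba-", "-bab"))   # columns score 1 + 0 + 0 + 1 = 2
```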

Now, suppose that n ≥ m. Needleman and Wunsch [NW70] proposed an O(n²)-time algorithm for computing

optA_γ(s, t) = optA_γ(S) = min_{A∈𝒜_S} vA_γ[A]
             = { 0                                                              if S = ∅,
                 min_{b⃗∈B^k, b⃗≤n⃗} { optA_γ(S(1:n⃗−b⃗)) + vA_γ[b⃗·S(n⃗)] }     otherwise.

If optA_γ is the Levenshtein distance, Masek and Paterson [MP80] presented an O(n²/log n)-time algorithm using the "Four Russians" method. Crochemore, Landau and Ziv-Ukelson [CLZU03] extended the optA_γ-score to scoring matrices with real numbers and described an O(n²/log n)-time algorithm. Indeed, there is no algorithm to determine optA_γ(s, t) in O(n^{2−δ}) time for any δ > 0, unless SETH is false [BI18]. Andoni, Krauthgamer and Onak [AKO10] described a nearly linear time algorithm approximating the edit distance within an approximation factor poly(log n). Later, Chakraborty et al. [CDG+20] presented an O(n^{2−2/7})-time α-approximation for edit distance, where α is constant.
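A minimal sketch of the quadratic dynamic program behind the recurrence above, written for k = 2 so that the nonzero vectors of B² become the three usual predecessors of a cell; the dict-based γ and the function name are our own choices.

```python
# Sketch of the recurrence for optA_gamma with k = 2: D[i][j] holds
# optA_gamma(s(1:i), t(1:j)); b ranges over the nonzero vectors of B^2.

SIGMA = "ab"
gamma = {(x, y): 0 if x == y else 1
         for x in SIGMA + "-" for y in SIGMA + "-" if (x, y) != ("-", "-")}

def optA(gamma, s, t):
    n, m = len(s), len(t)
    D = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == j == 0:
                continue
            cands = []
            if i > 0 and j > 0:                                    # b = [1, 1]
                cands.append(D[i-1][j-1] + gamma[(s[i-1], t[j-1])])
            if i > 0:                                              # b = [1, 0]
                cands.append(D[i-1][j] + gamma[(s[i-1], "-")])
            if j > 0:                                              # b = [0, 1]
                cands.append(D[i][j-1] + gamma[("-", t[j-1])])
            D[i][j] = min(cands)
    return D[n][m]

print(optA(gamma, "aab", "ab"))   # 1 with the Levenshtein matrix
```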

Marzal and Vidal [MV93] defined another criterion for scoring alignments of two sequences, called the vN_γ-score, which is a normalization of the vA_γ-score, as follows:

vN_γ[A] = 0 if |A| = 0, and vN_γ[A] = vA_γ[A] / |A| otherwise.

The optimal function for this criterion is known as the normalized edit distance and is denoted by optN_γ, and an optN_γ-optimal alignment is also called an N-optimal alignment of S = s, t.

A naive dynamic programming algorithm was proposed by Marzal and Vidal [MV93] to obtain an N-optimal alignment of two sequences in O(n³) time. Using fractional programming, Vidal, Marzal and Aibar [VMA95] presented an algorithm with running time O(n³) that requires only O(n²) time in practice, similarly to the classical (unnormalized) edit distance algorithm. Further, Arslan and Egecioglu [AE99] described an O(n² log n)-time algorithm for this problem.
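The following sketch is in the spirit of the naive cubic approach: it adds the alignment length as an extra dimension of the table and takes the best ratio at the end. It is meant only to illustrate the definition of optN_γ, not to reproduce the cited algorithms.

```python
# Sketch: D[L][i][j] is the best vA-score over alignments of s(1:i), t(1:j)
# that use exactly L columns; optN is the minimum of D[L][n][m] / L.

SIGMA = "ab"
gamma = {(x, y): 0 if x == y else 1
         for x in SIGMA + "-" for y in SIGMA + "-" if (x, y) != ("-", "-")}

def optN(gamma, s, t):
    n, m = len(s), len(t)
    if n == m == 0:
        return 0.0
    INF = float("inf")
    D = [[[INF] * (m + 1) for _ in range(n + 1)] for _ in range(n + m + 1)]
    D[0][0][0] = 0
    for L in range(1, n + m + 1):
        for i in range(n + 1):
            for j in range(m + 1):
                best = INF
                if i > 0 and j > 0:
                    best = min(best, D[L-1][i-1][j-1] + gamma[(s[i-1], t[j-1])])
                if i > 0:
                    best = min(best, D[L-1][i-1][j] + gamma[(s[i-1], "-")])
                if j > 0:
                    best = min(best, D[L-1][i][j-1] + gamma[("-", t[j-1])])
                D[L][i][j] = best
    return min(D[L][n][m] / L for L in range(max(n, m), n + m + 1))

print(optN(gamma, "aaa", "aab"))   # 1/3: the length-3 alignment with one mismatch
```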

Let S = s, t be a 2-sequence whose length is n⃗. A simple heuristic for approximating optN_γ(s, t) first computes optA_γ(s, t) and then divides it by the maximum length L(n⃗) = L(S) of an A-optimal alignment of S. In order to find L(S), we use the same dynamic programming strategy used for computing optA_γ(s, t), i.e.,

L(S) = L(n⃗) = 0 if S = ∅, and otherwise
L(S) = L(n⃗) = 1 + max { L(n⃗ − b⃗) : b⃗ ∈ B^k, b⃗ ≤ n⃗, optA_γ(S) = optA_γ(S(1:n⃗−b⃗)) + vA_γ[b⃗·S(n⃗)] },

in O(nm) time. A sketch of this heuristic is given below, and Theorem 2.1 shows that it is a simple approximation algorithm for finding an N-optimal alignment.
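In the sketch (same dict-based encoding of γ as before; identifiers are ours), the table Lmax is filled alongside D so that Lmax(i, j) is the maximum length of an A-optimal alignment of the prefixes s(1:i), t(1:j).

```python
# Sketch of the heuristic: fill the optA table D and, in parallel, the table
# Lmax with the maximum length of an A-optimal alignment of each prefix pair;
# the heuristic value is D[n][m] / Lmax[n][m].

SIGMA = "ab"
gamma = {(x, y): 0 if x == y else 1
         for x in SIGMA + "-" for y in SIGMA + "-" if (x, y) != ("-", "-")}

def normalized_heuristic(gamma, s, t):
    n, m = len(s), len(t)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    Lmax = [[0] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == j == 0:
                continue
            cands = []                                  # (score, predecessor)
            if i > 0 and j > 0:
                cands.append((D[i-1][j-1] + gamma[(s[i-1], t[j-1])], (i-1, j-1)))
            if i > 0:
                cands.append((D[i-1][j] + gamma[(s[i-1], "-")], (i-1, j)))
            if j > 0:
                cands.append((D[i][j-1] + gamma[("-", t[j-1])], (i, j-1)))
            D[i][j] = min(c for c, _ in cands)
            Lmax[i][j] = 1 + max(Lmax[pi][pj]
                                 for c, (pi, pj) in cands if c == D[i][j])
    return D[n][m] / Lmax[n][m] if n + m > 0 else 0.0

print(normalized_heuristic(gamma, "aaa", "aab"))   # 1/3 here, equal to optN
```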

Theorem 2.1.

Let s, t be a 2-sequence of length [n, m] and let L(s, t) be the maximum length of an A-optimal alignment of s, t. Then,

optA_γ(s, t) / L(s, t) ≤ 2 optN_γ(s, t),

and it can be computed in O(n²) time if n = m. Moreover, this ratio is tight, i.e., for any positive rational ε there exist a scoring matrix γ, sequences s, t and an A-optimal alignment A of s, t with maximum length such that

optA_γ(s, t) / |A| = vA_γ[A] / |A| = (2 − ε) optN_γ(s, t).
Proof.

Let A be an A-optimal alignment with maximum length, computed by the heuristic above in O(nm) time. Let B be an N-optimal alignment. Thus, vA_γ[A] ≤ vA_γ[B]. Moreover, since |B| ≤ n + m, n + m ≤ 2 max{n, m} and max{n, m} ≤ |A|, we have that |A| ≥ |B|/2. Therefore, vN_γ[A] = vA_γ[A]/|A| ≤ vA_γ[B]/|A| ≤ vA_γ[B]/(|B|/2) = 2 optN_γ(s, t).

We now present a 2-sequence s, t and a scoring matrix γ such that the solution given by the heuristic is at least (2 − ε) optN_γ(s, t) for any ε ∈ ℚ>0. Let Σ = {a, b}, let γ be a scoring matrix such that γ_{a→-} = γ_{b→-} = 1/ε and γ_{a→b} = 2/ε − 1, and let a^n, b^n ∈ Σ*, where n is a positive integer. Observe that the vA_γ-score of any alignment of (a^n, b^n) in which [a, b] is aligned in k columns is equal to 2n/ε − k. Thus, optA_γ(a^n, b^n) = min_{0≤k≤n} {2n/ε − k} = 2n/ε − n = (2/ε − 1) n, which implies that [a^n, b^n] is the A-optimal alignment with maximum length. Since optN_γ(a^n, b^n) ≤ vN_γ([a^n -^n, -^n b^n]) = 1/ε, it follows that

optA_γ(a^n, b^n) / |[a^n, b^n]| = vA_γ[a^n, b^n] / |[a^n, b^n]| = (2/ε − 1) n / n = (2 − ε)/ε = (2 − ε) vN_γ([a^n -^n, -^n b^n]) ≥ (2 − ε) optN_γ(a^n, b^n). ∎

We now define classes of scoring matrices. The most common class of scoring matrices, 𝕄^C, has the following properties: for all a, b, c ∈ Σ_-, we have (a) γ_{a→b} > 0 if a ≠ b, and γ_{a→b} = 0 if a = b; (b) γ_{a→b} = γ_{b→a}; and (c) γ_{a→c} ≤ γ_{a→b} + γ_{b→c}. The class 𝕄^W of scoring matrices is such that, for all symbols a, b, c ∈ Σ, we have (a) γ_{a→-} = γ_{-→a} > 0; (b) γ_{a→b} > 0 if a ≠ b, and γ_{a→b} = 0 if a = b; (c) if γ_{a→b} < γ_{a→-} + γ_{-→b}, then γ_{a→b} = γ_{b→a}; (d) γ_{a→-} ≤ γ_{a→b} + γ_{b→-}; and (e) min{γ_{a→c}, γ_{a→-} + γ_{-→c}} ≤ γ_{a→b} + γ_{b→c}. Moreover, the class 𝕄^N is such that (a) 𝕄^N ⊆ 𝕄^W and (b) γ_{a→-} ≤ 2 γ_{b→-} for each a, b ∈ Σ.
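The conditions defining 𝕄^C are easy to verify mechanically. The sketch below (our own dict encoding, with γ_{-→-} left undefined) checks them for a given matrix; the Levenshtein matrix, which the paper notes lies in 𝕄^C ∩ 𝕄^N, passes the test.

```python
# Sketch: check the three conditions defining the class M^C for a scoring
# matrix given as a dict over Sigma_- x Sigma_-, with the entry (-,-) undefined.

from itertools import product

def in_MC(gamma, sigma_dash):
    pairs = [(a, b) for a, b in product(sigma_dash, repeat=2) if (a, b) != ("-", "-")]
    for a, b in pairs:
        if a != b and gamma[(a, b)] <= 0:          # (a) positive off the diagonal
            return False
        if a == b and gamma[(a, b)] != 0:          # (a) zero on the diagonal
            return False
        if gamma[(a, b)] != gamma[(b, a)]:         # (b) symmetry
            return False
    for a, b, c in product(sigma_dash, repeat=3):  # (c) triangle inequality
        if ("-", "-") in {(a, c), (a, b), (b, c)}:
            continue                               # skip triples using gamma(-,-)
        if gamma[(a, c)] > gamma[(a, b)] + gamma[(b, c)]:
            return False
    return True

lev = {(x, y): 0 if x == y else 1
       for x in "ab-" for y in "ab-" if (x, y) != ("-", "-")}
print(in_MC(lev, "ab-"))   # True
```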

For a set S, we say that f : S × S → ℝ is a distance function or metric on S if f satisfies, for all s, t, u ∈ S:

1. f(s, s) = 0 (reflexivity);

2. f(s, t) > 0 if s ≠ t (strict positiveness);

3. f(s, t) = f(t, s) (symmetry); and

4. f(s, u) ≤ f(s, t) + f(t, u) (triangle inequality).

If a given criterion v depends on a scoring matrix γ and is a metric on Σ*, we say that the scoring matrix γ induces an opt_v-distance on Σ*. Sellers [Sel74] showed that matrices in 𝕄^C induce an optA_γ-distance on Σ*. Araujo and Soares [AS06] showed that γ ∈ 𝕄^W (respectively, γ ∈ 𝕄^N) if and only if γ induces an optA_γ-distance (respectively, an optN_γ-distance) on Σ*. Figure 2 shows the relationship between these classes.

(Venn diagram of the scoring-matrix classes 𝕄^W, 𝕄^C and 𝕄^N.)
Figure 2: Relationship between classes of scoring matrices. Araujo and Soares [AS06] showed that 𝕄^C ⊆ 𝕄^W, 𝕄^N ⊆ 𝕄^W, 𝕄^C ⊄ 𝕄^N and 𝕄^N ⊄ 𝕄^C. Moreover, the scoring matrix γ such that γ_{a→a} = 0 for each a and γ_{a→b} = 1 for each a ≠ b is in 𝕄^C ∩ 𝕄^N, which implies that 𝕄^C ∩ 𝕄^N ≠ ∅.

Given a scoring function v for alignments of 2-sequences s, t that depends on a scoring matrix, we say that two scoring matrices γ and ρ are equivalent considering v when v_γ[A] ≤ v_γ[B] if and only if v_ρ[A] ≤ v_ρ[B] for any pair of alignments A, B of the sequences s, t. If ρ is a matrix obtained from γ by multiplying each entry of γ by a constant c > 0, then vA_ρ[A] = c·vA_γ[A] and vN_ρ[A] = c·vN_γ[A], which implies that γ and ρ are equivalent. As a consequence, when the scoring function is vA_γ or vN_γ and it is convenient, we can suppose that all entries of γ are integers instead of rationals.

2.2 vSP_γ-score for k sequences

Consider a scoring matrix γ. Let S = s_1,…,s_k be a k-sequence whose length is n⃗ and let A = [s′_1,…,s′_k] be an alignment of S. The criterion vSP_γ, also called SP-score, for scoring the alignment A is

vSP_γ[A] = ∑_{h=1}^{k−1} ∑_{i=h+1}^{k} vA_γ[A_{h,i}].    (3)

We define optSP_γ as the optimal function for the criterion vSP_γ. An alignment A of S such that vSP_γ[A] = optSP_γ(S) is called a vSP_γ-optimal alignment. Regardless of whether we consider its decision or optimization version, we call the associated problem the multiple sequence alignment problem (MSA). Formally,

Problem 2 (Multiple sequence alignment).

Let γ be a fixed scoring matrix. Given a k-sequence S, find optSP_γ(S).

In order to compute optSP_γ, we extend the definition of vSP_γ to take a column of an alignment A = [s′_1,…,s′_k] as its parameter. Thus, vSP_γ[A(j)] = vSP_γ[A[ȷ⃗]] = ∑_{i<h} γ_{s′_i(j)→s′_h(j)}, assuming that γ_{-→-} = 0 and ȷ⃗ = [j,…,j], and

optSP_γ(S) = min_{A∈𝒜_S} vSP_γ[A]
           = { 0                                                                  if S = ∅,
               min_{b⃗∈B^k, b⃗≤n⃗} { optSP_γ(S(1:n⃗−b⃗)) + vSP_γ[b⃗·S(n⃗)] }      otherwise.    (6)

Recurrence (6) can be computed using a dynamic programming algorithm, obtaining D(ȷ⃗) = optSP_γ(S(1:ȷ⃗)) for all ȷ⃗ ≤ n⃗. This task can be performed by generating all indices of D in lexicographical order, starting with D(0⃗) = 0, as presented in Algorithm 1.

Algorithm 1
Input: S = s_1,…,s_k ∈ (Σ*)^k
Output: optSP_γ(S)
1: D(0⃗) ← 0
2: for each ȷ⃗ ≤ n⃗ in lexicographical order do
3:   D(ȷ⃗) ← min_{b⃗∈B^k, b⃗≤ȷ⃗} { D(ȷ⃗ − b⃗) + vSP_γ[b⃗·S(ȷ⃗)] }
4: return D(n⃗)

Suppose that |s_i| = n for each i. Notice that the space to store the matrix D is Θ((n+1)^k), and thus Algorithm 1 uses Θ((n+1)^k) space. Besides that, for each entry of D, Algorithm 1 checks, in the worst case, Θ(2^k) vectors b⃗, and each check spends Θ(k²) time. Therefore, its running time is O(2^k k² (n+1)^k). Observe that when the distance is small, not all entries of D need to be computed, as in Carrillo and Lipman's algorithm [CL88].
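A direct implementation of Algorithm 1 is short. In the sketch below (identifiers are ours), D is a dictionary indexed by k-vectors, itertools.product supplies a valid topological generation order, b⃗ is restricted to nonzero vectors because an alignment column cannot consist only of spaces, and γ_{-→-} is treated as 0, as above.

```python
# Sketch of Algorithm 1: D is a dict indexed by k-vectors and is filled in a
# topological order; b ranges over the nonzero 0/1 vectors with b <= j.

from itertools import product

def vSP_column(gamma, col):
    k = len(col)
    return sum(gamma[(col[h], col[i])]
               for h in range(k - 1) for i in range(h + 1, k)
               if (col[h], col[i]) != ("-", "-"))     # gamma(-,-) taken as 0

def optSP(gamma, S):
    k = len(S)
    n = tuple(len(s) for s in S)
    D = {(0,) * k: 0}
    for j in product(*(range(x + 1) for x in n)):     # smaller vectors come first
        if all(x == 0 for x in j):
            continue
        best = float("inf")
        for b in product((0, 1), repeat=k):
            if all(x == 0 for x in b) or any(bi > ji for bi, ji in zip(b, j)):
                continue
            col = tuple(S[i][j[i] - 1] if b[i] else "-" for i in range(k))
            best = min(best, D[tuple(ji - bi for ji, bi in zip(j, b))]
                             + vSP_column(gamma, col))
        D[j] = best
    return D[n]

# Example with mismatch cost 9 and space cost 10 over {a, b, c}.
gamma = {(x, y): 0 if x == y else (10 if "-" in (x, y) else 9)
         for x in "abc-" for y in "abc-" if (x, y) != ("-", "-")}
print(optSP(gamma, ["a", "b", "c"]))   # 27: the single-column alignment is optimal
```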

We can also describe an obvious but unusual variant of this problem that considers a scoring matrix for each pair of sequences. Formally, considering a scoring matrix array γ⃗ = [γ^(12), γ^(13),…, γ^(1k), γ^(23),…, γ^(2k),…, γ^((k−1)k)] and an alignment A of a k-sequence,

vSP_γ⃗[A] = ∑_{h=1}^{k−1} ∑_{i=h+1}^{k} vA_{γ^(hi)}[A_{h,i}].

Thus, given a k-sequence S, we ask for optSP_γ⃗(S) = min_{A∈𝒜_S} {vSP_γ⃗[A]}. Algorithm 1 can be easily modified to solve this extended version with the same time and space complexity. This is important here because this variant is used as a subroutine in one of the normalized versions.

2.3 V^z_γ-scores for k sequences

In this section we define new criteria to normalize the vSP_γ-score of a multiple alignment. The new criteria take into account the lengths of the alignments, as follows:

V^1_γ[A] = 0 if |A| = 0, and V^1_γ[A] = vSP_γ[A] / |A| otherwise;    (9)

V^2_γ[A] = ∑_{h=1}^{k−1} ∑_{i=h+1}^{k} vN_γ[A_{h,i}];    (10)

V^3_γ[A] = 0 if |A| = 0, and V^3_γ[A] = vSP_γ[A] / (∑_{h=1}^{k−1} ∑_{i=h+1}^{k} |A_{h,i}|) otherwise.    (13)

We define optNSP^z_γ as the optimal function for the criterion V^z_γ, i.e., for a given k-sequence S,

optNSP^z_γ(S) = min_{A∈𝒜_S} V^z_γ[A].

An alignment A of S such that V^z_γ[A] = optNSP^z_γ(S) is called a V^z_γ-optimal alignment. Moreover, regardless of whether we consider its decision or optimization version, we use the criterion V^z_γ to define the normalized multiple sequence alignment problem (NMSA-z), for z = 1, 2, 3. Formally, for a fixed scoring matrix γ and z ∈ {1, 2, 3},

Problem 3 (Normalized multiple sequence alignment with score V^z_γ).

Given a k-sequence S, find optNSP^z_γ(S).

An interesting question is whether the definitions above represent the same criterion, i.e., would an optimal alignment for a given criterion V^z_γ also be an optimal alignment for another criterion V^{z′}_γ, z ≠ z′, regardless of the sequences and scoring matrices? The answer is no, and Figure 3 shows examples that support this claim.

γ:
     a   b   c   -
 a   0   9   9  10
 b   9   0   9  10
 c   9   9   0  10
 -  10  10  10   0

A = [ a       B = [ a -       C = [ a - -
      b             b -             - b -
      c ]           - c ]           - - c ]

δ:
     a   b   c   -
 a   0   7   7   9
 b   7   0   7   9
 c   7   7   0   9
 -   9   9   9   0

D = [ a b c       E = [ a b c -       F = [ a b c - -
      a c b             a - c b             a - c b -
      c b a ]           c b a - ]           - - c b a ]
Figure 3: Observe first that, for each z, one of A, B and C is a V^z_γ-optimal alignment of the 3-sequence a, b, c. Besides, since V^1_γ[A] = 27.0, V^2_γ[A] = 27.0, V^3_γ[A] = 9.0; V^1_γ[B] = 24.5, V^2_γ[B] = 29.0, V^3_γ[B] = 9.8; and V^1_γ[C] = 20.0, V^2_γ[C] = 30.0, V^3_γ[C] = 10.0, we have that C is an optimal alignment for criterion V^1_γ but not for criteria V^2_γ or V^3_γ, which implies that an optimal alignment for criterion V^1_γ can differ from the optimal alignments for criteria V^2_γ and V^3_γ. Now, observe that, for each z, one of D, E and F is a V^z_δ-optimal alignment of the 3-sequence abc, acb, cba. Besides, since V^2_δ[D] = 16.33, V^3_δ[D] = 5.44; V^2_δ[E] = 17.16, V^3_δ[E] = 5.81; and V^2_δ[F] = 16.20, V^3_δ[F] = 5.53, we have that D is an optimal alignment for criterion V^3_δ but not for criterion V^2_δ, which implies that an optimal alignment for criterion V^2_δ can differ from one for criterion V^3_δ.
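The values quoted in Figure 3 can be checked directly. The sketch below (helper names are ours) computes V^1_γ, V^2_γ and V^3_γ for an alignment given as a list of rows and reproduces the scores of A, B and C.

```python
# Sketch: the three normalized scores of an explicit alignment, reproducing
# the values quoted for A, B and C in Figure 3.

def induced_cols(A, h, i):                    # columns of the induced A_{h,i}
    return [c for c in zip(A[h], A[i]) if c != ("-", "-")]

def vA_cols(gamma, cols):
    return sum(gamma[c] for c in cols)

def vN_cols(gamma, cols):
    return vA_cols(gamma, cols) / len(cols) if cols else 0.0

def V_scores(gamma, A):
    k = len(A)
    pairs = [(h, i) for h in range(k - 1) for i in range(h + 1, k)]
    vSP = sum(vA_cols(gamma, induced_cols(A, h, i)) for h, i in pairs)
    V1 = vSP / len(A[0]) if A[0] else 0.0
    V2 = sum(vN_cols(gamma, induced_cols(A, h, i)) for h, i in pairs)
    V3 = vSP / sum(len(induced_cols(A, h, i)) for h, i in pairs) if A[0] else 0.0
    return V1, V2, V3

gamma = {(x, y): 0 if x == y else (10 if "-" in (x, y) else 9)
         for x in "abc-" for y in "abc-" if (x, y) != ("-", "-")}
for A in (["a", "b", "c"], ["a-", "b-", "-c"], ["a--", "-b-", "--c"]):
    print(V_scores(gamma, A))
# (27.0, 27.0, 9.0), (24.5, 29.0, 9.8), (20.0, 30.0, 10.0), as in Figure 3
```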

3 Complexity

We now study the complexity of the multiple sequence alignment problem for each new criterion defined in Section 2. We consider the decision versions of the computational problems and prove that NMSA-z is NP-complete for each z, even when the following additional restrictions on the scoring matrix γ hold: γ_{a→b} = γ_{b→a}, and γ_{a→b} = 0 if and only if a = b, for each pair a, b ∈ Σ_-. Elias [Eli06] showed that, even under such restrictions, MSA is NP-complete. We start by showing a polynomial-time reduction from MSA to NMSA-z.

For an instance (S, C) of MSA (respectively, NMSA-z), where S = s_1,…,s_k is a k-sequence and C is an integer, MSA(S, C) (respectively, NMSA-z(S, C)) is the decision problem asking whether there exists an alignment A of S such that vSP_γ[A] ≤ C (respectively, V^z_γ[A] ≤ C).

Consider a fixed alphabet Σ and a scoring matrix γ satisfying the restrictions above, namely γ_{a→b} = γ_{b→a}, and γ_{a→b} = 0 if and only if a = b. We also assume that each entry of γ is an integer. Let σ ∉ Σ_- be a new symbol, let Σ^σ = Σ ∪ {σ}, and let G be the maximum entry of γ. We define a scoring matrix γ^σ such that γ^σ_{a→b} = γ_{a→b}, γ^σ_{a→σ} = γ^σ_{σ→a} = G and γ^σ_{σ→σ} = 0 for each pair a, b ∈ Σ_-. Notice that γ^σ_{a→b} = γ^σ_{b→a}, and γ^σ_{a→b} = 0 if and only if a = b, for each pair a, b ∈ Σ^σ_-, and therefore G ≥ 1. Also, we denote S^L = s_1σ^L,…,s_kσ^L, where L = Nk²MG, M = max_i{|s_i|} and N = (k choose 2) M.

Let A be an alignment of S^L. A σ-column in A is a column where every symbol is equal to σ. The tail of A is the alignment A[j+1:|A|] such that each of its columns is a σ-column but A(j) is not; in this case the tail length of A is |A| − j and the column j is the tail base. We say that an alignment of S^L is canonical if its tail length is L. If A = [s″_1,…,s″_k] is an alignment of S, then we denote by A^L the canonical alignment [s″_1σ^L,…,s″_kσ^L] of S^L, i.e., A^L[1:|A|] = A and A^L[|A|+1:|A^L|] = [σ^L, σ^L,…, σ^L]. Notice that |A^L| ≥ L. We then establish below an upper bound for vSP_{γ^σ}[A^L] and vSP_γ[A].

Proposition 3.1.

Let A be an alignment of a k-sequence S = s_1, s_2,…, s_k. Then, vSP_{γ^σ}[A^L] = vSP_γ[A] ≤ k²MG.

Proof.

Suppose that A = [s′_1,…,s′_k]; then A^L = [s′_1σ^L,…,s′_kσ^L]. Because M = max_i |s_i|, each alignment induced by two sequences of A has at most 2M columns. Moreover, each entry of γ is at most G. It follows that vA_γ[s′_h, s′_i] = vA_{γ^σ}[s′_hσ^L, s′_iσ^L] ≤ 2MG for each pair h, i, and then

vSP_{γ^σ}[A^L] = vSP_γ[A] = ∑_{h=1}^{k−1} ∑_{i=h+1}^{k} vA_γ[A_{h,i}] ≤ ∑_{h=1}^{k−1} ∑_{i=h+1}^{k} 2MG = (k choose 2)·2MG ≤ k²MG. ∎

The two following results are useful to prove Theorem 3.4, which is the main result of this section. Recall that L = Nk²MG, and let C¹ := C² := C/L and C³ := C/((k choose 2) L).

Lemma 3.2.

If C ≥ k²MG, then MSA(S, C) = NMSA-z(S^L, C^z) = Yes, for each z = 1, 2, 3.

Proof.

Suppose that C ≥ k²MG. Let A be an alignment of S. From Proposition 3.1, vSP_γ[A] = vSP_{γ^σ}[A^L] ≤ k²MG ≤ C, which implies that MSA(S, C) = Yes. Since vSP_{γ^σ}[A^L] ≤ C and |A^L| ≥ L, we have that V^1_{γ^σ}[A^L] = vSP_{γ^σ}[A^L]/|A^L| ≤ vSP_{γ^σ}[A^L]/L ≤ C/L = C¹, and then NMSA-1(S^L, C¹) = Yes. Since vSP_{γ^σ}[A^L] ≤ C and |A^L_{h,i}| ≥ L for each pair h, i, we have

V^2_{γ^σ}[A^L] = ∑_{h=1}^{k−1} ∑_{i=h+1}^{k} vA_{γ^σ}[A^L_{h,i}] / |A^L_{h,i}| ≤ (∑_{h=1}^{k−1} ∑_{i=h+1}^{k} vA_{γ^σ}[A^L_{h,i}]) / L = vSP_{γ^σ}[A^L] / L ≤ C/L = C²,

and thus NMSA-2(S^L, C²) = Yes. Again, since vSP_{γ^σ}[A^L] ≤ C and |A^L_{h,i}| ≥ L, we have

V^3_{γ^σ}[A^L] = vSP_{γ^σ}[A^L] / (∑_{h=1}^{k−1} ∑_{i=h+1}^{k} |A^L_{h,i}|) ≤ vSP_{γ^σ}[A^L] / (∑_{h=1}^{k−1} ∑_{i=h+1}^{k} L) = vSP_{γ^σ}[A^L] / ((k choose 2) L) ≤ C / ((k choose 2) L) = C³,

and then NMSA-3(S^L, C³) = Yes.

Therefore, if C ≥ k²MG, then MSA(S, C) = NMSA-z(S^L, C^z) = Yes for each z. ∎

Lemma 3.3.

There exists a canonical alignment of S^L which is V^z_{γ^σ}-optimal for each z.

Proof.

Suppose by contradiction that no canonical alignment of S^L is V^z_{γ^σ}-optimal. Let A = [s′_1,…,s′_k] be a V^z_{γ^σ}-optimal alignment of S^L with maximum tail length and, among those, with the maximum number of symbols σ in the tail base. Note that, by hypothesis, A is not canonical.

Let q be the index of the tail base of A. Since A is not canonical, the column q contains only symbols - and σ. Let p be the greatest index such that p < q and there exists an integer i where s′_i(p) = σ and s′_i(q) = -. Let A′ = [s″_1,…,s″_k] be an alignment of S^L such that A′ is almost the same as A, except for columns p and q, which are defined as follows. For each h, we have

s″_h = s′_h, if s′_h(p) ≠ σ or s′_h(q) ≠ -, and
s″_h = s′_h(1:p−1) · - · s′_h(p+1:q−1) · σ · s′_h(q+1:|s′_h|), otherwise.

Observe that either the tail length of A′ is greater than the tail length of A, or the tail lengths of A and A′ are the same but the number of symbols σ in the tail base of A′ is greater than in A. Thus, by the choice of A, the alignment A′ is not V^z_{γ^σ}-optimal.

Let h, i be integers. We classify the induced alignment A_{h,i} of A as follows:

• Type 1: if vA_{γ^σ}[A_{h,i}] = vA_{γ^σ}[A′_{h,i}] and |A_{h,i}| = |A′_{h,i}|;

• Type 2: if vA_{γ^σ}[A_{h,i}] ≠ vA_{γ^σ}[A′_{h,i}] and |A_{h,i}| = |A′_{h,i}|. In this case, the only possibility is that one of the sequences, say s′_h, is such that s′_h(p) = σ and s′_h(q) = -, and the other, s′_i, is such that s′_i(p) = x and s′_i(q) = σ, where x ∈ Σ. By hypothesis, G ≥ γ_{x→-} = γ_{-→x}. Then,

vA_{γ^σ}[A′_{h,i}] = vA_{γ^σ}[A_{h,i}] − G + γ_{x→-} (= γ_{-→x}) ≤ vA_{γ^σ}[A_{h,i}].

Therefore, vA_{γ^σ}[A′_{h,i}] ≤ vA_{γ^σ}[A_{h,i}];

• Type 3: if |A_{h,i}| ≠ |A′_{h,i}|. In this case, the only possibility is that one of the sequences, say s′_h, is such that s′_h(p) = σ and s′_h(q) = -, and the other, s′_i, is such that s′_i(p) = - and s′_i(q) = σ. It follows that

vA_{γ^σ}[A′_{h,i}] = vA_{γ^σ}[A_{h,i}] − 2G   and   |A_{h,i}| = |A′_{h,i}| + 1.

We consider now the case z = 1. Suppose that |A′| = |A|. Analyzing the types above, vA_{γ^σ}[A′_{h,i}] ≤ vA_{γ^σ}[A_{h,i}] for every pair h, i, which implies that vSP_{γ^σ}[A′] ≤ vSP_{γ^σ}[A] and V^1_{γ^σ}[A′] = vSP_{γ^σ}[A′]/|A′| ≤ vSP_{γ^σ}[A]/|A| = V^1_{γ^σ}[A], which contradicts the fact that A′ is not V^1_{γ^σ}-optimal. Then, we assume |A′| ≠ |A|, which implies that |A′| = |A| − 1 and at least one alignment A_{h,i} is of type 3, meaning that vA_{γ^σ}[A′_{h,i}] = vA_{γ^σ}[A_{h,i}] − 2G. It follows that vSP_{γ^σ}[A′] ≤ vSP_{γ^σ}[A] − 2G. Then,

V^1_{γ^σ}[A′] = vSP_{γ^σ}[A′] / |A′| ≤ (vSP_{γ^σ}[A] − 2G) / (|A| − 1).    (14)

Let B be a canonical alignment. By Proposition 3.1, we have that vSP_{γ^σ}[B] ≤ k²MG. By the choice of A, we have V^1_{γ^σ}[A] ≤ V^1_{γ^σ}[B]. Since G ≥ 1 and |B| ≥ L = Nk²MG, then

V^1_{γ^σ}[A] ≤ V^1_{γ^σ}[B] = vSP_{γ^σ}[B] / |B| ≤ k²MG / (Nk²MG) = 1/N ≤ G.

Since vSP_{γ^σ}[A]/|A| ≤ G, we have (vSP_{γ^σ}[A] − G)/(|A| − 1) ≤ vSP_{γ^σ}[A]/|A|, which implies, by equation (14), G ≥ 1 and the definition of V^1_{γ^σ}[A], that

V^1_{γ^σ}[A′] ≤ (vSP_{γ^σ}[A] − 2G)/(|A| − 1) ≤ (vSP_{γ^σ}[A] − G)/(|A| − 1) ≤ vSP_{γ^σ}[A]/|A| = V^1_{γ^σ}[A],

which again contradicts the fact that A′ is not V^1_{γ^σ}-optimal. Thus, there exists a canonical alignment of S^L which is V^1_{γ^σ}-optimal.

Now, we consider the case z = 2. If an induced alignment A_{h,i} is of type 1 or 2, then vA_{γ^σ}[A′_{h,i}] ≤ vA_{γ^σ}[A_{h,i}] and |A′_{h,i}| = |A_{h,i}|, which implies that

vN_{γ^σ}[A′_{h,i}] = vA_{γ^σ}[A′_{h,i}] / |A′_{h,i}| ≤ vA_{γ^σ}[A_{h,i}] / |A_{h,i}| = vN_{γ^σ}[A_{h,i}].    (15)

If A_{h,i} is of type 3, then vA_{γ^σ}[A′_{h,i}] = vA_{γ^σ}[A_{h,i}] − 2G and |A′_{h,i}| = |A_{h,i}| − 1, which implies that

vN_{γ^σ}[A′_{h,i}] = vA_{γ^σ}[A′_{h,i}] / |A′_{h,i}| = (vA_{γ^σ}[A_{h,i}] − 2G) / (|A_{h,i}| − 1) ≤ (vA_{γ^σ}[A_{h,i}] − G) / (|A_{h,i}| − 1) ≤ vA_{γ^σ}[A_{h,i}] / |A_{h,i}| ≤ vN_{γ^σ}[A_{h,i}],    (16)

where the first inequality is a consequence of G ≥ 1 and the second, since G is the maximum value in γ^σ and therefore an upper bound on vN_{γ^σ}[A_{h,i}], is a consequence of vA_{γ^σ}[A_{h,i}] / |A_{h,i}| ≤ G. As a consequence of equations (15) and (16), we have that V^2_{γ^σ}[A′] ≤ V^2_{γ^σ}[A], contradicting the fact that A′ is not V^2_{γ^σ}-optimal. Thus, there exists a canonical alignment of S^L which is V^2_{γ^σ}-optimal.

Finally, we show the case when z=3z=3. We denote TjT_{j} the set of all pairs (h,i)(h,i) such that A{h,i}A_{\{h,i\}} is of type jj. Recall that each induced alignment A{h,i}A_{\{h,i\}} of types 1 and 2 are such that vAγσ[A{h,i}]vAγσ[A{h,i}]v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}^{\prime}]\leq v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}] and |A{h,i}|=|A{h,i}||A_{\{h,i\}}^{\prime}|=|A_{\{h,i\}}|. Thus, the total contribution of the induced alignments of types 1 and 2 to the Vγσ3{\mathrm{V}^{3}_{\!\gamma^{\sigma}}}-score is

(h,i)T1T2vAγσ[A{h,i}](h,i)T1T2vAγσ[A{h,i}]and(h,i)T1T2|A{h,i}|=(h,i)T1T2|A{h,i}|.\sum_{\mathclap{(h,i)\in T_{1}\cup T_{2}}}v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}^{\prime}]\,\leq\quad\sum_{\mathclap{(h,i)\in T_{1}\cup T_{2}}}v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}]\qquad\text{and}\qquad\sum_{\mathclap{(h,i)\in T_{1}\cup T_{2}}}|A_{\{h,i\}}^{\prime}|\,=\quad\sum_{\mathclap{(h,i)\in T_{1}\cup T_{2}}}|A_{\{h,i\}}|\,.

And since each alignment A{h,i}A_{\{h,i\}} of type 3 is such that vAγσ[A{h,i}]=vAγσ[A{h,i}]2Gv{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}^{\prime}]=v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}]-2G and |A{h,i}|=|A{h,i}|1|A_{\{h,i\}}^{\prime}|=|A_{\{h,i\}}|-1, we have

(h,i)T3vAγσ[A{h,i}]=(h,i)T3(vAγσ[A{h,i}]2G)and(h,i)T3|A{h,i}|=(h,i)T3(|A{h,i}|1).\sum_{\mathclap{(h,i)\in T_{3}}}v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}^{\prime}]\;=\;\sum_{\mathclap{(h,i)\in T_{3}}}\big{(}v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}]-2G\big{)}\qquad\text{and}\qquad\sum_{\mathclap{(h,i)\in T_{3}}}|A_{\{h,i\}}^{\prime}|\;=\;\sum_{\mathclap{(h,i)\in T_{3}}}\big{(}|A_{\{h,i\}}|-1\big{)}\,.

It follows that

Vγσ3[A]\displaystyle{\mathrm{V}^{3}_{\!\gamma^{\sigma}}}[A^{\prime}] =vSPγσ[A]|A|=(h,i)T1T2vAγσ[A{h,i}]+(h,i)T3vAγσ[A{h,i}](h,i)T1T2|A{h,i}|+(h,i)T3|A{h,i}|(h,i)T1T2vAγσ[A{h,i}]+(h,i)T3(vAγσ[A{h,i}]2G)(h,i)T1T2|A{h,i}|+(h,i)T3(|A{h,i}|1)\displaystyle=\frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{\prime}]}{\left|A^{\prime}\right|}\;=\quad\frac{\displaystyle\sum_{\mathclap{(h,i)\in T_{1}\cup T_{2}}}v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}^{\prime}]+\sum_{\mathclap{(h,i)\in T_{3}}}v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}^{\prime}]}{\displaystyle\sum_{\mathclap{(h,i)\in T_{1}\cup T_{2}}}|A_{\{h,i\}}^{\prime}|+\sum_{\mathclap{(h,i)\in T_{3}}}|A_{\{h,i\}}^{\prime}|}\leq\;\quad\frac{\displaystyle\sum_{\mathclap{(h,i)\in T_{1}\cup T_{2}}}v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}]+\sum_{\mathclap{(h,i)\in T_{3}}}\big{(}v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}]-2G\big{)}}{\displaystyle\sum_{\mathclap{(h,i)\in T_{1}\cup T_{2}}}|A_{\{h,i\}}|+\sum_{\mathclap{(h,i)\in T_{3}}}\big{(}|A_{\{h,i\}}|-1\big{)}}
=vSPγσ[A]2|T3|G(h=1k1i=h+1k|A{h,i}|)|T3|vSPγσ[A]|T3|G(h=1k1i=h+1k|A{h,i}|)|T3|vSPγσ[A]h=1k1i=h+1k|A{h,i}|=Vγσ3[A],\displaystyle=\frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A]-2\left|T_{3}\right|G}{\left(\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}\left|A_{\{h,i\}}\right|\right)-\left|T_{3}\right|}\leq\frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A]-\left|T_{3}\right|G}{\left(\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}\left|A_{\{h,i\}}\right|\right)-\left|T_{3}\right|}\leq\frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A]}{\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}\left|A_{\{h,i\}}\right|}={\mathrm{V}^{3}_{\!\gamma^{\sigma}}}[A]\,,

where the second inequality is a consequence of G\geq 1, and the last inequality is a consequence of {\mathrm{V}^{3}_{\!\gamma^{\sigma}}}[A]=v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A]/\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}|A_{\{h,i\}}|\leq G, which holds since G is the maximum value in \gamma^{\sigma} and is therefore an upper bound on {\mathrm{V}^{3}_{\!\gamma^{\sigma}}}[A]. Thus, {\mathrm{V}^{3}_{\!\gamma^{\sigma}}}[A^{\prime}]\leq{\mathrm{V}^{3}_{\!\gamma^{\sigma}}}[A], which contradicts the assumption that A^{\prime} is not {\mathrm{V}^{3}_{\!\gamma^{\sigma}}}-optimal. Therefore, there exists a canonical alignment of S^{L} which is {\mathrm{V}^{3}_{\!\gamma^{\sigma}}}-optimal. ∎

Theorem 3.4.

\mathbf{NMSA}\text{-}z is NP-complete for each z\in\{1,2,3\}.

Proof.

Given a k-sequence S, an alignment A of S and an integer C, it is easy to check, in time polynomial in the length of A, that {\mathrm{V}^{z}_{\!\gamma}}[A]\leq C for z\in\{1,2,3\}. Then, \mathbf{NMSA}\text{-}z is in NP.

Now, we prove that \mathbf{MSA}(S,C)=\textsf{Yes} if and only if \mathbf{NMSA}\text{-}z(S^{L},C^{z})=\textsf{Yes} for each z\in\{1,2,3\}. If C\geq k^{2}MG, the theorem follows from Lemma 3.2. Thus, we assume C<k^{2}MG.

Suppose that 𝐌𝐒𝐀(S,C)=Yes\mathbf{MSA}(S,C)=\textsf{Yes} and, hence, there exists an alignment AA such that vSPγ[A]Cv{\mathrm{SP}_{\!\gamma}}[A]\leq C:

{\mathrm{V}^{1}_{\!\gamma^{\sigma}}}[A^{L}] = \frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]}{|A^{L}|} \leq \frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]}{L} = \frac{v{\mathrm{SP}_{\!\gamma}}[A]}{L} \leq \frac{C}{L} = C^{1}\,,
{\mathrm{V}^{2}_{\!\gamma^{\sigma}}}[A^{L}] = \sum_{h=1}^{k-1}\sum_{i=h+1}^{k} \frac{v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}^{L}]}{|A_{\{h,i\}}^{L}|} \leq \sum_{h=1}^{k-1}\sum_{i=h+1}^{k} \frac{v{\mathrm{A}_{\gamma^{\sigma}}}[A_{\{h,i\}}^{L}]}{L} = \frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]}{L} = \frac{v{\mathrm{SP}_{\!\gamma}}[A]}{L} \leq \frac{C}{L} = C^{2}\,,
{\mathrm{V}^{3}_{\!\gamma^{\sigma}}}[A^{L}] = \frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]}{\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}|A_{\{h,i\}}^{L}|} \leq \frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]}{\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}L} = \frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]}{\binom{k}{2}L} = \frac{v{\mathrm{SP}_{\!\gamma}}[A]}{\binom{k}{2}L} \leq \frac{C}{\binom{k}{2}L} = C^{3}\,,

where the first inequality in each equation follows from the fact that A^{L} and each alignment induced by A^{L} have length at least L, and the second inequality follows from v{\mathrm{SP}_{\!\gamma}}[A]\leq C. Thus, if \mathbf{MSA}(S,C)=\textsf{Yes} then \mathbf{NMSA}\text{-}z(S^{L},C^{z})=\textsf{Yes}.

Conversely, suppose that 𝐍𝐌𝐒𝐀-z(SL,Cz)=Yes\mathbf{NMSA}\text{-}z(S^{L},C^{z})=\textsf{Yes}. It follows from Lemma 3.3 that there exists a canonical alignment ALA^{L} of SLS^{L} such that Vγσz[AL]Cz{\mathrm{V}^{z}_{\!\gamma^{\sigma}}}[A^{L}]\leq C^{z} for each zz. Thus, considering Vγσ1[AL]C1{\mathrm{V}^{1}_{\!\gamma^{\sigma}}}[A^{L}]\leq C^{1}, we have

vSPγ[A]\displaystyle v{\mathrm{SP}_{\!\gamma}}[A] =vSPγσ[AL]=(N+L)vSPγσ[AL]N+L(N+L)vSPγσ[AL]|AL|=(N+L)Vγσ1[AL]\displaystyle=v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]=(N+L)\,\frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]}{N+L}\leq(N+L)\,\frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]}{|A^{L}|}=(N+L)\,{\mathrm{V}^{1}_{\!\gamma^{\sigma}}}[A^{L}]
(N+L)C1=(N+L)CL=NCL+C<Nk2MGL+C=1+C,\displaystyle\leq(N+L)\,C^{1}=(N+L)\,\frac{C}{L}=\frac{NC}{L}+C<\frac{Nk^{2}MG}{L}+C=1+C\,,

where the first equality holds since ALA^{L} is canonical, the first inequality holds since |AL|N+L|A^{L}|\leq N+L and the second and the third inequalities hold by hypothesis. Considering Vγσ2[AL]C2{\mathrm{V}^{2}_{\!\gamma^{\sigma}}}[A^{L}]\leq C^{2}, we have

vSPγ[A]\displaystyle v{\mathrm{SP}_{\!\gamma}}[A] =vSPγσ[AL]=(N+L)vSPγσ[AL]N+L=(N+L)h=1k1i=h+1kvAγσ[A{h,i}L]N+L\displaystyle=v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]=(N+L)\,\frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]}{N+L}=(N+L)\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}\frac{v{\mathrm{A}_{\!\gamma^{\sigma}}}[A_{\{h,i\}}^{L}]}{N+L}
(N+L)h=1k1i=h+1kvAγσ[A{h,i}L]|A{h,i}L|=(N+L)Vγσ2[AL]\displaystyle\leq(N+L)\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}\frac{v{\mathrm{A}_{\!\gamma^{\sigma}}}[A_{\{h,i\}}^{L}]}{|A_{\{h,i\}}^{L}|}=(N+L)\,{\mathrm{V}^{2}_{\!\gamma^{\sigma}}}[A^{L}]
(N+L)C2=(N+L)CL=NCL+C<Nk2MGL+C=1+C,\displaystyle\leq(N+L)\,C^{2}=(N+L)\,\frac{C}{L}=\frac{NC}{L}+C<\frac{Nk^{2}MG}{L}+C=1+C\,,

where the first equality holds since ALA^{L} is canonical, the first inequality holds since, for each h,ih,i, |A{h,i}L|N+L|A_{\{h,i\}}^{L}|\leq N+L, and the second and the third inequalities hold by hypothesis. And finally, considering Vγσ3[AL]C3{\mathrm{V}^{3}_{\!\gamma^{\sigma}}}[A^{L}]\leq C^{3}, we have

vSPγ[A]\displaystyle v{\mathrm{SP}_{\!\gamma}}[A] =vSPγσ[AL]=(N+(k2)L)vSPγσ[AL]N+(k2)L\displaystyle=v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]=\big{(}N+{\textstyle\binom{k}{2}}L\big{)}\,\frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]}{N+{\textstyle\binom{k}{2}}L}
(N+(k2)L)vSPγσ[AL]h=1k1i=h+1k|A{h,i}L|=(N+(k2)L)Vγσ3[AL](N+(k2)L)C3\displaystyle\leq\big{(}N+{\textstyle\binom{k}{2}}L\big{)}\,\frac{v{\mathrm{SP}_{\!\gamma^{\sigma}}}[A^{L}]}{\displaystyle\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}\big{|}A_{\{h,i\}}^{L}\big{|}}=\big{(}N+{\textstyle\binom{k}{2}}L\big{)}\,{\mathrm{V}^{3}_{\!\gamma^{\sigma}}}[A^{L}]\leq\big{(}N+{\textstyle\binom{k}{2}}L\big{)}\,C^{3}
=(N+(k2)L)C(k2)L=NC(k2)L+C<Nk2MG(k2)L+C=1(k2)+C1+C,\displaystyle=\big{(}N+{\textstyle\binom{k}{2}}L\big{)}\,\frac{C}{\binom{k}{2}L}=\frac{NC}{\binom{k}{2}L}+C<\frac{Nk^{2}MG}{\binom{k}{2}L}+C=\frac{1}{\binom{k}{2}}+C\leq 1+C\,,

where the first equality holds since A^{L} is canonical, the first inequality holds since the sum of the lengths of the pairwise alignments induced by a canonical alignment is at most N+\binom{k}{2}L, and the second and third inequalities hold by hypothesis. Therefore, if \mathbf{NMSA}\text{-}z(S^{L},C^{z})=\textsf{Yes} then v{\mathrm{SP}_{\!\gamma}}[A]<1+C for any z\in\{1,2,3\}. Since the entries in the scoring matrix are integers, v{\mathrm{SP}_{\!\gamma}}[A] is an integer, and since C is also an integer, it follows that v{\mathrm{SP}_{\!\gamma}}[A]\leq C. ∎

4 Exact algorithms

In the following sections we describe exact dynamic programming algorithms for 𝐍𝐌𝐒𝐀-z\mathbf{NMSA}\text{-}z, with z=1,2,3z=1,2,3.

4.1 𝐍𝐌𝐒𝐀-1\mathbf{NMSA}\text{-}1

Let S=s1,,skS=s_{1},\ldots,s_{k} be a kk-sequence and A=[s1,,sk]A=[s^{\prime}_{1},\ldots,s^{\prime}_{k}] be an alignment of SS. As defined in Equation (9), Vγ1[A]{\mathrm{V}^{1}_{\!\gamma}}[A] takes into account the length of AA, and the optimal function is given by optVγ1(S)=minA𝒜S{Vγ1[A]}\mathrm{opt}{\mathrm{V}^{1}_{\!\gamma}}(S)=\min_{A\in\mathcal{A}_{S}}\big{\{}{\mathrm{V}^{1}_{\!\gamma}}[A]\big{\}}. In the optimization version of 𝐍𝐌𝐒𝐀-1\mathbf{NMSA}\text{-}1, we are given a kk-sequence SS and we want to compute optVγ1(S)\mathrm{opt}{\mathrm{V}^{1}_{\!\gamma}}(S) for a fixed matrix γ\gamma. We can solve 𝐍𝐌𝐒𝐀-1\mathbf{NMSA}\text{-}1 by calculating the minimum SP-score considering every possible length of an alignment. That is, we compute the entries of a table DD indexed by Vn×{0,1,,N}V_{\vec{n}}\times\{0,1,\ldots,N\}, where n=[|s1|,,|sk|]\vec{n}=[\left|s_{1}\right|,\ldots,\left|s_{k}\right|] and N=i=1k|si|N=\sum_{i=1}^{k}|s_{i}|. The entry D(v,L)D(\vec{v},L) stores the score of an alignment of S(v)S(\vec{v}) of length LL with minimum SP-score. Notice that D(0,0)=0D(\vec{0},0)=0, D(v0,0)=D(0,L0)=D(\vec{v}\not=\vec{0},0)=D(\vec{0},L\not=0)=\infty. Therefore, the table entries can be calculated as:

D(v,L)={0,​​​​if v=0,L=0,,​​​​if v=0,L0 or v0,L=0,minbk,bv{D(vb,L1)+vSPγ[bS(v)]},​​​​otherwise.D(\vec{v},L)\!=\!\left\{\!\!\begin{array}[]{ll}\!0\,,&\mbox{\!\!\!\!if $\vec{v}\!=\!\vec{0},L\!=\!0$}\,,\\ \!\infty\,,&\mbox{\!\!\!\!if $\vec{v}\!=\!\vec{0},L\!\not=\!0$ or $\vec{v}\!\not=\!\vec{0},L\!=\!0$}\,,\\ \!\min_{\vec{b}\in\mathcal{B}_{k},\vec{b}\leq\vec{v}}\big{\{}D(\vec{v}-\vec{b},L-1)+v{\mathrm{SP}_{\!\gamma}}[\vec{b}\cdot S(\vec{v})]\big{\}},&\mbox{\!\!\!\!otherwise}\,.\end{array}\right.

Table DD is computed for all possible values of L=0,,NL=0,\ldots,N. Consequently, optVγ1(S)=minL{D(n,L)/L}\mathrm{opt}{\mathrm{V}^{1}_{\!\gamma}}(S)=\min_{L}\left\{D(\vec{n},L)/L\right\} is returned. Algorithm 2 describes this procedure more precisely.

Algorithm 2
0:kk-sequence S=s1,,skS=s_{1},\ldots,s_{k} such that ni=|si|n_{i}=\left|s_{i}\right|
0:optVγ1(S)\mathrm{opt}{\mathrm{V}^{1}_{\!\gamma}}(S)
1:D(0,0)0D(\vec{0},0)\leftarrow 0
2:for each L0L\not=0 do D(0,L)D(\vec{0},L)\leftarrow\infty
3:for each v0\vec{v}\not=\vec{0} do D(v,0)D(\vec{v},0)\leftarrow\infty
4:for each 0<vn\vec{0}<\vec{v}\leq\vec{n} in lexicographical order do
5:  for each L1,2,,NL\leftarrow 1,2,\ldots,N do
6:   D(v,L)minbk,bv{D(vb,L1)+vSPγ[bS(v)]}D(\vec{v},L)\leftarrow\min_{\vec{b}\in\mathcal{B}^{{k}},\vec{b}\leq\vec{v}}\big{\{}D(\vec{v}-\vec{b},L-1)+v{\mathrm{SP}_{\!\gamma}}[\vec{b}\cdot S(\vec{v})]\big{\}}
7:return  minL{D(n,L)/L}\min_{L}\big{\{}D(\vec{n},L)/L\big{\}}

Suppose that ni=|si|=nn_{i}=\left|s_{i}\right|=n for each ii which implies that |Vn|=(n+1)k\left|V_{n}\right|=(n+1)^{k} and N=nkN=nk. Notice that the space to store the matrix DD is Θ(N|Vn|=kn(n+1)k)\Theta(N\cdot\left|V_{n}\right|=kn\cdot(n+1)^{k}). The time consumption of Algorithm 2 corresponds to the time needed to fill the table DD up, plus the running time of line 7. Each entry of DD can be computed in O((2k1)(k2))=O(2kk2)O((2^{k}-1)\cdot{k\choose 2})=O(2^{k}k^{2})-time. Therefore, the algorithm spends O(2kk2kn(n+1)k)=O(2kk3(n+1)k+1)O(2^{k}k^{2}\cdot kn(n+1)^{k})=O(2^{k}k^{3}(n+1)^{k+1})-time to compute the entire table DD. Line 7 is computed in Θ(N=kn)\Theta(N=kn)-time. Therefore, the running time of Algorithm 2 is O(2kk3(n+1)k+1)+Θ(N)=O(2kk3(n+1)k+1)O(2^{k}k^{3}(n+1)^{k+1})+\Theta(N)=O(2^{k}k^{3}(n+1)^{k+1}).
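To make the discussion above concrete, the following Python sketch fills the table D exactly as in Algorithm 2 and is intended only for small instances, given the exponential running time just analyzed. The function name nmsa1 and the representation of the scoring matrix as a dictionary gamma indexed by ordered symbol pairs, with a space written as '-' and gamma[('-', '-')] assumed to be 0, are illustrative choices and not part of the original formulation.

from itertools import product

def nmsa1(seqs, gamma):
    # Sketch of Algorithm 2: D[(v, L)] is the minimum SP-score of an
    # alignment of the prefixes S(v) that uses exactly L columns.
    k = len(seqs)
    n = tuple(len(s) for s in seqs)
    N = sum(n)
    INF = float('inf')
    zero = tuple([0] * k)

    def col_score(col):
        # SP-score of a single column: sum of gamma over all pairs of rows
        return sum(gamma[(col[h], col[i])]
                   for h in range(k) for i in range(h + 1, k))

    D = {(zero, 0): 0}                 # D(0, 0) = 0; missing entries stand for infinity
    for v in product(*(range(x + 1) for x in n)):   # lexicographic order
        if v == zero:
            continue
        for L in range(1, N + 1):
            best = INF
            for b in product((0, 1), repeat=k):     # pattern of the last column
                if sum(b) == 0 or any(b[i] > v[i] for i in range(k)):
                    continue
                prev = D.get((tuple(v[i] - b[i] for i in range(k)), L - 1), INF)
                if prev == INF:
                    continue
                col = tuple(seqs[i][v[i] - 1] if b[i] else '-' for i in range(k))
                best = min(best, prev + col_score(col))
            if best < INF:
                D[(v, L)] = best

    return min(D[(n, L)] / L for L in range(1, N + 1) if (n, L) in D)

The lexicographic enumeration of the vectors v guarantees that every entry D(v - b, L - 1) is already available when D(v, L) is computed, mirroring lines 4–6 of Algorithm 2.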

4.2 𝐍𝐌𝐒𝐀-2\mathbf{NMSA}\text{-}2

Let S=s1,,skS=s_{1},\ldots,s_{k} be a kk-sequence and A=[s1,,sk]A=[s^{\prime}_{1},\ldots,s^{\prime}_{k}] be an alignment of SS. As defined in Equation (10), Vγ2[A]{\mathrm{V}^{2}_{\!\gamma}}[A] takes into account the lengths of the induced alignments in AA. In the optimization version of 𝐍𝐌𝐒𝐀-2\mathbf{NMSA}\text{-}2, we are given a kk-sequence SS and we want to compute optVγ2(S)\mathrm{opt}{\mathrm{V}^{2}_{\!\gamma}}(S) for a fixed scoring matrix γ\gamma.

Let L=[L12,,Lhi,,L(k1)k]\vec{L}=[L_{12},\ldots,L_{hi},\ldots,L_{(k-1)k}] be a (k2){k\choose 2}-vector indexed by the set of pairs of integers {h,i}\{h,i\} such that 1h<ik1\leq h<i\leq k and LhiL_{hi} denotes the element of L\vec{L} of index {h,i}\{h,i\}. The lengths of the induced alignments can be represented by a vector L\vec{L}. Thus, if AA is an alignment and |A{h,i}|=Lhi|A_{\{h,i\}}|=L_{hi} for each pair h,ih,i, we say that L\vec{L} is the induced length of AA. For a kk-sequence S=s1,,skS=s_{1},\ldots,s_{k}, where ni=|si|n_{i}=\left|s_{i}\right| for each ii, we define

𝕃={L=[L12,L13,,L1k,L23,L2k,,L(k1)k]:0Lhinh+ni}.\mathbb{L}=\big{\{}\vec{L}=[{L_{12},L_{13},\ldots,L_{1k},L_{23},\ldots L_{2k},\ldots,L_{(k-1)k}}]:0\leq L_{hi}\leq n_{h}+n_{i}\big{\}}.

Therefore, 𝕃\mathbb{L} contains the induced length of alignment AA of SS for all A𝒜SA\in\mathcal{A}_{S}. Note that if nn is the length of each sequence in SS, then |𝕃|=(2n+1)(k2)\left|\mathbb{L}\right|=(2n+1)^{k\choose 2}. Let b=[b1,,bk]\vec{b}=[b_{1},\ldots,b_{k}] be a kk-vector of bits. Overloading the minus operator “-”, we define Lb\vec{L}-\vec{b} to be a (k2){k\choose 2}-vector L\vec{L^{\prime}} such that

Lhi={Lhi,if bh=bi=0,Lhi1,otherwise.L^{\prime}_{hi}=\left\{\begin{array}[]{ll}L_{hi}\,,&\mbox{if $b_{h}=b_{i}=0$}\,,\\ L_{hi}-1\,,&\mbox{otherwise}\,.\end{array}\right.

Observe that if L\vec{L} is the induced length of an alignment AA of S(v)S(\vec{v}) and b\vec{b} is a kk-vector of bits such that bS(v)\vec{b}\cdot S(\vec{v}) is the last column of AA, then L=Lb\vec{L^{\prime}}=\vec{L}-\vec{b} is the induced length of the alignment A(1:|A|1)A(1\!:\!\left|A\right|-1).
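As a small illustration, the overloaded operations on induced-length vectors can be written as follows; representing \vec{L} as a Python dictionary keyed by the pairs (h,i) with h<i is an assumption of this sketch rather than notation from the text.

def vec_minus(L, b):
    # L - b: entry (h, i) decreases by one unless b_h = b_i = 0
    return {(h, i): (Lhi if b[h] == 0 and b[i] == 0 else Lhi - 1)
            for (h, i), Lhi in L.items()}

def leq(b, L):
    # b <= L in the overloaded sense: every entry of L - b is nonnegative
    return all(x >= 0 for x in vec_minus(L, b).values())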

From a (k2)k\choose 2-vector L\vec{L} and a scoring matrix γ\gamma, we can define an array of (k2)k\choose 2 scoring matrices γ=γ×L=[,γ(hi),]\vec{\gamma}=\gamma\times\vec{L}=[\ldots,\gamma^{(hi)},\ldots] indexed by {1,k}×{1,,k}\{1,\ldots k\}\times\{1,\ldots,k\} such that

γ(hi)𝚊𝚋=γ𝚊𝚋Lhi,{\gamma^{(hi)}}_{{\mathtt{a}}\rightarrow{\mathtt{b}}}=\frac{{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{b}}}}{L_{hi}},

for each 𝚊,𝚋Σ-\mathtt{a},\mathtt{b}\in\Sigma_{\texttt{-}}. Observe that

{\mathrm{V}^{2}_{\!\gamma}}[A] = \sum_{h=1}^{k-1}\sum_{i=h+1}^{k} v{\mathrm{N}_{\gamma}}[A_{\{h,i\}}] = \sum_{h=1}^{k-1}\sum_{i=h+1}^{k} \frac{v{\mathrm{A}_{\gamma}}[A_{\{h,i\}}]}{|A_{\{h,i\}}|} = \sum_{h=1}^{k-1}\sum_{i=h+1}^{k} v{\mathrm{A}_{\gamma^{(hi)}}}[A_{\{h,i\}}] = v{\mathrm{SP}_{\!\vec{\gamma}}}[A]

where γ=γ×[|A{1,2}|,,|A{k1,k}|]\vec{\gamma}=\gamma\times[\left|A_{\{1,2\}}\right|,\ldots,\left|A_{\{k-1,k\}}\right|]. Besides, if \vec{\mathcal{L}} is the induced length of a Vγ2{\mathrm{V}^{2}_{\!\gamma}}-optimal alignment of SS, then we can compute optVγ2\mathrm{opt}{\mathrm{V}^{2}_{\!\gamma}} through the recurrence

D(v,L)={0,if v=0 and L=0,,if v=0 and L0,,if v0 and L=0,minbk,bv,bL{D(vb,Lb)+vSPγ[bS(v)]},otherwise,\displaystyle D_{\vec{\mathcal{L}}}(\vec{v},\vec{L})=\left\{\begin{array}[]{ll}0\,,&\mbox{if $\vec{v}=\vec{0}$ and $\vec{L}=\vec{0}$}\,,\\ \infty\,,&\mbox{if $\vec{v}=\vec{0}$ and $\vec{L}\not=\vec{0}$}\,,\\ \infty\,,&\mbox{if $\vec{v}\not=\vec{0}$ and $\vec{L}=\vec{0}$}\,,\\ \displaystyle\min_{\vec{b}\in\mathcal{B}_{k},\vec{b}\leq\vec{v},\vec{b}\leq\vec{L}}\big{\{}D_{\vec{\mathcal{L}}}(\vec{v}-\vec{b},\vec{L}-\vec{b})+v{\mathrm{SP}_{\!\vec{\gamma}}}[\vec{b}\cdot S(\vec{v})]\big{\}}\,,&\mbox{otherwise},\end{array}\right.

where \vec{b}\leq\vec{L} is also an overloading, meaning that \vec{L}-\vec{b}\geq\vec{0}, and \vec{\gamma}=\gamma\times\vec{\mathcal{L}}. In this case, \mathrm{opt}{\mathrm{V}^{2}_{\!\gamma}}(S)=D_{\vec{\mathcal{L}}}(\vec{n},\vec{\mathcal{L}}).

Algorithm 3 described below finds optVγ2(S)\mathrm{opt}{\mathrm{V}^{2}_{\!\gamma}}(S).

Algorithm 3
0: A kk-sequence S=s1,,skS=s_{1},\ldots,s_{k} such that ni=|si|n_{i}=\left|s_{i}\right|
0:optVγ2(S)\mathrm{opt}{\mathrm{V}^{2}_{\!\gamma}}(S)
1:for each 𝕃\vec{\mathcal{L}}\in\vec{\mathbb{L}} do
2:  D(0,0)0D_{\vec{\mathcal{L}}}(\vec{0},\vec{0})\leftarrow 0
3:  for each L0\vec{L}\not=\vec{0} do D(0,L)D_{\vec{\mathcal{L}}}(\vec{0},\vec{L})\leftarrow\infty
4:  for each v0\vec{v}\not=\vec{0} do D(v,0)D_{\vec{\mathcal{L}}}(\vec{v},\vec{0})\leftarrow\infty
5:  γγ×\vec{\gamma}\leftarrow\gamma\times\vec{\mathcal{L}}
6:  for each 0<vn\vec{0}<\vec{v}\leq\vec{n} in lexicographical order do
7:   for each L0\vec{L}\not=\vec{0} in lexicographical order do
8:    D(v,L)=minbk,bv,bL{D(vb,Lb)+vSPγ[bS(v)]}D_{\vec{\mathcal{L}}}(\vec{v},\vec{L})=\min_{\vec{b}\in\mathcal{B}^{{k}},\vec{b}\leq\vec{v},\vec{b}\leq\vec{L}}\big{\{}D_{\vec{\mathcal{L}}}(\vec{v}-\vec{b},\vec{L}-\vec{b})+v{\mathrm{SP}_{\!\vec{\gamma}}}[\vec{b}\cdot S(\vec{v})]\big{\}}
9:return  min𝕃{D(n,)}\min_{\vec{\mathcal{L}}\in\vec{\mathbb{L}}}\{D_{\vec{\mathcal{L}}}(\vec{n},\vec{\mathcal{L}})\}

For kk sequences of length nn, Algorithm 3 needs (2n+1)(k2)(n+1)k(2n+1)^{k\choose 2}\cdot(n+1)^{k} space to store the table DD_{\vec{\mathcal{L}}}. For each of the (2n+1)(k2)(2n+1)^{k\choose 2} values 𝕃\vec{\mathcal{L}}\in\vec{\mathbb{L}}, table DD_{\vec{\mathcal{L}}} is recalculated. Since the computation of each entry takes O(2kk2)O(2^{k}k^{2})-time, the total time is

O(2kk2(2n+1)(k2)(2n+1)(k2)(n+1)k=(1+12n+1)k(2n+1)k2k2).O\left(2^{k}k^{2}\cdot(2n+1)^{k\choose 2}\cdot(2n+1)^{k\choose 2}(n+1)^{k}=\Big{(}1+\frac{1}{2n+1}\Big{)}^{k}(2n+1)^{k^{2}}k^{2}\right)\,.

Therefore, if k\leq 2n+1, the total time is O\big((2n+1)^{k^{2}}k^{2}\big), since (1+1/(2n+1))^{k}\leq(1+1/k)^{k}\leq e\leq 2.72 is constant.
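The inner computation of Algorithm 3 for one fixed induced-length vector \vec{\mathcal{L}} can be sketched as follows; the outer minimum over all \vec{\mathcal{L}}\in\vec{\mathbb{L}} is then taken exactly as in line 9. The dictionary representations of \gamma and of \vec{\mathcal{L}} (keyed by pairs (h,i) with h<i), the assumption that every prescribed length is positive and that gamma[('-', '-')] is 0, and the name nmsa2_fixed are illustrative choices of this sketch.

from itertools import product

def nmsa2_fixed(seqs, gamma, calL):
    # D_calL(v, L) as in the recurrence, for one fixed induced-length
    # vector calL; missing entries of D stand for infinity.
    k = len(seqs)
    n = tuple(len(s) for s in seqs)
    pairs = [(h, i) for h in range(k) for i in range(h + 1, k)]
    INF = float('inf')
    zero_v = tuple([0] * k)
    zero_L = tuple([0] * len(pairs))

    def col_score(col):
        # SP-score of one column under the scaled matrices gamma^(hi)
        return sum(gamma[(col[h], col[i])] / calL[(h, i)] for (h, i) in pairs)

    D = {(zero_v, zero_L): 0}
    L_ranges = [range(calL[p] + 1) for p in pairs]

    for v in product(*(range(x + 1) for x in n)):    # lexicographic order
        if v == zero_v:
            continue
        for L in product(*L_ranges):
            best = INF
            for b in product((0, 1), repeat=k):      # pattern of the last column
                if sum(b) == 0 or any(b[i] > v[i] for i in range(k)):
                    continue
                # L - b: entry (h, i) drops by one unless b_h = b_i = 0
                newL, ok = [], True
                for (h, i), Lhi in zip(pairs, L):
                    d = 0 if b[h] == 0 and b[i] == 0 else 1
                    if Lhi - d < 0:
                        ok = False
                        break
                    newL.append(Lhi - d)
                if not ok:
                    continue
                prev = D.get((tuple(v[j] - b[j] for j in range(k)), tuple(newL)), INF)
                if prev == INF:
                    continue
                col = tuple(seqs[j][v[j] - 1] if b[j] else '-' for j in range(k))
                best = min(best, prev + col_score(col))
            if best < INF:
                D[(v, L)] = best

    return D.get((n, tuple(calL[p] for p in pairs)), INF)

When no alignment of S has induced length \vec{\mathcal{L}}, the sketch returns infinity, which is exactly the situation discussed next.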

Existence of an alignment AA for a given L\vec{L}

Consider a 3-sequence S=s_{1},s_{2},s_{3} with length \vec{n}=[4,3,5] and suppose we are interested in an alignment A of S with induced length \vec{\mathcal{L}}=[4,7,5], i.e., |A_{\{1,2\}}|=4, |A_{\{1,3\}}|=7 and |A_{\{2,3\}}|=5. Computing D_{\vec{\mathcal{L}}}(\vec{n},\vec{\mathcal{L}}) as in the previous section reveals that D_{\vec{\mathcal{L}}}(\vec{n},\vec{\mathcal{L}})=\infty, which means that such an alignment does not exist. In fact, |A_{\{1,2\}}|=4 forces all 3 symbols of s_{2} to be aligned with symbols of s_{1}, and |A_{\{2,3\}}|=5 forces all of them to be aligned with symbols of s_{3} as well, which implies that at least 3 symbols of the first sequence are aligned with symbols of the third sequence; thus, |A_{\{1,3\}}|\leq 4+5-3=6. Since computing D_{\vec{\mathcal{L}}}(\vec{n},\vec{\mathcal{L}}) takes exponential time, it would be useful to decide whether there exists an alignment with induced length \vec{\mathcal{L}} before computing D_{\vec{\mathcal{L}}}(\vec{n},\vec{\mathcal{L}}). Thus, an interesting problem arises, which we define below.

Problem 4 (Existence of an alignment that is associated with an induced length).

Given the length \vec{n} of a k-sequence S and a \binom{k}{2}-vector \vec{L}, decide whether there exists an alignment A of S such that \vec{L} is the induced length of A.

We denote Problem 4 by EAIL. Considering \vec{n} the length of the k-sequence S=s_{1},\ldots,s_{k}, an alternative way to represent the instance (\vec{n},\vec{L}) is through a k\times k matrix of integers M indexed by \{1,2,\ldots,k\}\times\{1,2,\ldots,k\}, where

M(h,i)={|sh|if h=i,|sh|+|si|LhiotherwiseM(h,i)=\left\{\begin{array}[]{ll}\left|s_{h}\right|&\mbox{if $h=i$},\\ \left|s_{h}\right|+\left|s_{i}\right|-L_{hi}&\mbox{otherwise}\end{array}\right.

i.e., M(h,i) is the number of symbols of s_{h} aligned with symbols of s_{i}. Thus, we must decide whether there exists a collection of sets c_{1},\ldots,c_{k} such that |c_{h}\cap c_{i}|=M[h,i], where c_{h} is the set of indices j of some alignment A of S with A[h,j]\not=\texttt{-}. This reformulation is precisely the problem known as Recognizing Intersection Patterns (RIP). Notice that the instance of RIP can be obtained in linear time and with no extra space from the instance of EAIL.

Thus, consider an example where \vec{n}=[5,5,5], L_{12}=L_{13}=8 and L_{23}=10. In this case, the answer to EAIL is Yes, since the alignments induced by the following alignment with 11 columns

[s1(1)s1(2)s1(3)s1(4)s1(5)------s2(1)s2(2)---s2(3)s2(4)s2(5)-----s3(1)s3(2)----s3(3)s3(4)s3(5)]\left[\begin{array}[]{ccccccccccc}s_{1}(1)&s_{1}(2)&s_{1}(3)&s_{1}(4)&s_{1}(5)&\texttt{-}&\texttt{-}&\texttt{-}&\texttt{-}&\texttt{-}&\texttt{-}\\ s_{2}(1)&s_{2}(2)&\texttt{-}&\texttt{-}&\texttt{-}&s_{2}(3)&s_{2}(4)&s_{2}(5)&\texttt{-}&\texttt{-}&\texttt{-}\\ \texttt{-}&\texttt{-}&s_{3}(1)&s_{3}(2)&\texttt{-}&\texttt{-}&\texttt{-}&\texttt{-}&s_{3}(3)&s_{3}(4)&s_{3}(5)\end{array}\right]

respect the required restrictions. On the other hand, in the alternative RIP formulation of this instance, we want to find a collection of three sets for the matrix

M=123152222503205M=\begin{array}[]{c|ccc}&1&2&3\\ \hline\cr 1&5&2&2\\ 2&2&5&0\\ 3&2&0&5\end{array}

that satisfies the aforementioned property. The answer to RIP is Yes, as we can see in Figure 4.

[Figure: the column indices c_{1},\ldots,c_{11} distributed among three sets whose pairwise intersection sizes match M.]
Figure 4: An instance of RIP for which the answer is Yes.

In another example, suppose that \vec{n}=[5,5,5], L_{12}=L_{13}=7 and L_{23}=10. In this case the answer is No. To check this, suppose by contradiction that there is an alignment A of a 3-sequence s_{1},s_{2},s_{3} for this instance. Since L_{23}=10, the sequences s_{2} and s_{3} have no (5+5-10=0) aligned symbols. Since L_{12}=7, the sequence s_{1} must have 5+5-7=3 symbols aligned with s_{2}, and these symbols cannot be aligned with any symbol of s_{3}, because s_{2} and s_{3} have no aligned symbols. On the other hand, since L_{13}=7, the sequences s_{1} and s_{3} must have 3 aligned symbols. Since s_{1} has 3 symbols aligned with s_{2} but not with s_{3}, and 3 other symbols aligned with s_{3}, it follows that s_{1} has at least 6 symbols, which is a contradiction.
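Translating an EAIL instance into the corresponding RIP matrix is immediate; the following sketch, with the illustrative name rip_matrix and 0-based indices, builds M from \vec{n} and \vec{L} as defined above.

def rip_matrix(n, L):
    # M[h][h] = |s_h| and, for h != i, M[h][i] = |s_h| + |s_i| - L_{hi},
    # the number of symbols of s_h aligned with symbols of s_i.
    # n is the list of sequence lengths; L is a dict keyed by pairs (h, i), h < i.
    k = len(n)
    M = [[0] * k for _ in range(k)]
    for h in range(k):
        M[h][h] = n[h]
        for i in range(h + 1, k):
            M[h][i] = M[i][h] = n[h] + n[i] - L[(h, i)]
    return M

# The first example above (n = [5, 5, 5], L12 = L13 = 8, L23 = 10) yields
# [[5, 2, 2], [2, 5, 0], [2, 0, 5]], the matrix M for which RIP answers Yes.
print(rip_matrix([5, 5, 5], {(0, 1): 8, (0, 2): 8, (1, 2): 10}))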

Chvátal [Chv80] showed that RIP is NP-complete for a special class of matrices M where M[i,i]=3 for every i. Therefore, we have the following result.

Theorem 4.1.

EAIL is NP-complete when |s|=3 for each sequence s in the k-sequence, and it is NP-hard when the lengths are arbitrary.

4.3 𝐍𝐌𝐒𝐀-3\mathbf{NMSA}\text{-}3

Let S=s1,,skS=s_{1},\ldots,s_{k} be a kk-sequence and A=[s1,,sk]A=[s^{\prime}_{1},\ldots,s^{\prime}_{k}] be an alignment of SS. As defined in Equation (13), Vγ3[A]{\mathrm{V}^{3}_{\!\gamma}}[A] takes into account the lengths of induced alignments of AA. In the optimization version of 𝐍𝐌𝐒𝐀-3\mathbf{NMSA}\text{-}3, given a kk-tuple SS, the task is to determine optVγ3(S)\mathrm{opt}{\mathrm{V}^{3}_{\!\gamma}}(S) for a fixed matrix γ\gamma.

Let N=\sum_{i}\left|s_{i}\right|. Notice that an alignment A that has exactly one symbol different from a space in each column is such that \sum_{h=1}^{k-1}\sum_{i=h+1}^{k}\left|A_{\{h,i\}}\right|=(k-1)N, which is the maximum possible value of this sum.

Here, each entry D(v,L)D(\vec{v},L) of DD stores the SP-score of an alignment AA of the prefix S(v)S(\vec{v}) with the minimum SP-score such that i<h|A{i,h}|=L\sum_{i<h}|A_{\{i,h\}}|=L. The Boolean vectors b\vec{b} are used to represent the contribution to the sum of the lengths of the induced alignments. Thus, we define b=(k2)h<i,bh=bi=01\|\vec{b}\|={k\choose 2}-\sum_{h<i,b_{h}=b_{i}=0}1. Notice that if bS(v)\vec{b}\cdot S(\vec{v}) is the last column of an alignment AA and L=h=1k1i=h+1k|A{h,i}|L=\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}|A_{\{h,i\}}| is the sum of the lengths of the alignments induced by AA, then the sum of the lengths of the alignments induced by A(1:|A|1)A(1\!:\!\left|A\right|-1) is LbL-\|\vec{b}\|. Therefore,

D(v,L)={0,if v=0L=0,,if v=0L0 or v0L=0,minbk,bv,bL{D(vb,Lb)+vSPγ[bS(v)]},otherwise.D(\vec{v},L)=\left\{\begin{array}[]{ll}0\,,&\mbox{if $\vec{v}=\vec{0}$, $L=0$}\,,\\ \infty\,,&\mbox{if $\vec{v}=\vec{0}$, $L\not=0$ or $\vec{v}\not=\vec{0}$, $L=0$}\,,\\ \displaystyle\min_{\vec{b}\in\mathcal{B}_{k},\vec{b}\leq\vec{v},\|\vec{b}\|\leq L}\left\{D(\vec{v}-\vec{b},L-\|\vec{b}\|)+v{\mathrm{SP}_{\!\gamma}}[\vec{b}\cdot S(\vec{v})]\right\}\,,&\mbox{otherwise}\,.\end{array}\right.

Algorithm 4 provides more details about the procedure for computing optVγ3\mathrm{opt}{\mathrm{V}^{3}_{\!\gamma}}.

Algorithm 4
0: a kk-sequence S=s1,,skS=s_{1},\ldots,s_{k} such that ni=|si|n_{i}=\left|s_{i}\right|
0:optVγ3(S)\mathrm{opt}{\mathrm{V}^{3}_{\!\gamma}}(S)
1:D(0,0)0D(\vec{0},0)\leftarrow 0
2:for each L0L\not=0 do D(0,L)D(\vec{0},L)\leftarrow\infty
3:for each v0\vec{v}\not=\vec{0} do D(v,0)D(\vec{v},0)\leftarrow\infty
4:for each 0<vn\vec{0}<\vec{v}\leq\vec{n} in lexicographical order do
5:  for L1,2,,N(k1)L\leftarrow 1,2,\ldots,N(k-1) do
6:   D(v,L)minbk,bv,bL{D(vb,Lb)+vSPγ[bS(v)]}D(\vec{v},L)\leftarrow\min_{\vec{b}\in\mathcal{B}^{{k}},\vec{b}\leq\vec{v},\|\vec{b}\|\leq L}\big{\{}D(\vec{v}-\vec{b},L-\|\vec{b}\|)+v{\mathrm{SP}_{\!\gamma}}[\vec{b}\cdot S(\vec{v})]\big{\}}
7:return  minL{D(n,L)/L}\min_{L}\big{\{}D(\vec{n},L)/L\big{\}}

Assume that all sequences in SS have length nn. In this case, table DD has (nk2nk+1)(n+1)k=O(k2(n+1)k+1)(nk^{2}-nk+1)\cdot(n+1)^{k}=O(k^{2}(n+1)^{k+1}) entries. Since the time required to determine each entry of DD is O(2kk2)O(2^{k}k^{2}), the running time of Algorithm 4 is O(2kk2k2(n+1)k+1)=O(2kk4(n+1)k+1)O(2^{k}k^{2}\cdot k^{2}(n+1)^{k+1})=O(2^{k}k^{4}(n+1)^{k+1}).
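A brute-force transcription of Algorithm 4 follows; as before, the function name nmsa3 and the dictionary representation of \gamma (with a space written as '-' and gamma[('-', '-')] assumed to be 0) are illustrative, and the exponential running time restricts the sketch to small instances.

from itertools import product

def nmsa3(seqs, gamma):
    # Sketch of Algorithm 4: D[(v, L)] is the minimum SP-score of an
    # alignment of the prefixes S(v) whose induced pairwise alignments
    # have total length exactly L.
    k = len(seqs)
    n = tuple(len(s) for s in seqs)
    N = sum(n)
    INF = float('inf')
    zero = tuple([0] * k)
    pairs = [(h, i) for h in range(k) for i in range(h + 1, k)]

    def norm(b):
        # ||b||: number of induced pairwise alignments lengthened by this column
        return sum(1 for (h, i) in pairs if b[h] or b[i])

    def col_score(col):
        return sum(gamma[(col[h], col[i])] for (h, i) in pairs)

    D = {(zero, 0): 0}
    for v in product(*(range(x + 1) for x in n)):    # lexicographic order
        if v == zero:
            continue
        for L in range(1, (k - 1) * N + 1):
            best = INF
            for b in product((0, 1), repeat=k):
                if sum(b) == 0 or any(b[i] > v[i] for i in range(k)) or norm(b) > L:
                    continue
                prev = D.get((tuple(v[i] - b[i] for i in range(k)), L - norm(b)), INF)
                if prev == INF:
                    continue
                col = tuple(seqs[i][v[i] - 1] if b[i] else '-' for i in range(k))
                best = min(best, prev + col_score(col))
            if best < INF:
                D[(v, L)] = best

    return min(D[(n, L)] / L for L in range(1, (k - 1) * N + 1) if (n, L) in D)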

5 Approximation algorithms for 𝐌𝐒𝐀\mathbf{MSA} and 𝐍𝐌𝐒𝐀-2\mathbf{NMSA}\text{-}2

Gusfield [Gus93] described a 2-approximation algorithm for \mathbf{MSA} assuming that \gamma\in\mathbb{M}^{\mathrm{C}}. In this section, we adapt Gusfield's algorithm, proposing a 6-approximation algorithm for \mathbf{MSA} when \gamma\in\mathbb{M}^{\mathrm{W}} and a 12-approximation algorithm for the \mathbf{NMSA}\text{-}2 problem when \gamma\in\mathbb{M}^{\mathrm{N}}.

We consider here a scoring function v=v{\mathrm{A}_{\gamma}} when \gamma\in\mathbb{M}^{\mathrm{W}} and v=v{\mathrm{N}_{\gamma}} when \gamma\in\mathbb{M}^{\mathrm{N}}. Also, for a 2-sequence s,t, define \mathrm{opt}(s,t)=\min_{A\in\mathcal{A}_{s,t}}\{v[A]\}; a v-optimal alignment of s,t is an alignment A such that v[A]=\mathrm{opt}(s,t). It follows from [AS06] that \mathrm{opt} is a metric on \Sigma^{*}. For a k-sequence S, define V[A]=\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}v[A_{\{h,i\}}] and \mathrm{OPT}(S)=\min_{A\in\mathcal{A}_{S}}V[A]; a V-optimal alignment of S is an alignment A such that V[A]=\mathrm{OPT}(S).

Let c be an integer, 1\leq c\leq k. A star X with center c (also called a c-star) of the k-sequence S=s_{1},\ldots,s_{k} is a collection of k-1 alignments: X_{h}=[s^{\prime}_{h},s^{h}_{c}] of s_{h},s_{c} for each h<c and X_{h}=[s^{h}_{c},s^{\prime}_{h}] of s_{c},s_{h} for each h>c. The set of all stars with center c is denoted by \mathcal{X}_{c}. The score of the c-star X is cStar(X)=\sum_{h\not=c}v[X_{h}], and a v-optimal star is one whose score is \mathrm{optStar}(S)=\min_{c\in\{1,\ldots,k\},\,X\in\mathcal{X}_{c}}\{cStar(X)\}. Notice that in a v-optimal c-star, v[X_{h}]=\mathrm{opt}(s_{h},s_{c}) if h<c and v[X_{h}]=\mathrm{opt}(s_{c},s_{h}) if c<h and, by the symmetry of \mathrm{opt}, we have

optStar(S)=minc{hcopt(sh,sc)},\mathrm{optStar}(S)=\min_{c}\left\{\sum_{h\not=c}\mathrm{opt}(s_{h},s_{c})\right\},

and if |s|n\left|s\right|\leq n for each sSs\in S, optStar(S)\mathrm{optStar}(S) can be computed in O(k2n2)O(k^{2}n^{2})-time when v=vAγv=v{\mathrm{A}_{\gamma}} and O(k2n3)O(k^{2}n^{3})-time when v=vNγv=v{\mathrm{N}_{\gamma}}.
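A v-optimal star can therefore be obtained by computing the \binom{k}{2} pairwise opt values and picking the best center, as in the sketch below. The callback pair_opt, assumed to return \mathrm{opt}(s,t) (for instance via a standard pairwise alignment algorithm when v=v{\mathrm{A}_{\gamma}}, or a normalized variant when v=v{\mathrm{N}_{\gamma}}), and the function name optimal_star_center are illustrative.

def optimal_star_center(seqs, pair_opt):
    # choose the center c minimizing sum_{h != c} opt(s_h, s_c)
    k = len(seqs)
    # opt is symmetric, so C(k, 2) pairwise computations suffice
    d = {}
    for h in range(k):
        for i in range(h + 1, k):
            d[(h, i)] = d[(i, h)] = pair_opt(seqs[h], seqs[i])
    scores = [sum(d[(h, c)] for h in range(k) if h != c) for c in range(k)]
    best = min(range(k), key=lambda c: scores[c])
    return best, scores[best]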

We say that an alignment A of S and a c-star X are compatible (A is compatible with X and X is compatible with A) in S when either A_{\{h,c\}} (when h<c) or A_{\{c,h\}} (when c<h) is equal to X_{h} for each h. Given the alignment A of S and a fixed c, it is easy to obtain the c-star X compatible with A (and there exists only one). On the other hand, an important known result in alignment studies, from Feng and Doolittle [FD87], is that we can find an alignment A that is compatible with a given c-star X in O(kn)-time, where |s|\leq n for each sequence s in S. In general, there exist many alignments compatible with X.

It is easy to adapt the following result from Gusfield [Gus93] to a kk-sequence.

Lemma 5.1.

Given a kk-sequence SS,

optStar(S)2kOPT(S).\mathrm{optStar}(S)\leq\frac{2}{k}\cdot\mathrm{OPT}(S)\,.
Proof.

Let X be a v-optimal star of S with center c, and let A be a V-optimal alignment of S. Then,

k\cdot\mathrm{optStar}(S) = k\cdot cStar(X) = k\cdot\sum_{h\not=c}\mathrm{opt}(s_{h},s_{c}) \leq \sum_{c^{\prime}=1}^{k}\sum_{h\not=c^{\prime}}\mathrm{opt}(s_{h},s_{c^{\prime}}) (18)
= 2\cdot\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}\mathrm{opt}(s_{h},s_{i}) (19)
\leq 2\cdot\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}v[A_{\{h,i\}}] = 2\,V[A] = 2\cdot\mathrm{OPT}(S)\,, (20)

where (18) follows from the definition of a star and from the optimality of the star X, whose center minimizes \sum_{h\not=c^{\prime}}\mathrm{opt}(s_{h},s_{c^{\prime}}) over all choices of center c^{\prime}; (19) follows from the symmetry of \mathrm{opt}; and (20) follows since \mathrm{opt}(s_{h},s_{i})\leq v[A_{\{h,i\}}] for every pair h,i and from the optimality of the alignment A.

Therefore, \mathrm{optStar}(S)\leq(2/k)\cdot\mathrm{OPT}(S). ∎

Let A=[s,t]A=[s^{\prime},t^{\prime}] be an alignment of a 2-sequence s,ts,t. We say that a column jj is splittable in AA if s(j)-s^{\prime}(j)\not=\texttt{-}, t(j)-t^{\prime}(j)\not=\texttt{-} and min{γt(j)-,γs(j)-}γs(j)t(j)\min\{{\gamma}_{{t^{\prime}(j)}\rightarrow{\texttt{-}}},{\gamma}_{{s^{\prime}(j)}\rightarrow{\texttt{-}}}\}\leq{\gamma}_{{s^{\prime}(j)}\rightarrow{t^{\prime}(j)}}. Let J:={ji:1j1<<jm|A|and ji is splittable in A}J:=\{j_{i}\in\mathbb{N}:1\leq j_{1}<\cdots<j_{m}\leq\left|A\right|\mbox{and $j_{i}$ is splittable in $A$}\}. An AA-splitting is the alignment

[s(1:j11)s(j1)-s(j1+1:j21)s(j2)-s(jm)s(jm+1:|A|)t(1:j11)-t(j1)t(j1+1:j21)-t(j2)t(jm)t(jm+1:|A|)].\left[\begin{array}[]{cccccccccc}s^{\prime}(1\!:\!j_{1}-1)&s^{\prime}(j_{1})&\texttt{-}&s^{\prime}(j_{1}+1\!:\!j_{2}-1)&s^{\prime}(j_{2})&\texttt{-}&\ldots&s^{\prime}(j_{m})&-&s^{\prime}(j_{m}+1\!:\!\left|A\right|)\\ t^{\prime}(1\!:\!j_{1}-1)&\texttt{-}&t^{\prime}(j_{1})&t^{\prime}(j_{1}+1\!:\!j_{2}-1)&\texttt{-}&t^{\prime}(j_{2})&\ldots&-&t^{\prime}(j_{m})&t^{\prime}(j_{m}+1\!:\!\left|A\right|)\end{array}\right].

We say that JJ is required to split AA. The following proposition is used to check properties of an AA-splitting.

Proposition 5.2.

Consider γ𝕄W\gamma\in\mathbb{M}^{\mathrm{W}} and 𝚊,𝚋Σ\mathtt{a},\mathtt{b}\in\Sigma. If γ𝚊->γ𝚊𝚋{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}>{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{b}}} or γ𝚊->γ𝚋𝚊{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}>{\gamma}_{{\mathtt{b}}\rightarrow{\mathtt{a}}}, then γ𝚊𝚋=γ𝚋𝚊{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{b}}}={\gamma}_{{\mathtt{b}}\rightarrow{\mathtt{a}}}.

Proof.

Since γ𝕄W\gamma\in\mathbb{M}^{\mathrm{W}}, we have that γ𝚊-=γ-𝚊>0{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}={\gamma}_{{\texttt{-}}\rightarrow{\mathtt{a}}}>0 and γ𝚋-=γ-𝚋>0{\gamma}_{{\mathtt{b}}\rightarrow{\texttt{-}}}={\gamma}_{{\texttt{-}}\rightarrow{\mathtt{b}}}>0. Suppose that γ𝚊->γ𝚊𝚋{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}>{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{b}}}. Then, γ𝚊-+γ-𝚋>γ𝚊->γ𝚊𝚋{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}+{\gamma}_{{\texttt{-}}\rightarrow{\mathtt{b}}}>{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}>{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{b}}}, and we have that γ𝚊𝚋=γ𝚋𝚊{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{b}}}={\gamma}_{{\mathtt{b}}\rightarrow{\mathtt{a}}} since γ𝕄W\gamma\in\mathbb{M}^{\mathrm{W}}. Assume now that γ𝚊->γ𝚋𝚊{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}>{\gamma}_{{\mathtt{b}}\rightarrow{\mathtt{a}}}. It follows that γ𝚋-+γ-𝚊>γ-𝚊=γ𝚊->γ𝚋𝚊{\gamma}_{{\mathtt{b}}\rightarrow{\texttt{-}}}+{\gamma}_{{\texttt{-}}\rightarrow{\mathtt{a}}}>{\gamma}_{{\texttt{-}}\rightarrow{\mathtt{a}}}={\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}>{\gamma}_{{\mathtt{b}}\rightarrow{\mathtt{a}}}, which implies that γ𝚊𝚋=γ𝚋𝚊{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{b}}}={\gamma}_{{\mathtt{b}}\rightarrow{\mathtt{a}}} since γ𝕄W\gamma\in\mathbb{M}^{\mathrm{W}}. ∎
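Returning to the splitting operation, a direct transcription of the definition of an A-splitting is sketched below; the pairwise alignment is given as two gapped strings and \gamma uses the same illustrative dictionary representation as in the earlier sketches.

def split_alignment(s_row, t_row, gamma):
    # Every splittable column (both symbols non-space and
    # min(gamma[t, -], gamma[s, -]) <= gamma[s, t]) is replaced by two
    # columns in which each of the two symbols faces a space.
    new_s, new_t = [], []
    for a, b in zip(s_row, t_row):
        splittable = (a != '-' and b != '-' and
                      min(gamma[(b, '-')], gamma[(a, '-')]) <= gamma[(a, b)])
        if splittable:
            new_s += [a, '-']
            new_t += ['-', b]
        else:
            new_s.append(a)
            new_t.append(b)
    return ''.join(new_s), ''.join(new_t)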

Let X=\{X_{1},\ldots,X_{c-1},X_{c+1},\ldots,X_{k}\} be a c-star. An X-starsplitting is the c-star Y=\{Y_{1},\ldots,Y_{c-1},Y_{c+1},\ldots,Y_{k}\} where Y_{j} is the X_{j}-splitting for each j. The next result shows that the v-score of the star Y is bounded by the v-score of the star X when \gamma\in\mathbb{M}^{\mathrm{W}} and v=v{\mathrm{A}_{\gamma}} or \gamma\in\mathbb{M}^{\mathrm{N}} and v=v{\mathrm{N}_{\gamma}}.

Lemma 5.3.

Let SS be a kk-sequence, XX be a star of SS, YY be the XX-starsplitting and vv be a function to score alignments. Consider γ𝕄W\gamma\in\mathbb{M}^{\mathrm{W}} and v=vAγv=v{\mathrm{A}_{\gamma}} or v=vNγv=v{\mathrm{N}_{\gamma}}. Then, YY is also a cc-star and

cStar(Y)3cStar(X).cStar(Y)\leq 3\cdot cStar(X)\,.
Proof.

Consider an alignment Xh=[s,t]XX_{h}=[s^{\prime},t^{\prime}]\in X and a set JJ which is required to split XhX_{h}. Then,

vAγ[Yh]\displaystyle v{\mathrm{A}_{\gamma}}[Y_{h}] =vAγ[Xh]+jJ(γs(j)-+γ-t(j)γs(j)t(j))\displaystyle=v{\mathrm{A}_{\gamma}}[X_{h}]+\sum_{j\in J}\left({\gamma}_{{s^{\prime}(j)}\rightarrow{\texttt{-}}}+{\gamma}_{{\texttt{-}}\rightarrow{t^{\prime}(j)}}-{\gamma}_{{s^{\prime}(j)}\rightarrow{t^{\prime}(j)}}\right)
vAγ[Xh]+jJ(γs(j)-+γ-t(j))vAγ[Xh]+2jJmin{γs(j)-,γ-t(j)}\displaystyle\leq v{\mathrm{A}_{\gamma}}[X_{h}]+\sum_{j\in J}\left({\gamma}_{{s^{\prime}(j)}\rightarrow{\texttt{-}}}+{\gamma}_{{\texttt{-}}\rightarrow{t^{\prime}(j)}}\right)\leq v{\mathrm{A}_{\gamma}}[X_{h}]+2\cdot\sum_{j\in J}\min\{{\gamma}_{{s^{\prime}(j)}\rightarrow{\texttt{-}}},{\gamma}_{{\texttt{-}}\rightarrow{t^{\prime}(j)}}\} (21)
vAγ[Xh]+2jJγs(j)t(j),vAγ[Xh]+2vAγ[Xh]=3vAγ[Xh],\displaystyle\leq v{\mathrm{A}_{\gamma}}[X_{h}]+2\cdot\sum_{j\in J}{\gamma}_{{s^{\prime}(j)}\rightarrow{t^{\prime}(j)}},\leq v{\mathrm{A}_{\gamma}}[X_{h}]+2\cdot v{\mathrm{A}_{\gamma}}[X_{h}]=3\cdot v{\mathrm{A}_{\gamma}}[X_{h}]\,, (22)

where (21) holds because, since \gamma\in\mathbb{M}^{\mathrm{W}}, {\gamma}_{{s^{\prime}(j)}\rightarrow{t^{\prime}(j)}}\geq 0, and (22) holds because each j\in J is splittable and every entry of \gamma is nonnegative. Thus, v{\mathrm{A}_{\gamma}}[Y_{h}]\leq 3\cdot v{\mathrm{A}_{\gamma}}[X_{h}]. Furthermore,

vNγ[Yh]\displaystyle v{\mathrm{N}_{\gamma}}[Y_{h}] =vAγ[Yh]|Yh|3vAγ[Xh]|Yh|3vAγ[Xh]|Xh|=3vNγ[Xh].\displaystyle=\frac{v{\mathrm{A}_{\gamma}}[Y_{h}]}{\left|Y_{h}\right|}\leq\frac{3\cdot v{\mathrm{A}_{\gamma}}[X_{h}]}{\left|Y_{h}\right|}\leq\frac{3\cdot v{\mathrm{A}_{\gamma}}[X_{h}]}{\left|X_{h}\right|}=3\cdot v{\mathrm{N}_{\gamma}}[X_{h}]\,.

Hence, v[Y_{h}]\leq 3\cdot v[X_{h}] when v=v{\mathrm{A}_{\gamma}} or v=v{\mathrm{N}_{\gamma}} and \gamma\in\mathbb{M}^{\mathrm{W}}, which implies that

cStar(Y)=hcv[Yh]3hcv[Xh]=3cStar(X).cStar(Y)=\sum_{h\not=c}v[Y_{h}]\leq 3\cdot\sum_{h\not=c}v[X_{h}]=3\cdot cStar(X)\,.

Notice that the time consumption for computing an XX-splitting from XX is O(kn)O(kn) when |s|n\left|s\right|\leq n for each sSs\in S.

Considering a star X with center c of S=s_{1},\ldots,s_{k}, there can exist many alignments compatible with a star Y which is an X-starsplitting. Let CompatibleAlign be a subroutine that receives the star Y and returns an alignment A compatible with Y. It is quite simple: if symbols s_{h}(j_{1}) and s_{c}(j_{2}) are aligned in Y_{h}, they are also aligned in A; otherwise, s_{h}(j) is aligned only with spaces in A. This property is enough to guarantee the approximation factors for \mathbf{MSA} and \mathbf{NMSA}\text{-}2. As an example, for S=\mathtt{aaa},\mathtt{bbbbb},\mathtt{cc},\mathtt{ddd},\mathtt{eeeeee} and

X={[𝚊𝚊𝚊--𝚍𝚍𝚍],[𝚋𝚋-𝚋𝚋𝚋-𝚍𝚍--𝚍],[𝚌𝚌-𝚍𝚍𝚍],[---𝚍𝚍𝚍-𝚎𝚎𝚎𝚎-𝚎𝚎]}X=\left\{\left[\begin{array}[]{cccc}\mathtt{a}&\mathtt{a}&\mathtt{a}&\texttt{-}\\ \texttt{-}&\mathtt{d}&\mathtt{d}&\mathtt{d}\\ \end{array}\right],\left[\begin{array}[]{cccccc}\mathtt{b}&\mathtt{b}&\texttt{-}&\mathtt{b}&\mathtt{b}&\mathtt{b}\\ \texttt{-}&\mathtt{d}&\mathtt{d}&\texttt{-}&\texttt{-}&\mathtt{d}\end{array}\right],\left[\begin{array}[]{ccc}\mathtt{c}&\mathtt{c}&\texttt{-}\\ \mathtt{d}&\mathtt{d}&\mathtt{d}\end{array}\right],\left[\begin{array}[]{ccccccc}\texttt{-}&\texttt{-}&\texttt{-}&\mathtt{d}&\mathtt{d}&\mathtt{d}&\texttt{-}\\ \mathtt{e}&\mathtt{e}&\mathtt{e}&\mathtt{e}&\texttt{-}&\mathtt{e}&\mathtt{e}\end{array}\right]\right\}

a star with center 44, we obtain the alignment

[𝚊----𝚊𝚊-----𝚋---𝚋-𝚋𝚋𝚋------𝚌𝚌---------𝚍𝚍--𝚍---𝚎𝚎𝚎𝚎---𝚎𝚎].\left[\begin{array}[]{ccccccccccc}\mathtt{a}&\texttt{-}&\texttt{-}&\texttt{-}&\texttt{-}&\mathtt{a}&\mathtt{a}&\texttt{-}&\texttt{-}&\texttt{-}&\texttt{-}\\ \texttt{-}&\mathtt{b}&\texttt{-}&\texttt{-}&\texttt{-}&\mathtt{b}&\texttt{-}&\mathtt{b}&\mathtt{b}&\mathtt{b}&\texttt{-}\\ \texttt{-}&\texttt{-}&\texttt{-}&\texttt{-}&\texttt{-}&\mathtt{c}&\mathtt{c}&\texttt{-}&\texttt{-}&\texttt{-}&\texttt{-}\\ \texttt{-}&\texttt{-}&\texttt{-}&\texttt{-}&\texttt{-}&\mathtt{d}&\mathtt{d}&\texttt{-}&\texttt{-}&\mathtt{d}&\texttt{-}\\ \texttt{-}&\texttt{-}&\mathtt{e}&\mathtt{e}&\mathtt{e}&\mathtt{e}&\texttt{-}&\texttt{-}&\texttt{-}&\mathtt{e}&\mathtt{e}\end{array}\right]\,.
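A sketch of the CompatibleAlign subroutine is given below. The star is passed as a dictionary mapping each non-center index h to the pair of gapped rows of Y_{h} (the row of s_{h} followed by the row of s_{c}); this representation and the function name are choices of the sketch, and the exact placement of the columns in which a symbol faces only spaces may differ from the example above, since any alignment satisfying the stated property is compatible with the star.

def compatible_align(center_seq, star):
    # star[h] = (row of s_h, row of s_c) taken from Y_h, as gapped strings
    others = sorted(star)
    links, seqs = {}, {}
    for h in others:
        row_h, row_c = star[h]
        seqs[h] = [a for a in row_h if a != '-']
        pos_c, link = 0, []             # link[j] = center position aligned
        for a, b in zip(row_h, row_c):  # with the j-th symbol of s_h, or None
            if a != '-':
                link.append(pos_c if b != '-' else None)
            if b != '-':
                pos_c += 1
        links[h] = link

    all_rows = others + ['c']
    columns, cur = [], {h: 0 for h in others}

    def solo(h):
        # symbols of s_h not aligned with the next center symbol get their
        # own column, facing only spaces
        while cur[h] < len(seqs[h]) and links[h][cur[h]] is None:
            col = {r: '-' for r in all_rows}
            col[h] = seqs[h][cur[h]]
            columns.append(col)
            cur[h] += 1

    for p, c_sym in enumerate(center_seq):
        for h in others:
            solo(h)
        col = {r: '-' for r in all_rows}
        col['c'] = c_sym
        for h in others:                # symbols aligned with s_c(p) in the star
            if cur[h] < len(seqs[h]) and links[h][cur[h]] == p:
                col[h] = seqs[h][cur[h]]
                cur[h] += 1
        columns.append(col)
    for h in others:
        solo(h)                         # trailing unaligned symbols

    return {r: ''.join(col[r] for col in columns) for r in all_rows}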

Let Qmax:=max𝚊Σ{γ𝚊-,γ-𝚊}Q_{\max}:=\max_{\mathtt{a}\in\Sigma}\{{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}},{\gamma}_{{\texttt{-}}\rightarrow{\mathtt{a}}}\} and consider the following result.

Proposition 5.4.

Let SS be a kk-sequence, XX be a star of SS with center cc and YY be the XX-starsplitting. Assume that γ𝕄W\gamma\in\mathbb{M}^{\mathrm{W}} and that CompatibleAlign(Y)\textsc{Compati\-bleAlign}(Y) returns A=[s1,,sk]A=[s^{\prime}_{1},\ldots,s^{\prime}_{k}]. If hch\not=c and ici\not=c, we have that

  1. (i)

    γsh(j)si(j)γsh(j)sc(j)+γsc(j)si(j){\gamma}_{{s^{\prime}_{h}(j)}\rightarrow{s^{\prime}_{i}(j)}}\leq{\gamma}_{{s^{\prime}_{h}(j)}\rightarrow{s^{\prime}_{c}(j)}}+{\gamma}_{{s^{\prime}_{c}(j)}\rightarrow{s^{\prime}_{i}(j)}} for each j=1,,|A|j=1,\ldots,\left|A\right|, and

  2. (ii)

    vNγ[A{h,i}]2Qmaxv{\mathrm{N}_{\gamma}}[A_{\{h,i\}}]\leq 2\cdot Q_{\max}.

Proof.

Assume that sh(j)=𝚊,si(j)=𝚋s^{\prime}_{h}(j)=\mathtt{a},s^{\prime}_{i}(j)=\mathtt{b} and sc(j)=𝚌s^{\prime}_{c}(j)=\mathtt{c}.

First we show that (i) γ𝚊𝚋γ𝚊𝚌+γ𝚌𝚋{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{b}}}\leq{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{c}}}+{\gamma}_{{\mathtt{c}}\rightarrow{\mathtt{b}}} for each j=1,,|A|j=1,\ldots,\left|A\right|, by analyzing all possible values of 𝚊,𝚋\mathtt{a},\mathtt{b} and 𝚌\mathtt{c}. The case when 𝚊=-\mathtt{a}=\texttt{-} or 𝚋=-\mathtt{b}=\texttt{-} can be checked by definition of γ𝕄W\gamma\in\mathbb{M}^{\mathrm{W}}. Thus, we assume that 𝚊-\mathtt{a}\not=\texttt{-} and 𝚋-\mathtt{b}\not=\texttt{-}, which implies by the alignment construction in CompatibleAlign, that 𝚌-\mathtt{c}\not=\texttt{-}. Since 𝚊-\mathtt{a}\not=\texttt{-}, 𝚋-\mathtt{b}\not=\texttt{-}, 𝚌-\mathtt{c}\not=\texttt{-} and YY is a starsplitting, we have that γ𝚊->γ𝚌𝚊{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}>{\gamma}_{{\mathtt{c}}\rightarrow{\mathtt{a}}}, γ𝚋->γ𝚌𝚋{\gamma}_{{\mathtt{b}}\rightarrow{\texttt{-}}}>{\gamma}_{{\mathtt{c}}\rightarrow{\mathtt{b}}} and, since γ𝕄W\gamma\in\mathbb{M}^{\mathrm{W}}, γ-𝚋=γ𝚋->γ𝚌𝚋{\gamma}_{{\texttt{-}}\rightarrow{\mathtt{b}}}={\gamma}_{{\mathtt{b}}\rightarrow{\texttt{-}}}>{\gamma}_{{\mathtt{c}}\rightarrow{\mathtt{b}}}. Since γ𝚊->γ𝚌𝚊{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}>{\gamma}_{{\mathtt{c}}\rightarrow{\mathtt{a}}}, it follows from Proposition 5.2 that γ𝚊->γ𝚊𝚌{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}>{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{c}}}. Hence, γ𝚊-+γ-𝚋>γ𝚊𝚌+γ𝚌𝚋{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}+{\gamma}_{{\texttt{-}}\rightarrow{\mathtt{b}}}>{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{c}}}+{\gamma}_{{\mathtt{c}}\rightarrow{\mathtt{b}}}, which implies from the definition of 𝕄W\mathbb{M}^{\mathrm{W}} that γ𝚊𝚋γ𝚊𝚌+γ𝚌𝚋{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{b}}}\leq{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{c}}}+{\gamma}_{{\mathtt{c}}\rightarrow{\mathtt{b}}}.

Finally, we show (ii). Here, it is enough to prove that {\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{b}}}\leq 2\cdot Q_{\max} for each column [\mathtt{a},\mathtt{b}] of A_{\{h,i\}}. Again, the case when \mathtt{a}=\texttt{-} or \mathtt{b}=\texttt{-} can easily be checked. Thus, assume that \mathtt{a}\not=\texttt{-} and \mathtt{b}\not=\texttt{-}, which implies by construction that \mathtt{c}\not=\texttt{-}. Since Y is a splitting, it follows that {\gamma}_{{\mathtt{c}}\rightarrow{\mathtt{a}}}<{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}} and {\gamma}_{{\mathtt{c}}\rightarrow{\mathtt{b}}}<{\gamma}_{{\mathtt{b}}\rightarrow{\texttt{-}}}. Since {\gamma}_{{\mathtt{c}}\rightarrow{\mathtt{a}}}<{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}, it follows from Proposition 5.2 that {\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{c}}}<{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}. It follows from (i) that {\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{b}}}\leq{\gamma}_{{\mathtt{a}}\rightarrow{\mathtt{c}}}+{\gamma}_{{\mathtt{c}}\rightarrow{\mathtt{b}}}<{\gamma}_{{\mathtt{a}}\rightarrow{\texttt{-}}}+{\gamma}_{{\mathtt{b}}\rightarrow{\texttt{-}}}\leq Q_{\max}+Q_{\max}=2\cdot Q_{\max}. Consequently, we have that v{\mathrm{N}_{\gamma}}[A_{\{h,i\}}]=v{\mathrm{A}_{\gamma}}[A_{\{h,i\}}]/|A_{\{h,i\}}|\leq 2\cdot Q_{\max}|A_{\{h,i\}}|/|A_{\{h,i\}}|=2\cdot Q_{\max}. ∎

Lemma 5.5.

Let S be a k-sequence, X a star of S with center c, Y an X-starsplitting and A=\textsc{CompatibleAlign}(Y). Then,

  1. (i)

    vAγ[A{h,i}]vAγ[A{h,c}]+vAγ[A{c,i}]v{\mathrm{A}_{\gamma}}[A_{\{h,i\}}]\leq v{\mathrm{A}_{\gamma}}[A_{\{h,c\}}]+v{\mathrm{A}_{\gamma}}[A_{\{c,i\}}]  when γ𝕄W\gamma\in\mathbb{M}^{\mathrm{W}} and

  2. (ii)

    vNγ[A{h,i}]2(vNγ[A{h,c}]+vNγ[A{c,i}])v{\mathrm{N}_{\gamma}}[A_{\{h,i\}}]\leq 2\cdot\Big{(}v{\mathrm{N}_{\gamma}}[A_{\{h,c\}}]+v{\mathrm{N}_{\gamma}}[A_{\{c,i\}}]\Big{)}  when γ𝕄N\gamma\in\mathbb{M}^{\mathrm{N}},

for each h<ih<i, hch\not=c, ici\not=c.

Proof.

Consider A=[s1,,sk]A=[s^{\prime}_{1},\ldots,s^{\prime}_{k}], h,ih,i\in\mathbb{N} and J={j:sc(j)- and sh(j)=si(j)=-}J=\{j:s^{\prime}_{c}(j)\not=\texttt{-}\mbox{ and }s^{\prime}_{h}(j)=s^{\prime}_{i}(j)=\texttt{-}\}. Then,

vAγ[A{h,i}]+jJ(γ-sc(j)+γsc(j)-)\displaystyle v{\mathrm{A}_{\gamma}}[A_{\{h,i\}}]+\sum_{j\in J}({\gamma}_{{\texttt{-}}\rightarrow{s^{\prime}_{c}(j)}}+{\gamma}_{{s^{\prime}_{c}(j)}\rightarrow{\texttt{-}}}) =jJγsh(j)si(j)+jJ(γ-sc(j)+γsc(j)-)\displaystyle=\sum_{j\not\in J}{\gamma}_{{s^{\prime}_{h}(j)}\rightarrow{s^{\prime}_{i}(j)}}+\sum_{j\in J}({\gamma}_{{\texttt{-}}\rightarrow{s^{\prime}_{c}(j)}}+{\gamma}_{{s^{\prime}_{c}(j)}\rightarrow{\texttt{-}}})
jJ(γsh(j)sc(j)+γsc(j)si(j))+jJ(γ-sc(j)+γsc(j)-)\displaystyle\leq\sum_{j\not\in J}\left({\gamma}_{{s^{\prime}_{h}(j)}\rightarrow{s^{\prime}_{c}(j)}}+{\gamma}_{{s^{\prime}_{c}(j)}\rightarrow{s^{\prime}_{i}(j)}}\right)+\sum_{j\in J}({\gamma}_{{\texttt{-}}\rightarrow{s^{\prime}_{c}(j)}}+{\gamma}_{{s^{\prime}_{c}(j)}\rightarrow{\texttt{-}}})
=vAγ[A{h,c}]+vAγ[A{c,i}],\displaystyle=v{\mathrm{A}_{\gamma}}[A_{\{h,c\}}]+v{\mathrm{A}_{\gamma}}[A_{\{c,i\}}]\,,

where the inequality holds due to Proposition 5.4. Therefore,

v{\mathrm{A}_{\gamma}}[A_{\{h,i\}}] \leq v{\mathrm{A}_{\gamma}}[A_{\{h,c\}}]+v{\mathrm{A}_{\gamma}}[A_{\{c,i\}}]-\sum_{j\in J}({\gamma}_{{\texttt{-}}\rightarrow{s^{\prime}_{c}(j)}}+{\gamma}_{{s^{\prime}_{c}(j)}\rightarrow{\texttt{-}}})\,. (23)

Since γ𝕄W\gamma\in\mathbb{M}^{\mathrm{W}}, we have that γ-sc(j),γsc(j)->0{\gamma}_{{\texttt{-}}\rightarrow{s^{\prime}_{c}(j)}},{\gamma}_{{s^{\prime}_{c}(j)}\rightarrow{\texttt{-}}}>0. It follows from (23) that (i) is proven.

For (ii), observe that, by definition of \mathbb{M}^{\mathrm{N}}, we have that

Qmax=maxσΣ{γσ-,γ-σ}=maxσΣ{γσ-}2γsc(j)-=γ-sc(j)+γsc(j)-Q_{\max}=\max_{\sigma\in\Sigma}\big{\{}{\gamma}_{{\sigma}\rightarrow{\texttt{-}}},{\gamma}_{{\texttt{-}}\rightarrow{\sigma}}\big{\}}=\max_{\sigma\in\Sigma}\big{\{}{\gamma}_{{\sigma}\rightarrow{\texttt{-}}}\big{\}}\leq 2\,{\gamma}_{{s^{\prime}_{c}(j)}\rightarrow{\texttt{-}}}={\gamma}_{{\texttt{-}}\rightarrow{s^{\prime}_{c}(j)}}+{\gamma}_{{s^{\prime}_{c}(j)}\rightarrow{\texttt{-}}}

for every jj. Furthermore, following these statements, we have that

vNγ[A{h,i}]\displaystyle v{\mathrm{N}_{\gamma}}[A_{\{h,i\}}] =vAγ[A{h,i}]|A{h,i}|\displaystyle=\frac{v{\mathrm{A}_{\gamma}}[A_{\{h,i\}}]}{|A_{\{h,i\}}|}
vAγ[A{h,i}]+2Qmax|J||A{h,i}|+|J| 2vAγ[A{h,i}]+Qmax|J||A{h,i}|+|J|\displaystyle\leq\frac{v{\mathrm{A}_{\gamma}}[A_{\{h,i\}}]+2\cdot Q_{\max}\left|J\right|}{|A_{\{h,i\}}|+\left|J\right|}\;\leq\;2\cdot\frac{v{\mathrm{A}_{\gamma}}[A_{\{h,i\}}]+Q_{\max}\left|J\right|}{|A_{\{h,i\}}|+\left|J\right|} (24)
2vAγ[A{h,c}]+vAγ[A{c,i}]jJ(γ-sc(j)+γsc(j)-)+Qmax|J||A{h,i,c}||J|+|J|\displaystyle\leq 2\cdot\frac{v{\mathrm{A}_{\gamma}}[A_{\{h,c\}}]+v{\mathrm{A}_{\gamma}}[A_{\{c,i\}}]-\sum_{j\in J}({\gamma}_{{\texttt{-}}\rightarrow{s^{\prime}_{c}(j)}}+{\gamma}_{{s^{\prime}_{c}(j)}\rightarrow{\texttt{-}}})+Q_{\max}\left|J\right|}{|A_{\{h,i,c\}}|-\left|J\right|+\left|J\right|} (25)
2vAγ[A{h,c}]+vAγ[A{c,i}]Qmax|J|+Qmax|J||A{h,i,c}||J|+|J|=2(vAγ[A{h,c}]|[A{h,i,c}]|+vAγ[A{c,i}]|[A{h,i,c}]|)\displaystyle\leq 2\cdot\frac{v{\mathrm{A}_{\gamma}}[A_{\{h,c\}}]+v{\mathrm{A}_{\gamma}}[A_{\{c,i\}}]-Q_{\max}\left|J\right|+Q_{\max}\left|J\right|}{|A_{\{h,i,c\}}|-\left|J\right|+\left|J\right|}=2\cdot\left(\frac{v{\mathrm{A}_{\gamma}}[A_{\{h,c\}}]}{|[A_{\{h,i,c\}}]|}+\frac{v{\mathrm{A}_{\gamma}}[A_{\{c,i\}}]}{|[A_{\{h,i,c\}}]|}\right) (26)
2(vAγ[A{h,c}]|A{h,c}|+vAγ[A{c,i}]|A{c,i}|)=2(vNγ[A{h,c}]+vNγ[A{c,i}]),\displaystyle\leq 2\cdot\left(\frac{v{\mathrm{A}_{\gamma}}[A_{\{h,c\}}]}{|A_{\{h,c\}}|}+\frac{v{\mathrm{A}_{\gamma}}[A_{\{c,i\}}]}{|A_{\{c,i\}}|}\right)=2\cdot\Big{(}v{\mathrm{N}_{\gamma}}[A_{\{h,c\}}]+v{\mathrm{N}_{\gamma}}[A_{\{c,i\}}]\Big{)}\,, (27)

where the first inequality of (24) is a consequence of Proposition 5.4 and the second inequality follows since every entry of γ\gamma is nonnegative, (25) follows from (23) and from |A{h,i}|=|A{h,i,c}||J||A_{\{h,i\}}|=|A_{\{h,i,c\}}|-\left|J\right|, (26) follows as a consequence of γ𝕄N\gamma\in\mathbb{M}^{\mathrm{N}}, and (27) follows as a consequence of |A{h,c}||A{h,i,c}||A_{\{h,c\}}|\leq|A_{\{h,i,c\}}| and |A{c,i}||A{h,i,c}||A_{\{c,i\}}|\leq|A_{\{h,i,c\}}|. ∎

Observe now that the running time of CompatibleAlign is O(k2n)O(k^{2}n).

Algorithm 5
0:kk-sequence SS
0: v[A] such that A\in\mathcal{A}_{S} and v{\mathrm{SP}_{\!\gamma}}[A]\leq 6\cdot\mathrm{opt}{\mathrm{SP}_{\!\gamma}}(S) if v=v{\mathrm{A}_{\gamma}} and \gamma\in\mathbb{M}^{\mathrm{W}}, and {\mathrm{V}^{2}_{\!\gamma}}[A]\leq 12\cdot\mathrm{opt}{\mathrm{NSP}^{2}_{\!\gamma}}(S) if v=v{\mathrm{N}_{\gamma}} and \gamma\in\mathbb{M}^{\mathrm{N}}.
1: Let XX be a vv-optimal star of SS with center cc
2: Compute the XX-splitting YY
3:ACompatibleAlign(Y)A\leftarrow\textsc{CompatibleAlign}(Y)
4:return v[A]v[A]
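Putting the previous sketches together, Algorithm 5 can be read as the short pipeline below; pair_opt_align is assumed to return both opt(s,t) and a corresponding v-optimal pairwise alignment, and the helper names refer to the illustrative sketches given earlier, not to routines defined in the text.

def algorithm5(seqs, gamma, pair_opt_align):
    # line 1: a v-optimal star of S, identified by its center c
    k = len(seqs)
    c, _ = optimal_star_center(seqs, lambda s, t: pair_opt_align(s, t)[0])
    # line 2: split every splittable column of every alignment in the star
    star = {}
    for h in range(k):
        if h != c:
            _, (row_h, row_c) = pair_opt_align(seqs[h], seqs[c])
            star[h] = split_alignment(row_h, row_c, gamma)
    # line 3: an alignment compatible with the split star
    A = compatible_align(seqs[c], star)
    # line 4: the caller then scores A with v (vSP_gamma or V^2_gamma)
    return A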
Theorem 5.6.

Let SS be a kk-sequence and γ\gamma be a scoring matrix. Then, Algorithm 5 computes v[A]v[A] correctly,

  1. (i)

    in O(k^{2}n^{2})-time such that v{\mathrm{SP}_{\!\gamma}}[A]\leq 6\cdot\mathrm{opt}{\mathrm{SP}_{\!\gamma}}(S), if v=v{\mathrm{A}_{\gamma}} and \gamma\in\mathbb{M}^{\mathrm{W}},

  2. (ii)

    in O(k2n3)O(k^{2}n^{3})-time such that Vγ2[A]12optNSPγ2(S){\mathrm{V}^{2}_{\!\gamma}}[A]\leq 12\cdot\mathrm{opt}{\mathrm{NSP}^{2}_{\!\gamma}}(S), if v=vNγv=v{\mathrm{N}_{\gamma}} and γ𝕄N\gamma\in\mathbb{M}^{\mathrm{N}} ,

where AA is the alignment of SS computed by the algorithm.

Proof.

Clearly, the value returned by Algorithm 5 is the score of an alignment of S.

We now show that the approximation factor is as claimed. Let c be the center of the stars X and Y found in the first two steps of Algorithm 5. Notice that

\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}\big(v[A_{\{h,c\}}]+v[A_{\{c,i\}}]\big) = (k-1)\cdot cStar(Y) (28)
\leq 3\cdot(k-1)\cdot cStar(X) (29)
\leq 3\cdot(k-1)\cdot\frac{2}{k}\cdot\mathrm{OPT}(S) = 6\cdot\frac{k-1}{k}\cdot\mathrm{OPT}(S) \leq 6\cdot\mathrm{OPT}(S)\,, (30)

where the equality (28) follows since c is the center of the star Y, which is compatible with the alignment A; (29) follows from Lemma 5.3; and (30) follows from Lemma 5.1.

Suppose then that v=vAγv=v{\mathrm{A}_{\gamma}} and γ𝕄W\gamma\in\mathbb{M}^{\mathrm{W}}. Thus,

vSPγ[A]\displaystyle v{\mathrm{SP}_{\!\gamma}}[A] =h=1k1i=h+1kvAγ[A{h,i}]\displaystyle=\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}v{\mathrm{A}_{\gamma}}[A_{\{h,i\}}]
h=1k1i=h+1k(vAγ[A{h,c}]+vAγ[A{c,i}])6optSPγ(S),\displaystyle\leq\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}\Big{(}v{\mathrm{A}_{\gamma}}[A_{\{h,c\}}]+v{\mathrm{A}_{\gamma}}[A_{\{c,i\}}]\Big{)}\leq 6\cdot\mathrm{opt}{\mathrm{SP}_{\!\gamma}}(S)\,,

where the first inequality follows from Lemma 5.5 and the second follows from Equation (30).

Suppose now that v=vNγv=v{\mathrm{N}_{\gamma}} and γ𝕄N\gamma\in\mathbb{M}^{\mathrm{N}}. Thus,

{\mathrm{V}^{2}_{\!\gamma}}[A] = \sum_{h=1}^{k-1}\sum_{i=h+1}^{k} v{\mathrm{N}_{\gamma}}[A_{\{h,i\}}]
\leq 2\cdot\sum_{h=1}^{k-1}\sum_{i=h+1}^{k}\big(v{\mathrm{N}_{\gamma}}[A_{\{h,c\}}]+v{\mathrm{N}_{\gamma}}[A_{\{c,i\}}]\big)
\leq 2\cdot 6\cdot\mathrm{opt}{\mathrm{NSP}^{2}_{\!\gamma}}(S) = 12\cdot\mathrm{opt}{\mathrm{NSP}^{2}_{\!\gamma}}(S)\,,

where, similarly, the first inequality follows from Lemma 5.5 and the second follows from Equation (30).

The time required to find an optimal vv-star is the time to compute the pairwise alignments of SS, which is (k2)O(n2){k\choose 2}O(n^{2}) if v=vAγv=v{\mathrm{A}_{\gamma}} and it is (k2)O(n3){k\choose 2}O(n^{3}) if v=vNγv=v{\mathrm{N}_{\gamma}}. Additionally, we have to consider the time to determine the optimal star, which is O(k2)O(k^{2}), implying that the time required to compute line 1 of Algorithm 5 is (k2)O(n2)+O(k2)=O(k2n2){k\choose 2}O(n^{2})+O(k^{2})=O(k^{2}n^{2}) if v=vAγv=v{\mathrm{A}_{\gamma}} and (k2)O(n3)+O(k2)=O(k2n3){k\choose 2}O(n^{3})+O(k^{2})=O(k^{2}n^{3}) if v=vNγv=v{\mathrm{N}_{\gamma}}. The time spent to compute lines 2 and 3 are O(kn)O(kn) and O(k2n)O(k^{2}n), respectively, and to compute line 4 is O(k3n)O(k^{3}n), since we have to compute the score of (k2)=O(k2){k\choose 2}=O(k^{2}) pairwise alignments of length O(kn)O(kn). Therefore, the total time spent by the algorithm is O(k2n2+k3n)O(k^{2}n^{2}+k^{3}n) if v=vAγv=v{\mathrm{A}_{\gamma}} and O(k2n3+k3n)O(k^{2}n^{3}+k^{3}n) if v=vNγv=v{\mathrm{N}_{\gamma}}. ∎

6 Conclusion and future work

We presented and discussed multiple aspects of normalized multiple sequence alignment (NMSA). We defined three new criteria for computing normalized scores when aligning multiple sequences, showing the NP-hardness and exact algorithms for solving \mathbf{NMSA}\text{-}z for each criterion z=1,2,3. In addition, we adapted an existing 2-approximation algorithm for \mathbf{MSA} when the scoring matrix \gamma is in the classical class \mathbb{M}^{\mathrm{C}}, leading to a 6-approximation algorithm for \mathbf{MSA} when \gamma is in the broader class \mathbb{M}^{\mathrm{W}}\supseteq\mathbb{M}^{\mathrm{C}} and to a 12-approximation algorithm for \mathbf{NMSA}\text{-}2 when \gamma is in \mathbb{M}^{\mathrm{N}}\subseteq\mathbb{M}^{\mathrm{W}}, a slightly more restricted class such that the cost of a deletion for any symbol is at most twice the cost for any other. We summarize these contributions in Table 1.

problem | time of exact algorithm | time of approximation algorithm
MSA | | O(k^{2}n^{2}+k^{3}n)
NMSA-1 | O(2^{k}k^{3}(n+1)^{k+1}) |
NMSA-2 | O\left(\left(1+\frac{1}{2n+1}\right)^{k}(2n+1)^{k^{2}}k^{2}\right) | O(k^{2}n^{2}+k^{3}n)
NMSA-3 | O(2^{k}k^{4}(n+1)^{k+1}) |
Table 1: We consider a k-sequence in which each sequence has length at most n; the decision versions of all the problems are NP-complete.

This work is an effort to expand the boundaries of multiple sequence alignment algorithms towards normalization, an unexplored domain that can produce results with higher accuracy in some applications. In future work, we will implement our algorithms in order to determine the sequence lengths they are able to handle in practice. We also plan to perform practical experiments, measuring how well the alignments provided by our algorithms and by other MSA algorithms agree with multiple alignment benchmarks. In addition, we intend to measure the accuracy of phylogenetic tree reconstruction based on our alignments for simulated and real genomes. Finally, we will work on heuristics and parallel versions of our algorithms in order to process large datasets faster.

References

  • [AE99] A. N. Arslan and Ö. Egecioglu. An efficient uniform-cost normalized edit distance algorithm. In Proc. of SPIRE, pages 9–15. IEEE, 1999.
  • [AG97] A. Apostolico and Z. Galil. Pattern Matching Algorithms. Oxford University Press, 1997.
  • [AKO10] A. Andoni, R. Krauthgamer, and K. Onak. Polylogarithmic approximation for edit distance and the asymmetric query complexity. In Proc. of FOCS, pages 377–386. IEEE, 2010.
  • [AS06] E. Araujo and J. Soares. Scoring matrices that induce metrics on sequences. In Proc. of LATIN, pages 68–79, 2006.
  • [AT00] A. Abbott and A. Tsay. Sequence analysis and optimal matching methods in sociology: Review and prospect. Sociol Method Res, 29(1):3–33, 2000.
  • [BI18] A. Backurs and P. Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). SIAM J Comput, 47(3):1087–1097, 2018.
  • [BL02] R. Barzilay and L. Lee. Bootstrapping lexical choice via multiple-sequence alignment. In Proc. of EMNLP, pages 164–171, USA, 2002. Association for Computational Linguistics.
  • [CB99] J. A. Cuff and G. J. Barton. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins, 34(4):508–519, 1999.
  • [CDG+20] D. Chakraborty, D. Das, E. Goldenberg, M. Kouckỳ, and M. Saks. Approximating edit distance within constant factor in truly sub-quadratic time. J ACM, 67(6):1–22, 2020.
  • [Chv80] V. Chvátal. Recognizing intersection patterns. Ann Discrete Math, 8:249–251, 1980.
  • [CL88] H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM J Appl Math, 48(5):1073–1082, 1988.
  • [CLZU03] M. Crochemore, G. M. Landau, and M. Ziv-Ukelson. A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J Comput, 32(6):1654–1673, 2003.
  • [Eli06] I. Elias. Settling the intractability of multiple alignment. J Comput Biol, 13(7):1323–1339, 2006.
  • [FD87] D.-F. Feng and R. F. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol, 25(4):351–360, 1987.
  • [Gus93] D. Gusfield. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bull Math Biol, 55(1):141–154, 1993.
  • [HAR09] W. Haque, A. Aravind, and B. Reddy. Pairwise sequence alignment algorithms: A survey. In Proc. of ISTA, pages 96–103. ACM Press, 2009.
  • [HTHI95] M. Hirosawa, Y. Totoki, M. Hoshida, and M. Ishikawa. Comprehensive study on iterative algorithms of multiple sequence alignment. Bioinformatics, 11(1):13–18, 1995.
  • [Lev66] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Dokl. Akad. Nauk., 10(8):707–710, 1966.
  • [MP80] W. J. Masek and M. S. Paterson. A faster algorithm computing string edit distances. J Comput Syst Sci, 20(1):18–31, 1980.
  • [MV93] A. Marzal and E. Vidal. Computation of normalized edit distance and applications. IEEE T Pattern Anal, 15(9):926–932, 1993.
  • [NW70] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 48(3):443–453, 1970.
  • [OR06] T. H. Ogden and M. S. Rosenberg. Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol, 55(2):314–328, 2006.
  • [Sel74] P. H. Sellers. On the theory and computation of evolutionary distances. SIAM J Appl Math, 26(4):787–793, 1974.
  • [SH14] F. Sievers and D. G. Higgins. Clustal Omega. Curr Protoc Bioinfo, 48(1):3.13.1–3.13.16, 2014.
  • [SWD+11] F. Sievers, A. Wilm, D. Dineen, T. J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J. Söding, J. D. Thompson, and D. G. Higgins. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol, 7(1):539, 2011.
  • [TPP99] J. D. Thompson, F. Plewniak, and O. Poch. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res, 27(13):2682–2690, 1999.
  • [VMA95] E. Vidal, A. Marzal, and P. Aibar. Fast computation of normalized edit distances. IEEE T Pattern Anal, 17(9):899–902, 1995.
  • [WLXZ15] X.-D. Wang, J.-X. Liu, Y. Xu, and J. Zhang. A survey of multiple sequence alignment techniques. In Proc. of ICIC, pages 529–538. Springer, 2015.
  • [WOHN06] I. M. Wallace, O. O’Sullivan, D. G. Higgins, and C. Notredame. M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res, 34(6):1692–1699, 2006.