Minimum common string partition: on solving large‐scale problem instances

AuthorChristian Blum
Published date01 January 2020
Date01 January 2020
DOIhttp://doi.org/10.1111/itor.12603
Intl. Trans. in Op. Res. 27 (2020) 91–111
DOI: 10.1111/itor.12603
INTERNATIONAL
TRANSACTIONS
IN OPERATIONAL
RESEARCH
Minimum common string partition: on solving large-scale
problem instances
Christian Blum
Artificial Intelligence ResearchInstitute (IIIA-CSIC), Campus of the UAB,
Carrer de Can Planas, 08193 Bellaterra, Spain
E-mail: christian.blum@iiia.csic.es [Blum]
Received 8 February2018; received in revised form 9 July 2018; accepted 20 September 2018
Abstract
Minimum common string partition is an NP-hard combinatorial optimization problem from the bioinfor-
matics field. The current state-of-the-art algorithm is a hybrid technique known as construct, merge, solve,
and adapt (CMSA). This algorithm combines two main algorithmic components: generating solutions in a
probabilistic way and solving reduced subinstances obtained from the tackled problem instances, if possible,
to optimality. However, the CMSA algorithm was not intended for application to very large problem in-
stances. Therefore, in this paper wepresent a technique that makes CMSA, and other availablealgorithms for
this problem, applicable to problem instances that are about one order of magnitude larger than the largest
problem instances considered so far. Moreover, a reduced variable neighborhood search (RVNS) for solving
the tackled problem, based on integer programming, is introduced. The experimental results show that the
modified CMSA algorithm is very strong forproblem instances based on rather small alphabets.With growing
alphabet size, it turns out that RVNS has a growing advantage over CMSA.
Keywords:minimum common string partition; large-scale problem instances; reduced variable neighborhood search
1. Introduction
Optimization problemsbased on strings are abundant in research fields such as bioinformatics (Gus-
field, 1997; Blum and Festa, 2016) and text processing (Manning et al., 2008). In the context of
computer science, a string sis a datatype for representing and storing sequence information in terms
of a finite sequence of characters from a (generally finite) alphabet . Each string shas a length
denoted by |s|. Words, and even whole texts, may be stored by means of strings. Strings also play an
important role in bioinformatics because most of the genetic instructions involved in the growth,
development, functioning, and reproduction of living organisms are stored in deoxyribonucleic
acid (DNA) molecules, which can be represented by strings over the alphabet ={A,C,T,G}.
C
2018 The Authors.
International Transactionsin Operational Research C
2018 International Federation ofOperational Research Societies
Published by John Wiley & Sons Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main St, Malden, MA02148,
USA.
92 C. Blum / Intl. Trans. in Op. Res. 27 (2020) 91–111
Similarly, most proteins can be stored as strings of letters from an alphabet of size 20, representing
the 20 standard amino acids that are the building blocks of most proteins.
The minimum common string partition (MCSP) problem is a string-based combinatorial op-
timization problem from the bioinformatics field. The problem input consists of two strings, s1
and s2. Both input strings are composed of letters from a finite alphabet . Moreover, they ful-
fill the property of being related, which means that each letter of has the same number of
occurrences in each input string. The property of being related implies that s1and s2have the
same length n,thatis,|s1|=|s2|=n. A candidate solution to the MCSP problem is obtained by
cutting s1, respectively, s2, into pieces, resulting in a set P1, respectively, P2, of nonoverlapping
substrings. A candidate solution (P1,P2)is a valid solution, if P1=P2. The optimization goal
consists in finding a valid solution (P1,P2)that minimizes |P1|.1As an example, consider the fol-
lowing two DNA sequences: s1=AAGACTG and s2=ACTAGGA. Counting the number of the
occurrences of all letters in both strings, it is easy to confirm that these two strings are related.
A trivial valid solution can be obtained for any problem instance by cutting both input strings
into substrings of length 1. In the case of the above-mentioned example, the trivial valid solution
is P1=P2={A,A,A,C,T,G,G}, with an objective function value of 7. Note that the optimal
solution in this example is P1=P2={ACT,AG ,G,A}, with objective function value 4. The MCSP
problem, as pointed out by Chen et al. (2005), is closelyrelated to the problem of sorting by reversals
with duplicates, whichis one of the key problems in genome rearrangement. It has been shown to be
NP-hard even in very restrictive cases (Goldstein et al., 2005). Moreover, Jiang et al. (2012) proved
the NP-completeness of the decision version of the MCSPcproblem—with cbeing the alphabet
size—for c2.
1.1. Existing works
Research work on the MCSP began with the development of approximation algorithms (see, e.g.,
Shapira and Storer, 2002; Chrobak et al., 2004; Kolman, 2005; Cormode and Muthukrishnan,
2007; Kolman and Wale ´
n, 2007). More recently, Goldstein and Lewenstein (2011) proposed a
greedy algorithm that runs in O(n)time. Another greedy algorithm obtaining better results (on
average) was presented in He (2007). The first metaheuristics proposed in the related literature for
the MCSP problem were (a) the Min-Max Ant System from Ferdous and Sohel Rahman (2013,
2017) and (b) the probabilistictree search algorithm from Blum et al. (2014). Both works used a range
of artificial and real DNA instances from Ferdous and Sohel Rahman (2013) for the experimental
evaluation. However,it was discovered that the best results,at least for the smaller problem instances
used in Ferdous and Sohel Rahman (2013), are obtained by integer linear programming (ILP)
techniques. The first ILP model was proposedin Blum et al. (2015), together with a deterministic 2-
phase heuristic based on the ILP model. Later on, improved ILP models were presented in Ferdous
and Sohel Rahman (2015) and Blum and Raidl (2016). The current state-of-the-art technique is a
construct, merge,solve, and adapt (CMSA) approach from Blum et al. (2016). This hybrid technique
sequentially applies an ILP solver to a reduced subinstance of the original problem instance. The
CMSA variant from Blum et al. (2016) is based on the ILP model from Blum et al. (2015).
1Note that minimizing |P1|issimilar to minimizing |P2|.
C
2018 The Authors.
International Transactionsin Operational Research C
2018 International Federation ofOperational Research Societies

To continue reading

Request your trial

VLEX uses login cookies to provide you with a better browsing experience. If you click on 'Accept' or continue browsing this site we consider that you accept our cookie policy. ACCEPT