Scheduling aspects in keyword extraction problem

Date: 01 March 2018
Authors: Jan Węglarz, Michał Zimniewicz, Krzysztof Kurowski
Published date: 01 March 2018
DOI: http://doi.org/10.1111/itor.12368
Intl. Trans. in Op. Res. 25 (2018) 507–522
DOI: 10.1111/itor.12368
International Transactions in Operational Research
Michał Zimniewicz (a), Krzysztof Kurowski (a) and Jan Węglarz (b)
(a) Poznan Supercomputing and Networking Center, Institute of Bioorganic Chemistry, Polish Academy of Sciences,
Poznan, Poland
(b) Institute of Computing Science, Poznan University of Technology, Poznan, Poland
E-mail: Michal.Zimniewicz@man.poznan.pl [Zimniewicz]; Krzysztof.Kurowski@man.poznan.pl [Kurowski];
Jan.Weglarz@cs.put.poznan.pl [Węglarz]
Received 18 October 2016; accepted 23 October 2016
Abstract
The amount of big data collected during human–computer interactions requires natural language processing
(NLP) applications to be executed efficiently, especially in parallel computing environments. Scalability and
performance are critical in many NLP applications such as search engines or web indexers. However, there
is a lack of mathematical models helping users to design and apply scheduling theory for NLP approaches.
Moreover, many researchers and software architects reported various difficulties related to common NLP
benchmarks. Therefore, this paper aims to introduce and demonstrate how to apply a scheduling model for
a class of keyword extraction approaches. Additionally, we propose methods for the overall performance
evaluation of different algorithms, which are based on processing time and correctness (quality) of answers.
Finally, we present a set of experiments performed in different computing environments together with obtained
results that can be used as reference benchmarks for further research in the field.
Keywords: keyword extraction; scheduling model; natural language processing; evaluation; big data; parallel computing
1. Introduction
Natural language processing (NLP) is a field on the borderline between linguistics and artificial
intelligence. In the past, the processing of a sentence by an NLP application could last even up to
several minutes (Cambria and White, 2014), and it was still acceptable. Nowadays, the amount of
big data, especially the user-generated content spread over the Internet, to be processed by search
engines, web indexers, or other information retrieval solutions requires that NLP applications
generate answers or results in a relatively short processing time. In fact, for many NLP applications
the overall performance, near real-time, is even more important than accuracy of processing results
as end users are more interested in interactive applications and search engines (Banko and Brill,
2001; Norvig, 2007).
© 2017 The Authors.
International Transactions in Operational Research © 2017 International Federation of Operational Research Societies
Published by John Wiley & Sons Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main St, Malden, MA 02148, USA.
There is an urgent need to address many challenging research problems related to this field, in
particular new approaches in computational linguistics that could benefit from scheduling theory.
Therefore, this paper attempts to pave the way for interdisciplinary research by introducing a new
scheduling model for the well-known NLP problem—keyword extraction.
The number of different approaches to solve the keyword extraction problem is constantly growing.
However, it is still difficult to select the best approach as it is not trivial to compare and assess
the overall performance of different algorithms. Therefore, we propose a new evaluation criterion
based on the quality of answers given by a keyword extraction algorithm. Moreover, we show how
to apply the scheduling model, and consequently consider a new performance measure based on
schedule length.
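
To make the notion of schedule length concrete, the following minimal Python sketch computes the makespan of a set of independent document-processing tasks on identical parallel workers. It is only an illustration under stated assumptions, not the scheduling model introduced in Section 3: the function name `schedule_length`, the longest-task-first greedy rule, and the sample processing times are all hypothetical.

```python
import heapq


def schedule_length(processing_times, num_workers):
    """Return the makespan of a greedy schedule on identical workers.

    Assumption (not from the paper): tasks are independent and
    non-preemptive, and each task goes to the currently least loaded
    worker, with longer tasks assigned first.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # A min-heap of worker loads; the root is the least loaded worker.
    loads = [0.0] * num_workers
    heapq.heapify(loads)
    for p in sorted(processing_times, reverse=True):
        least_loaded = heapq.heappop(loads)
        heapq.heappush(loads, least_loaded + p)
    # The schedule length is the finishing time of the busiest worker.
    return max(loads)


# Hypothetical per-document processing times, in seconds.
times = [4, 3, 2, 2, 1]
print(schedule_length(times, 2))
print(schedule_length(times, 1))
```

With a single worker the schedule length equals the sum of all processing times, so comparing the two calls shows how much a second worker shortens the schedule for this particular input.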
The rest of the paper is organized as follows. Section 2 presents basic assumptions about the
keyword extraction problem and related work. Section 3 defines a formal description of the keyword
extraction problem and introduces the scheduling model together with specific characteristics and
examples. Section 4 describes in detail the proposed assessment and evaluation methods. Experiments
and obtained results are discussed in Section 5. Section 6 contains final remarks and defines some
directions for future work.
2. Keyword extraction
2.1. Problem description
In a nutshell, the aim of the keyword extraction problem is to identify terms, words, or phrases—
keywords—that describe the subjects of a natural language document in the best possible way.
Keywords of a text document are representative words that give a human reader a concise summary or
theme overview of the content of the document. The large amount of textual information available on
the Internet, mostly unstructured and without any semantic description, requires the development
of tools that help end users efficiently process such amounts of data and quickly search, classify,
and assess natural language documents according to their needs. Keyword extraction
is a core underlying task of many NLP tools, as it is often an important step in text mining
applications, for instance, relevance assessment, index generation, query refinement, document
categorization, named-entity recognition, document similarity measurements, text summarization,
and so on (Beliga, 2014; Siddiqi and Sharan, 2015).
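
As a rough illustration of the task, the sketch below ranks the words of a document by raw frequency after removing a few stop words. This is a deliberately naive baseline and not one of the approaches evaluated in this paper; the function name `extract_keywords`, the tiny stop-word list, and the sample document are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption, not from the paper).
STOP_WORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "for",
    "on", "that", "it", "as", "by", "with", "are",
}


def extract_keywords(text, top_k=3):
    """Return the top_k most frequent non-stop-word tokens of text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        t for t in tokens if t not in STOP_WORDS and len(t) > 2
    )
    # Sort by descending count, then alphabetically, so ties are
    # resolved deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


doc = (
    "Scheduling theory studies scheduling of tasks on machines. "
    "Tasks compete for machines, and scheduling decides the order of tasks."
)
print(extract_keywords(doc))
```

Pure frequency counting treats every surface form as a separate word, which is one reason the linguistic notions defined next matter for practical approaches.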
Therefore, in this paper we have selected the keyword extraction problem due to its complexity
and also as a good example from a wide range of NLP applications or their relevant parts.
The following terms are important for the problem description:
- Lemma: the canonical form of a set of words, the glossary form. For instance, run is the lemma of
  run, runs, and ran (Radziszewski, 2013).
- Lexeme: an abstract unit of lexical meaning, which is shared among the set of all forms taken
  by a word with specified meaning. For instance, words book and books are forms of the same
  lexeme; words am, be, and was are also forms of the same lexeme. All forms of a lexeme have
  a common canonical form, a lemma. However, all words having a common lemma are not
  necessarily forms of the same lexeme, because they might have different meanings. As a lexeme is