M. Ramšak1, B. Kaučič2, M. Marolt1

1University of Ljubljana, Faculty of Computer and Information Science (SLOVENIA)
2University of Ljubljana, Faculty of Education (SLOVENIA)
The number of electronic resources grows daily, and with it the number of resource collections, digital libraries, and repositories. Their quality depends, among other factors, on how well the resources are indexed and on how similar words, synonyms, etc. are handled by the search algorithm. The basis for this is a set of appropriate keywords (sometimes referred to as keyphrases), whose extraction is one of the tasks in resource management. Beyond indexing, keywords have many additional useful applications.
Several keyword-extraction algorithms and tools have been reported in the literature. They can broadly be divided into approaches based on natural language processing, approaches based on machine learning, and combinations of the two. The output of an algorithm is a set of top-ranked keyword candidates. In general, these algorithms work in two phases: the first phase prepares a list of keyword candidates, and the second phase cleans and orders that list based on keyword features. The efficiency of the second phase therefore depends on the efficiency of the first. In the first phase, many algorithms use phrase boundaries as one of the filters limiting the number of keyword candidates.
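The two-phase scheme above can be illustrated with a minimal sketch. This is not Kea's actual implementation (Kea uses TF×IDF and position features with a naive Bayes model); the stopword list, boundary characters, and frequency-based ranking here are simplifying assumptions chosen only to show how phrase boundaries limit the candidate set before ranking.

```python
import re
from collections import Counter

# Assumed, illustrative stopword list; real tools use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is",
             "for", "on", "are", "then"}

def candidate_phrases(text, max_len=3):
    """Phase 1: split the text at phrase boundaries (punctuation and
    stopwords) and emit word n-grams up to max_len as candidates."""
    # Punctuation acts as a hard phrase boundary.
    chunks = re.split(r"[.,;:!?()\[\]\"]+", text.lower())
    candidates = []
    for chunk in chunks:
        run = []
        # Stopwords break each chunk into shorter runs of content words.
        for word in chunk.split() + [None]:
            if word is None or word in STOPWORDS:
                for i in range(len(run)):
                    for j in range(i + 1, min(i + max_len, len(run)) + 1):
                        candidates.append(" ".join(run[i:j]))
                run = []
            else:
                run.append(word)
    return candidates

def rank_candidates(text, top_n=5):
    """Phase 2: clean and order the candidate list; plain frequency
    stands in for Kea's feature-based scoring."""
    counts = Counter(candidate_phrases(text))
    return [phrase for phrase, _ in counts.most_common(top_n)]

text = ("Keyword extraction selects keyword candidates; "
        "keyword candidates are then ranked.")
print(rank_candidates(text, top_n=3))
```

Because every punctuation mark and stopword closes a phrase, the candidate list stays far smaller than the full set of n-grams, which is exactly why phrase boundaries are used as a filter in the first phase.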

In this paper, we consider several file converters and examine how their output influences keyword extraction. In addition, we observe the influence of consolidated text, i.e. text with supplementary material removed. For keyword extraction, the freely available Kea tool is used. The evaluation is performed on a collection of PDF documents, and the extraction results are compared across file converters and consolidated texts, as well as against keywords manually assigned by the documents' authors, using the information retrieval metrics precision and recall.
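The comparison against author-assigned keywords can be sketched as follows. The exact string matching used here is an assumption (an evaluation may instead match stemmed or case-folded variants); the keyword lists are invented for illustration only.

```python
def precision_recall(extracted, reference):
    """Compare extracted keywords with author-assigned ones.
    Precision = correct / |extracted|, recall = correct / |reference|."""
    # Case-folding is an assumed normalization step.
    ext = {k.lower() for k in extracted}
    ref = {k.lower() for k in reference}
    correct = len(ext & ref)
    precision = correct / len(ext) if ext else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall

# Hypothetical example: 2 of 4 extracted keywords match the 4 author keywords.
author_keywords = ["keyword extraction", "digital libraries", "Kea", "PDF"]
kea_output = ["keyword extraction", "kea", "file converters", "text"]
p, r = precision_recall(kea_output, author_keywords)
print(p, r)  # 2/4 = 0.5 precision, 2/4 = 0.5 recall
```

Averaging these two values over the whole document collection gives a single score per file-converter configuration, which is how the configurations can then be compared.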