Abstract:
Keyphrases help readers find the right information in digital documents.
Keyphrase assignment is the alignment of a document or text with the keyphrases
of a standard classification taxonomy. Kea++ is a well-known tool for performing
keyphrase assignment automatically; however, it assigns irrelevant terms along
with the relevant ones. To reduce noise in the Kea++ result set, the refinement
methodology defined refinement rules that exploit the semantics
of the hierarchical structure of the taxonomy. This methodology operates as a layer
on top of Kea++. It was evaluated on a computing domain taxonomy and showed better
results than Kea++.
However, the refinement methodology is focused on the computing domain taxonomy
and does not perform well for taxonomies with a deep hierarchy of
keyphrases. The training-level is the hierarchical level of the taxonomy that is adopted in
the manually generated keyphrases for documents in the Kea++ training dataset.
In the refinement methodology, the training-level is the key parameter for selecting
or rejecting any keyphrase in the Kea++ result set. However, its selection process does
not give priority to the taxonomy level at which the most keyphrases are aligned
in the training dataset. Moreover, the methodology does not apply the standard
terminology used in taxonomy languages.
This work aims to extend and generalize the refinement methodology to multiple
domains and to improve its results. In the proposed extended refinement methodology,
the training-level selection process has been revised, and due consideration
has been given to taxonomies with a deep hierarchy of keyphrases. The standard terminology
used in taxonomy languages has been adopted, and the refinement
methodology has been amended accordingly to be practical in multiple domains.
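As an illustration only, the revised training-level selection could be sketched as
choosing the taxonomy level at which the most manually assigned training keyphrases
sit. The function name, the input format (one integer level per training keyphrase,
with 1 as the top of the taxonomy) and the tie-breaking rule are assumptions made for
this sketch; they are not details stated in this abstract.

```python
from collections import Counter


def select_training_level(keyphrase_levels):
    """Return the taxonomy level that holds the most training keyphrases.

    keyphrase_levels: iterable of integer levels, one per manually assigned
    keyphrase in the training dataset (hypothetical input format).
    Ties are broken in favour of the shallower level (an assumption).
    """
    counts = Counter(keyphrase_levels)
    if not counts:
        raise ValueError("no training keyphrases supplied")
    best = max(counts.values())
    # Among the most frequent levels, prefer the one closest to the root.
    return min(level for level, n in counts.items() if n == best)


# Made-up example: three keyphrases at level 3, two at level 2, one at level 4.
print(select_training_level([3, 3, 2, 4, 3, 2]))
```

With these made-up levels the sketch selects level 3, the level where most training
keyphrases are aligned.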
The extended refinement methodology was evaluated on taxonomies and datasets from
three different domains: computing, agriculture and mathematics. The evaluation
metrics were (i) precision, recall and F-measure, (ii) the average number of
keyphrases assigned to test documents and (iii) a statistical t-test. The evaluation
demonstrates a significant improvement in reducing noise in the Kea++ result set
for multiple domains. We conclude that the extended refinement methodology has
been generalized and can be applied in domains other than computing. It has also
shown better results than its predecessor.
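For readers unfamiliar with the first group of metrics, the following minimal sketch
computes precision, recall and F-measure for a single document by comparing the
automatically assigned keyphrases with the manually assigned ones. The function name
and the example keyphrases are hypothetical, and the abstract does not state how
scores are aggregated across documents, so this shows only the standard per-document
definitions.

```python
def precision_recall_f1(assigned, gold):
    """Standard set-based precision, recall and F-measure for one document.

    assigned: keyphrases produced by the system (hypothetical input).
    gold: manually assigned keyphrases for the same document.
    """
    assigned, gold = set(assigned), set(gold)
    true_positives = len(assigned & gold)
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: two of four assigned keyphrases match the three gold ones.
system_output = ["databases", "data mining", "hardware", "networks"]
manual_labels = ["databases", "data mining", "information retrieval"]
p, r, f = precision_recall_f1(system_output, manual_labels)
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

Removing irrelevant keyphrases from the assigned set raises precision as long as the
relevant ones are retained, which is the sense in which noise reduction is measured.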