The overall parsing results varied greatly, with F1 ranging from 0.27 for Citation-Parser to 0.89 for GROBID. Our results also showed that different tools have different strengths and weaknesses. For example, ParsCit is ranked 3rd overall but is best at extracting author names. Science Parse, ranked 4th overall, is best at extracting the year. These results suggest that there is no single best parser. Instead, different parsers might give the best results for different metadata types and different reference strings. Consequently, we hypothesize that if we were able to accurately choose the best parser for a given scenario, the overall quality of the results should increase. This can be seen as a typical recommendation problem: a user (e.g. a software developer or a researcher) needs the item (a reference parser) that best satisfies the user's needs (high quality of the metadata fields extracted from reference strings).

In this paper we propose ParsRec, a novel meta-learning recommender system for bibliographic reference parsers. ParsRec takes a reference string as input, identifies the potentially best reference parser(s), applies the chosen parser(s), and outputs the metadata fields. ParsRec is built upon the ten open-source parsers mentioned before, and uses supervised machine learning to recommend the best parser(s) for the input reference string. The novel aspects of ParsRec are: 1) treating reference parsing as a recommendation problem, and 2) using a meta-learning-based hybrid approach for reference parsing. This paper is an extended version of a poster published at the 12th ACM Conference on Recommender Systems (RecSys 2018).

Reference parsers often use regular expressions, hand-crafted rules, and template matching (Biblio, Citation, Citation-Parser, PDFSSA4MET, and BibPro).
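The recommendation framing described above can be sketched in a few lines. This is an illustrative stand-in, not the ParsRec implementation: the real system uses supervised machine learning over ten parsers, whereas here a tiny 1-nearest-neighbour "meta-learner" over character-bigram profiles recommends a parser, and the training labels are invented for illustration.

```python
# Hypothetical sketch of the meta-learning recommendation idea:
# recommend, per reference string, the parser expected to perform best.
from collections import Counter

def bigram_profile(text):
    """Character-bigram counts, a crude proxy for citation-style features."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def similarity(a, b):
    """Overlap between two bigram profiles."""
    return sum((a & b).values())

# Toy "training" data: reference strings labelled with the parser that
# (hypothetically) extracted their metadata best.
TRAINING = [
    ("Smith, J. (2010). A study of parsing. Journal of X, 1(2), 3-4.", "GROBID"),
    ('J. Doe, "Neural methods," in Proc. ACL, 2015, pp. 10-20.', "ParsCit"),
    ("Brown A, Green B. Parsing at scale. Nature. 2018;5:100-110.", "Science Parse"),
]
PROFILES = [(bigram_profile(ref), parser) for ref, parser in TRAINING]

def recommend_parser(reference):
    """Return the parser whose training example is most similar in style."""
    query = bigram_profile(reference)
    return max(PROFILES, key=lambda p: similarity(query, p[0]))[1]
```

A real meta-learner would replace the bigram nearest-neighbour with a trained classifier and richer features, and could recommend different parsers per metadata field rather than per string.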
Typically, the most effective approach to reference parsing is supervised machine learning, such as Conditional Random Fields (ParsCit, GROBID, CERMINE, Anystyle-Parser, Reference Tagger, and Science Parse), or Recurrent Neural Networks combined with Conditional Random Fields (Neural ParsCit). To the best of our knowledge, all open-source reference parsers are based on a single technique; none of them uses ensemble, hybrid, or meta-learning techniques. Some reference parsers are part of larger systems for information extraction from scientific papers. These systems automatically extract machine-readable information, such as metadata, bibliography, logical structure, or full text, from unstructured documents. Examples include PDFX, ParsCit, GROBID, CERMINE, Icecite, and TeamBeam. Meta-learning is a technique often applied to the problem of algorithm selection.
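The sequence-labelling framing used by the CRF-based tools above can be illustrated briefly: every token in the reference string receives a field label (AUTHOR, YEAR, TITLE, ...). The minimal sketch below replaces the learned CRF with a hand-written state machine purely to show the framing and its sequential dependencies; it is an assumption-laden toy, not how GROBID or ParsCit actually label tokens.

```python
# Toy token labeller mimicking the sequential dependencies a CRF would
# learn: authors come first, then the year, then the title (a gross
# simplification of real reference layouts).
import re

def label_reference(tokens):
    """Assign one field label per token of a tokenised reference string."""
    labels, seen_year = [], False
    for tok in tokens:
        # A 4-digit year, optionally wrapped in parentheses, e.g. "(2010)."
        if re.fullmatch(r"\(?(19|20)\d\d\)?\.?", tok):
            labels.append("YEAR")
            seen_year = True
        elif not seen_year:
            labels.append("AUTHOR")   # everything before the year
        else:
            labels.append("TITLE")    # everything after the year
    return labels
```

Where this sketch hard-codes the ordering, a CRF learns transition and emission weights from labelled data, which is what lets tools like GROBID handle the many real-world citation styles.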