Friday, 19 December 2008

ArXiv Plugin

We have developed an EPrints arXiv import plugin based on the PubMedID and PubMedXML plugins. It imports the following metadata from arXiv: item type, article title, abstract, authors’ names, journal title, volume, issue, page number(s), date and DOI. It also extracts links to the full-text PDF and the arXiv abstract page (entered in the Official links field). The data is pulled from arXiv using the arXiv API.

How does it work?
First, the article needs to be identified uniquely by its arXiv ID. The arXiv API (http://export.arxiv.org/api/query?id_list=arxiv_id) is then used to retrieve an XML file containing all the metadata associated with that ID. The retrieval is done by ArXivID.pm, and the retrieved XML file is parsed by ArXivXML.pm. The most difficult aspect of writing this plugin was parsing the “journal_ref” field from the arXiv XML file, because it is a free-text field whose journal reference information is not in a consistent format. To overcome this we installed Biblio::Citation::Parser::Jiao, which extracts the journal title, volume number, issue number, date and ISSN from a given journal reference. Biblio::Citation::Parser::Jiao can be installed from CPAN.
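The retrieval and parsing steps described above can be sketched as follows. This is a minimal illustration in Python, not the plugin's actual Perl code: the function names fetch_arxiv_xml and parse_arxiv_entry and the keys of the returned dictionary are our own assumptions, and the element names reflect the Atom feed that the arXiv API returns. The journal_ref value is passed through as raw text, since that is the part that would be handed to a citation parser.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

API_URL = "http://export.arxiv.org/api/query"

# XML namespaces used in the Atom feed returned by the arXiv API.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def fetch_arxiv_xml(arxiv_id):
    """Download the raw XML (bytes) for one arXiv ID (hypothetical helper)."""
    url = API_URL + "?" + urllib.parse.urlencode({"id_list": arxiv_id})
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def parse_arxiv_entry(xml_data):
    """Extract basic metadata from the first <entry>; accepts str or bytes."""
    root = ET.fromstring(xml_data)
    entry = root.find("atom:entry", NS)
    if entry is None:
        return None

    def text(path):
        value = entry.findtext(path, default=None, namespaces=NS)
        # Collapse the line breaks and indentation found in feed text.
        return " ".join(value.split()) if value else None

    links = entry.findall("atom:link", NS)
    pdf_url = next(
        (l.get("href") for l in links if l.get("title") == "pdf"), None)
    abs_url = next(
        (l.get("href") for l in links if l.get("rel") == "alternate"), None)

    return {
        "title": text("atom:title"),
        "abstract": text("atom:summary"),
        "authors": [
            " ".join(name.text.split())
            for name in entry.findall("atom:author/atom:name", NS)
            if name.text
        ],
        "date": text("atom:published"),
        "doi": text("arxiv:doi"),
        # Free text in no fixed format; left unparsed on purpose.
        "journal_ref": text("arxiv:journal_ref"),
        "pdf_url": pdf_url,
        "abstract_url": abs_url,
    }
```

Calling parse_arxiv_entry(fetch_arxiv_xml(some_id)) returns a dictionary of fields, or None if the feed contains no entry. Keeping the download separate from the parsing mirrors the split between ArXivID.pm and ArXivXML.pm and lets the parser be tried on a saved XML file.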


PubMed Bug Fix
The PubMedID import plugin was designed to import multiple IDs. However, this did not appear to be working, and only one item could be imported at a time. The cause was a bug in the code, which did not remove the HTML-encoded end-of-line characters. With the end-of-line characters removed, the PubMed import plugin works well and can import multiple IDs, either from a file or from the text box.
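The idea behind the fix can be sketched in a few lines. This is a Python illustration rather than the actual Perl change, and it rests on an assumption: the post does not say exactly how the end-of-line characters were encoded, so the sketch treats the numeric references for carriage return and line feed (&#13;, &#10; and their hexadecimal forms) as separators alongside ordinary whitespace and commas. The name split_ids is hypothetical.

```python
import re

# Assumed separators: HTML numeric references for CR/LF (decimal or hex),
# plus plain whitespace and commas.
_SEPARATORS = re.compile(r"(?:&#(?:13|10);|&#[xX]0?[aAdD];|[\s,])+")


def split_ids(raw):
    """Split submitted text into individual IDs, dropping empty pieces."""
    return [token for token in _SEPARATORS.split(raw) if token]
```

With the encoded line endings treated as separators, input pasted into the text box or read from a file yields one clean ID per item instead of a single string that fails to match any record.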
