A project is underway to convert PDF to XML with high accuracy by complementing existing tools with machine learning.

A vast trove of scientific research is locked inside the PDF format. Being able to extract and store this data in a more accessible and reusable format such as XML opens up countless possibilities for the use of this data.

ScienceBeam, a project originally announced by eLife in August 2017, initially investigated using computer vision algorithms to help ‘see’ the structure of a research paper in PDF as a human would. The recognised structure would then be used to assign the correct metadata to the document’s content.

As part of the project, we have been collaborating with other publishers who share this goal and are working on similar tools. A semantic extraction group was formed to discuss the needs of the project and where efforts would be best focused. This resulted in the decision to concentrate the effort on a single tool.

Collaborating with other publishers has also allowed us to collate a broad corpus of PDFs with variations in font, layout and content. Each PDF, together with its XML ‘pair’, has been used to train and test our neural networks to extract accurate metadata.
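As an illustration only, the idea of pairing PDFs with their XML and holding some pairs back for testing could be sketched as follows. This is not ScienceBeam's actual pipeline: the function names `pair_by_stem` and `split_pairs`, the convention that a PDF and its XML share a base file name, and the 20% test fraction are all assumptions made for the example.

```python
import hashlib
from pathlib import PurePosixPath


def pair_by_stem(pdf_paths, xml_paths):
    """Match each PDF to the XML file sharing its base name (an assumed convention)."""
    xml_by_stem = {PurePosixPath(p).stem: p for p in xml_paths}
    pairs = []
    for pdf in sorted(pdf_paths):
        stem = PurePosixPath(pdf).stem
        if stem in xml_by_stem:  # PDFs without an XML pair are skipped
            pairs.append((pdf, xml_by_stem[stem]))
    return pairs


def split_pairs(pairs, test_fraction=0.2):
    """Assign each pair to train or test from a hash of its base name.

    Hashing (rather than random shuffling) keeps each document in the
    same set across runs, even as new documents join the corpus.
    """
    train, test = [], []
    for pdf, xml in pairs:
        digest = hashlib.sha256(PurePosixPath(pdf).stem.encode("utf-8")).digest()
        bucket = digest[0] / 256  # value in [0, 1)
        (test if bucket < test_fraction else train).append((pdf, xml))
    return train, test
```

Keeping the test pairs fixed in this way means accuracy figures stay comparable from one training run to the next, because the model is never evaluated on a document it was trained on.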

Having access to this wide variety of papers and formats has helped our system learn to deduce the structure of a research paper, and ScienceBeam is now successfully being used within Libero Reviewer to lift the title directly from the uploaded manuscript file. Automated data extraction will also be incorporated further into the full workflow, saving as much time for scientists as possible.

The project’s next step is to improve the accuracy of extraction, ahead of further integrations with the Libero suite of tools.

Visit eLife to read more about the background of this project.
