@Science Conferences
M. Kohlhase: "Converting arXiv into XHTML+MathML: an opportunity for blind and partially sighted to access scientific papers"
We describe an experiment in transforming large collections of LaTeX documents into more machine-understandable representations.
Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arXiv) using the LaTeX-to-XML converter LaTeXML, which is currently under development. The main technical task of our arXMLiv project is to supply LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arXiv collection. For this we have developed a distributed build system that repeatedly runs LaTeXML over the arXiv collection, collects statistics about, e.g., the most sorely missing LaTeXML bindings, and clusters common error events. This provides valuable feedback both to the developers of the LaTeXML package and to binding implementers. We have now processed the complete arXiv collection of more than 400,000 documents from 1993 until 2006 (one run is a processor-year-size undertaking) and have continuously improved our success rate to more than 56% (i.e. over 56% of the documents that are LaTeX have been converted without LaTeXML reporting an error and are available as XHTML+MathML documents). Thanks to the availability of readers which enable speech, Braille and large character rendering of XHTML+MathML documents, blind and partially sighted readers will have full access to a large collection of scientific papers.
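The kind of statistics mentioned in the abstract (overall success rate, most frequently missing bindings) can be illustrated with a minimal sketch. It is not the arXMLiv build system: the `ConversionResult` record, its fields, the `summarize` function, and the sample document and package names are all hypothetical, and it assumes each conversion run has already been reduced to a success flag plus a list of missing bindings.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ConversionResult:
    """Outcome of converting one document (hypothetical record layout)."""
    doc_id: str
    succeeded: bool  # True if the converter reported no error
    missing_bindings: list = field(default_factory=list)


def summarize(results):
    """Return (success_rate, bindings ranked by number of affected documents)."""
    total = len(results)
    success_rate = sum(r.succeeded for r in results) / total if total else 0.0

    missing = Counter()
    for r in results:
        # Count each binding once per document, even if it is reported repeatedly.
        missing.update(set(r.missing_bindings))
    return success_rate, missing.most_common()


# Hypothetical sample data; names are placeholders, not real arXiv entries.
sample = [
    ConversionResult("doc-1", True),
    ConversionResult("doc-2", False, ["foo.sty", "bar.cls"]),
    ConversionResult("doc-3", False, ["foo.sty", "foo.sty"]),
    ConversionResult("doc-4", True),
]

rate, ranked = summarize(sample)
print(f"success rate: {rate:.0%}")
for name, count in ranked:
    print(f"{name}: missing in {count} document(s)")
```

Ranking bindings by the number of documents they block, rather than by raw message count, is one plausible way to decide which binding to implement next.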
