High Volume PDF Text Extraction using Python Open-Source Tools

Level:
intermediate
Room:
south hall 2a
Start:
Duration:
30 minutes

Abstract

All major companies hold huge numbers of documents (mostly PDF) that contain important, even critically important, information that no longer exists anywhere else in their data stores.

Reports, once generated for shareholders and legal or financial authorities, may still be useful for developing long-term forecasts or triggering company management decisions.

By definition, documents are intended for human perception, and as such contain unstructured data from an information technology perspective.

Therefore, tools that extract PDF content (mostly, but not only, text) from millions of pages have become important vehicles for recreating structured information.

This presentation covers the "need for speed" of extraction in this Big Data scenario and the need for integration with OCR capabilities, and presents an open-source toolset that combines top-of-the-class performance with maximum extraction detail.

Talk, PyData: Data Engineering

The speaker

Harald Lieder

Master / Diploma in Mathematics (Frankfurt, Germany). Worked for large companies in management positions (technology architecture, data center lead) and for a global consulting firm as programme lead for large-scale technology projects (infrastructure migration, data center optimization, post-merger integration). Decades of experience in programming, systems programming, and quality management.

