How can large legal document repositories be brought into the public domain without releasing private data? The fundamental concepts behind document anonymization are entity recognition, masking type, and pseudonymization. Using Python and a collection of libraries such as spaCy, PyTorch, and others, we can achieve good anonymization scores. How is this applied within a workflow that contains AI models for named entity recognition (NER)? Once documents are anonymized, how can the result be improved with further text mining in Python-based apps and a human in the loop? Although the GDPR was approved in 2016, applying it at the European level remains a challenge in banking, legal, and other contexts. This talk covers the process of transforming PDF and DOCX documents into XML, processing them with regular expressions and spaCy/PyTorch models, and analyzing the results with AntConc and Textacy. All the ideas are supported by real experience from the MAPA project, a European anonymization project that finished in 2022.
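To make the difference between masking and pseudonymization concrete, here is a minimal sketch that uses only the Python standard library. It is not the MAPA implementation: the two regular expressions, the `demo-salt` value, and the function names `find_entities`, `mask`, and `pseudonymize` are all hypothetical and chosen for illustration. In a real workflow the regex step would sit next to a trained NER model rather than replace it.

```python
import hashlib
import re

# Hypothetical patterns for illustration only; real projects need
# domain- and locale-specific rules plus a trained NER model.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def find_entities(text):
    """Return non-overlapping (start, end, label) spans, sorted by start."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    spans.sort()
    kept, last_end = [], 0
    for start, end, label in spans:
        if start >= last_end:  # drop any span that overlaps an earlier one
            kept.append((start, end, label))
            last_end = end
    return kept


def _replace_spans(text, make_placeholder):
    """Rebuild the text, swapping each detected span for a placeholder."""
    out, pos = [], 0
    for start, end, label in find_entities(text):
        out.append(text[pos:start])
        out.append(make_placeholder(label, text[start:end]))
        pos = end
    out.append(text[pos:])
    return "".join(out)


def mask(text):
    """Masking: every entity becomes its bare label, e.g. [EMAIL]."""
    return _replace_spans(text, lambda label, value: f"[{label}]")


def pseudonymize(text, salt="demo-salt"):
    """Pseudonymization: the same value always maps to the same placeholder."""
    def placeholder(label, value):
        digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:8]
        return f"[{label}_{digest}]"
    return _replace_spans(text, placeholder)


sample = "Contact ana@example.org on 12/03/2021; ana@example.org replied."
print(mask(sample))
print(pseudonymize(sample))
```

Masking hides the fact that both emails in the sample are the same address, while the pseudonymized output keeps that link, which is useful for later text mining but also means the result still needs to be handled as personal data.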
Based on more than three years of experience anonymizing documents in different domains, the talk presents the necessary steps in the anonymization process and shows why Python tools are essential to it.
It will include a presentation of a European anonymization project completed in 2022, known as MAPA, whose data is available to the entire community (https://www.elrc-share.eu/repository/search/?q=MAP).
The talk will focus on the importance of AI models for scaling anonymization in environments with high volumes of documents, and on how Python technologies improve the performance of both the solution and the team that develops it.
The following frameworks will be mentioned in the presentation: spaCy, PyTorch, FastAPI, Textacy, pytest, and other base libraries.