Private Data Anonymization with Python, Fundamentals

Level:: intermediate
Room:: terrace 2a
Start:: 15:30 on 20 July 2023
Duration:: 30 minutes

Abstract

How can large legal document repositories be brought into the public domain without releasing private data? The fundamental concepts behind document anonymization are entity recognition, masking types, and pseudonymization. Using Python and a collection of libraries such as spaCy and PyTorch, we can achieve good anonymization scores. How is this applied within a workflow that contains AI models for NER? And once documents are anonymized, how can the result be improved through further text mining with Python-based apps and a human in the loop? Although it was approved in 2016, applying the GDPR at the European level remains a challenge in banking, legal, and other contexts. This talk covers the process of transforming PDF and DOCX documents into XML, processing them with regular expressions and spaCy/PyTorch models, and analyzing the results with AntConc and Textacy. All the ideas are supported by real experience from MAPA, a European anonymization project that finished in 2022.
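The difference between masking and pseudonymization can be illustrated with a minimal sketch. It uses only regular expressions from the standard library; the NER model step that the talk describes is deliberately left out. The patterns, the function names, the tag format, and the salted-hash surrogate are all assumptions made for illustration, not the MAPA implementation.

```python
import hashlib
import re

# Hypothetical patterns for illustration only; real projects need
# locale-specific rules plus a trained NER model for names, places, etc.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def find_entities(text):
    """Return non-overlapping (start, end, label) spans, sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    spans.sort()
    result, last_end = [], 0
    for start, end, label in spans:
        if start >= last_end:  # drop spans that overlap an earlier one
            result.append((start, end, label))
            last_end = end
    return result


def _replace(text, make_replacement):
    """Rebuild the text, swapping each detected entity for a replacement."""
    out, cursor = [], 0
    for start, end, label in find_entities(text):
        out.append(text[cursor:start])
        out.append(make_replacement(text[start:end], label))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def mask(text):
    """Masking: replace each entity with its type tag, e.g. <EMAIL>."""
    return _replace(text, lambda value, label: f"<{label}>")


def pseudonymize(text, salt="demo-salt"):
    """Pseudonymization: the same value always maps to the same surrogate."""
    def surrogate(value, label):
        digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:8]
        return f"{label}_{digest}"
    return _replace(text, surrogate)


sample = "Contact ana@example.org or +34 600 123 456. Reply to ana@example.org."
print(mask(sample))
print(pseudonymize(sample))
```

Masking removes the value entirely, while the pseudonymized text keeps repeated mentions linked to the same surrogate, which preserves more structure for later text mining. A salted hash is used here only to make the mapping repeatable; it is not a statement about how strong the protection is.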

Talk
PyData: Deep Learning, NLP, CV (2023)

Description

Based on more than three years of experience anonymizing documents in different domains, this talk presents the necessary steps in the anonymization process and shows why Python tools are essential to it.

It includes a presentation of MAPA, a European anonymization project completed in 2022 whose data is available to the entire community (https://www.elrc-share.eu/repository/search/?q=MAP).

The talk focuses on the importance of AI models for scaling anonymization to high volumes of documents, and on how Python technologies improve the performance of both the solution and the team that develops it.
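One common way to put a number on how well a model anonymizes is to compare its predicted entity spans with human-reviewed annotations. The sketch below computes exact-match precision, recall, and F1 over (start, end, label) tuples. The function name, the span format, and the example data are assumptions for illustration; the talk does not state which metric its "scores" refer to.

```python
def span_scores(predicted, gold):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up spans: the model finds two of three annotated entities
# and adds one spurious ORG span.
gold = [(0, 4, "PER"), (10, 20, "EMAIL"), (30, 39, "PHONE")]
predicted = [(0, 4, "PER"), (10, 20, "EMAIL"), (25, 28, "ORG")]

precision, recall, f1 = span_scores(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For anonymization, recall usually deserves the closest attention, since every missed entity is private data that stays in the published document.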

The following frameworks will be mentioned in the presentation: spaCy, PyTorch, FastAPI, Textacy, pytest, and other base libraries.

The speakers

Abel Meneses Abad

Currently Data Scientist at Datwit US, Inc.

Private Health Data Anonymization
Health Data Analysis
AI adoption in Health and Legal domains

Machine Learning Engineer at Pangeanic (until May 2023)

AWS solution design based on async requests
Anonymization Toolkit using NER with customized tags with Python
Combining Machine Translation with Pseudonymization
REST API design for NLP tasks with FastAPI, Flair and spaCy
Flair and spaCy NER Model Evaluation
Image Anonymization Service design and evaluation
OCR service design and evaluation

Oscar L. Garcell

Passionate MLOps and Backend Developer. Currently living and working in Belgrade, Serbia.

Work Experience:

HTEC Group, Software Engineer
Datwit US LLC, Data Analyst
Pangeanic Language Technologies and Translation Services, Machine Learning Engineer
Auge CRM S.A.S. de C.V, Back End Developer & DevOps Engineer
OptimalBit LLC, DevOps Engineer
