Solving Data Problems in Management Accounting
- Level: beginner
- Room: South Hall 2B
- Start:
- Duration: 30 minutes
Abstract
Controllers deal with numbers all day long. They have to check a lot of data from different sources, and the reports they receive often contain erroneous or missing data. Identifying outliers and suspicious values is time-consuming.
This presentation will introduce an end-to-end workflow for small data problems that uses statistical tools and machine learning to make controllers' jobs easier and help them be more productive.
We will demonstrate how we used, among others,
- scipy
- pandera
- dirty_cat
- nltk
- fastnumbers
to create a self-improving system that automates the screening of reports and flags outliers in advance, so that they can be eliminated more quickly.
Description
Controllers deal with numbers all day long. They have to check a lot of data from different sources, and the reports they receive often contain erroneous or missing data. Identifying outliers and suspicious values is time-consuming.
This presentation will introduce an end-to-end workflow for small data problems that uses statistical tools and machine learning to make controllers' jobs easier and help them be more productive.
It is a common business problem that the data provided is incorrect due to misunderstandings, manual input, cultural differences, typos, etc., and these errors can often be weeded out in short order.
This talk will show how heuristic data validation can facilitate the automated detection of inaccuracies, outliers, and input errors - in our specific use case, for controllers.
In our use case we have to deal with a lot of reports. Some of them contain hundreds of columns and are structured very individually. Defining data types and expectations for every column of each individual report would be too time-consuming. We are dealing with a technically manageable number of data sets, but too many to leave to human visual inspection alone.
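One way to keep the per-report effort low is to let a validation library infer a baseline schema from a report that is already known to be good and then check new deliveries against it. The following is a minimal sketch using pandera's infer_schema; the column names and values are invented for illustration and are not taken from our actual reports.

```python
import pandas as pd
import pandera as pa
from pandera.errors import SchemaErrors

# Hypothetical report known to be good: names and figures are made up.
history = pd.DataFrame({
    "cost_center": ["C100", "C200", "C300"],
    "amount_eur": [1200.50, 980.00, 1105.75],
})

# Infer a baseline schema instead of hand-writing expectations
# for hundreds of columns per report.
schema = pa.infer_schema(history)

new_report = pd.DataFrame({
    "cost_center": ["C100", "C200", "C300"],
    "amount_eur": [1190.00, None, "1.050,25"],  # missing value and a locale typo
})

try:
    # lazy=True collects all failures instead of stopping at the first one
    schema.validate(new_report, lazy=True)
except SchemaErrors as err:
    print(err.failure_cases)  # table of offending columns and values
```

The inferred schema will usually need some manual loosening (e.g. for columns where missing values are legitimate), but it turns the per-column work into a review task rather than a writing task.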
In our talk we will present strategies on how we have solved small data problems using heuristic and statistical methods.
Questions to tackle:
- Are None values acceptable or not, and why?
- Is a value an outlier or a typo?
- How much deviation is acceptable? (a sketch follows this list)
- How can historical data help, and to what extent?
- What other external information can help to validate the data?
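As an example of how historical data can help answer the outlier and deviation questions, here is a minimal sketch of a robust z-score check built on scipy. The threshold and the monthly figures are illustrative assumptions, not parameters from our production setup.

```python
import numpy as np
from scipy import stats

def robust_outliers(values, history, threshold=3.5):
    """Flag values that deviate strongly from historical data,
    using a robust z-score based on the median and MAD."""
    center = np.median(history)
    mad = stats.median_abs_deviation(history, scale="normal")
    if mad == 0:
        return np.zeros(len(values), dtype=bool)
    z = np.abs((np.asarray(values) - center) / mad)
    return z > threshold

# Hypothetical monthly figures for one cost center
history = [1020, 980, 1005, 990, 1015, 1000]
current = [995, 10010, 1003]   # 10010 looks like a typo (extra digit)
print(robust_outliers(current, history))  # [False  True False]
```

Using the median and the median absolute deviation instead of mean and standard deviation keeps a single typo in the history from masking later errors.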
We will demonstrate how we used, among others,
- scipy
- pandera
- dirty_cat
- nltk
- fastnumbers
to create a self-improving system that automates the screening of reports and flags outliers in advance, so that they can be eliminated more quickly (see the sketch below).
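To illustrate the kind of heuristics such a system can combine, here is a minimal sketch that parses numeric cells with fastnumbers and maps misspelled category names to known master data with nltk's edit distance. The known cost centers, the decimal-comma rule, and the example values are assumptions made for this sketch, not the actual pipeline shown in the talk.

```python
from fastnumbers import fast_float
from nltk import edit_distance

KNOWN_CENTERS = ["Marketing", "Logistics", "Production"]  # hypothetical master data

def parse_amount(raw):
    """Try to read a numeric cell; return None for anything suspicious."""
    value = fast_float(raw, default=None)
    if value is None and isinstance(raw, str):
        # Accept common European decimal commas before giving up.
        value = fast_float(raw.replace(".", "").replace(",", "."), default=None)
    return value

def nearest_center(name, max_distance=2):
    """Map a possibly misspelled category to the closest known one."""
    best = min(KNOWN_CENTERS, key=lambda known: edit_distance(name, known))
    return best if edit_distance(name, best) <= max_distance else None

print(parse_amount("1.234,56"))      # 1234.56
print(parse_amount("n/a"))           # None -> flagged for review
print(nearest_center("Marketting"))  # 'Marketing'
```

Every correction that a controller confirms or rejects can be fed back as new master data or as an adjusted threshold, which is what makes the system self-improving over time.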
Audience: This presentation is intended for anyone interested in data quality management without heavy lifting, especially for small data problems.