Solving Data Problems in Management Accounting
- Level: beginner
- Room: South Hall 2B
- Start:
- Duration: 30 minutes
Abstract
Controllers deal with numbers all day long. They have to check a lot of data from different sources, and the reports they receive often contain erroneous or missing data. Identifying outliers and suspicious values is time-consuming.
This presentation will introduce an end-to-end workflow for small data problems that uses statistical tools and machine learning to make controllers' jobs easier and help them be more productive.
We will demonstrate how we used, among others,
- scipy
- pandera
- dirty_cat
- nltk
- fastnumbers
to create a self-improving system that automates the screening of reports and flags outliers in advance, so that they can be eliminated more quickly.
Description
Controllers deal with numbers all day long. They have to check a lot of data from different sources, and the reports they receive often contain erroneous or missing data. Identifying outliers and suspicious values is time-consuming.
This presentation will introduce an end-to-end workflow for small data problems that uses statistical tools and machine learning to make controllers' jobs easier and help them be more productive.
It is a common business problem that the data provided is incorrect due to misunderstandings, manual input, cultural differences, typos, etc., and these errors can often be weeded out in short order.
This talk will show how heuristic data validation can facilitate the automated detection of inaccuracies, outliers, and input errors - in our specific use case, for controllers.
In our use case we have to deal with a lot of reports. Some of them contain hundreds of columns and are structured very individually. Defining data types and expectations for every column of each individual report would be too time-consuming. We are dealing with a technically manageable number of data sets, but too many to leave to human visual inspection alone.
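One way to keep the per-report effort low is to let a validation library infer a baseline schema from a report that is already known to be good and then check new deliveries against it. The following is a minimal sketch using pandera's infer_schema; the column names and values are invented for illustration and are not taken from our actual reports.

```python
import pandas as pd
import pandera as pa
from pandera.errors import SchemaErrors

# Hypothetical report known to be good: names and figures are made up.
history = pd.DataFrame({
    "cost_center": ["C100", "C200", "C300"],
    "amount_eur": [1200.50, 980.00, 1105.75],
})

# Infer a baseline schema instead of hand-writing expectations
# for hundreds of columns per report.
schema = pa.infer_schema(history)

new_report = pd.DataFrame({
    "cost_center": ["C100", "C200", "C300"],
    "amount_eur": [1190.00, None, "1.050,25"],  # missing value and a locale typo
})

try:
    # lazy=True collects all failures instead of stopping at the first one
    schema.validate(new_report, lazy=True)
except SchemaErrors as err:
    print(err.failure_cases)  # table of offending columns and values
```

The inferred schema will usually need some manual loosening (e.g. for columns where missing values are legitimate), but it turns the per-column work into a review task rather than a writing task.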
In our talk we will present strategies on how we have solved small data problems using heuristic and statistical methods.
Questions to tackle:
- Are None values acceptable or not, and why?
- Is a value an outlier or a typo?
- How much deviation is acceptable? (a sketch follows this list)
- How can historical data help, and to what extent?
- What other external information can help to validate the data?
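As an example of how historical data can help answer the outlier and deviation questions, here is a minimal sketch of a robust z-score check built on scipy. The threshold and the monthly figures are illustrative assumptions, not parameters from our production setup.

```python
import numpy as np
from scipy import stats

def robust_outliers(values, history, threshold=3.5):
    """Flag values that deviate strongly from historical data,
    using a robust z-score based on the median and MAD."""
    center = np.median(history)
    mad = stats.median_abs_deviation(history, scale="normal")
    if mad == 0:
        return np.zeros(len(values), dtype=bool)
    z = np.abs((np.asarray(values) - center) / mad)
    return z > threshold

# Hypothetical monthly figures for one cost center
history = [1020, 980, 1005, 990, 1015, 1000]
current = [995, 10010, 1003]   # 10010 looks like a typo (extra digit)
print(robust_outliers(current, history))  # [False  True False]
```

Using the median and the median absolute deviation instead of mean and standard deviation keeps a single typo in the history from masking later errors.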
We will demonstrate how we used, among others,
- scipy
- pandera
- dirty_cat
- nltk
- fastnumbers
to create a self-improving system that automates the screening of reports and flags outliers in advance, so that they can be eliminated more quickly (see the sketch below).
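To illustrate the kind of heuristics such a system can combine, here is a minimal sketch that parses numeric cells with fastnumbers and maps misspelled category names to known master data with nltk's edit distance. The known cost centers, the decimal-comma rule, and the example values are assumptions made for this sketch, not the actual pipeline shown in the talk.

```python
from fastnumbers import fast_float
from nltk import edit_distance

KNOWN_CENTERS = ["Marketing", "Logistics", "Production"]  # hypothetical master data

def parse_amount(raw):
    """Try to read a numeric cell; return None for anything suspicious."""
    value = fast_float(raw, default=None)
    if value is None and isinstance(raw, str):
        # Accept common European decimal commas before giving up.
        value = fast_float(raw.replace(".", "").replace(",", "."), default=None)
    return value

def nearest_center(name, max_distance=2):
    """Map a possibly misspelled category to the closest known one."""
    best = min(KNOWN_CENTERS, key=lambda known: edit_distance(name, known))
    return best if edit_distance(name, best) <= max_distance else None

print(parse_amount("1.234,56"))      # 1234.56
print(parse_amount("n/a"))           # None -> flagged for review
print(nearest_center("Marketting"))  # 'Marketing'
```

Every correction that a controller confirms or rejects can be fed back as new master data or as an adjusted threshold, which is what makes the system self-improving over time.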
Audience: This presentation is intended for anyone interested in data quality management without heavy lifting, especially for small data problems.