Too Big for DAG Factories?

Level:: intermediate
Room:: south hall 2b
Start:: 11:20 on 21 July 2023
Duration:: 30 minutes

Abstract

Do you need to transform, optimize and scale your data workflow? In this talk, we’ll review use cases, and you’ll learn how to dynamically generate thousands of DAGs (Directed Acyclic Graphs) with Airflow.

Talk · PyData: Data Engineering

Description

You’re working on a project that needs to aggregate petabytes of data, and it doesn’t make sense to manually hard-code thousands of tables, DAGs (Directed Acyclic Graphs) and pipelines. How can you transform, optimize and scale your data workflow? Developers around the world (especially those who love Python) are using Apache Airflow — a platform created by the community to programmatically author, schedule and monitor workflows without limiting the scope of your pipelines.

In this talk, we’ll review use cases, and you’ll learn best practices for how to:

- use Airflow to transfer data, manage your infrastructure and more;
- implement Airflow in practical use cases, including as a:
  - workflow controller for ETL pipelines loading big data;
  - scheduler for a manufacturing process; and/or
  - batch process coordinator for any type of enterprise;
- scale and dynamically generate thousands of DAGs that come from JSON configuration files;
- automate the release of both the DAGs and infrastructure updates via a CI/CD pipeline;
- run all tasks simultaneously using Airflow.
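
To make the idea of generating DAGs from JSON configuration files concrete, here is a minimal sketch of a DAG factory. It is not the code from the talk or its repo. The config layout (`dag_id`, `schedule`, and a `tasks` list of `task_id`/`bash_command` pairs), the function names and the placeholder start date are all assumptions made for illustration, and the Airflow half assumes Airflow 2.4 or later (2.x), where `DAG` accepts a `schedule` argument.

```python
import json
from datetime import datetime
from pathlib import Path


def load_specs(config_dir):
    """Read every *.json file in config_dir and return {dag_id: spec}."""
    specs = {}
    for path in sorted(Path(config_dir).glob("*.json")):
        spec = json.loads(path.read_text())
        dag_id = spec["dag_id"]  # hypothetical required key
        if dag_id in specs:
            raise ValueError(f"duplicate dag_id {dag_id!r} in {path.name}")
        specs[dag_id] = spec
    return specs


def make_airflow_dag(spec):
    """Turn one spec into an Airflow DAG whose Bash tasks run in listed order."""
    # Imported lazily so load_specs can be used (and tested) without Airflow.
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    dag = DAG(
        dag_id=spec["dag_id"],
        schedule=spec.get("schedule"),  # assumes Airflow 2.4 or later
        start_date=datetime(2023, 1, 1),  # placeholder, not from the talk
        catchup=False,
    )
    previous = None
    for task in spec["tasks"]:
        current = BashOperator(
            task_id=task["task_id"],
            bash_command=task["bash_command"],
            dag=dag,
        )
        if previous is not None:
            previous >> current  # chain each task after the one before it
        previous = current
    return dag


def build_dags(config_dir, make_dag=make_airflow_dag):
    """Build one DAG object per config file, keyed by dag_id."""
    specs = load_specs(config_dir)
    return {dag_id: make_dag(spec) for dag_id, spec in specs.items()}


# In a Python file inside the Airflow DAGs folder, the generated objects would
# be exposed at module level so that Airflow can discover them, for example:
#     globals().update(build_dags("/path/to/dag_configs"))  # hypothetical path
```

Keeping the JSON loading separate from the Airflow-specific builder means the duplicate-`dag_id` check can run on its own, for instance as a validation step in a CI/CD pipeline before the configs are released.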

Both beginner and intermediate developers will benefit from this talk, which is ideal for developers who want to learn how to use Airflow to manage big data. Beginners will learn about dynamic DAG factories, and intermediate developers will learn how to scale DAG factories to thousands of DAGs, which is something Airflow can’t do out of the box.

After this talk and live demo, attendees will come away with best practices (and access to a code repo) that will allow them to scale to thousands of DAGs and spend more time having fun with big data.

The speaker

Calvin Hendryx-Parker

Calvin Hendryx-Parker is the co-founder and CTO of Six Feet Up, a Python and cloud expert consulting company that makes the world a better place by using technology to accelerate the initiatives of companies that do good.

Calvin’s Massive Transformative Purpose (MTP) is to inspire and enable tech leaders to open minds and bring the world together for a sustainable future. At Six Feet Up, Calvin establishes the company's technical vision and leads all aspects of the company's technology development. He provides the strategic vision for enhancing the offerings of the company and infrastructure, and he works with the team to set company priorities and implement processes that will help improve product and service development.

Calvin is passionate about the open source community and specializes in app development, AI, big data and cloud technology. He is regularly sought after to share his expertise — both at international conferences and in the media.

In 2019, Calvin was named an AWS Hero — one of only 48 Heroes in North America. He is the founder and host of the Python Web Conference; the co-founder of IndyPy, the largest Python meetup in Indiana with 2,100+ members; and the founder of IndyAWS, the fastest growing cloud meetup in the state with 800+ members. Additionally, Calvin is the driving force behind LoudSwarm by Six Feet Up, a high impact virtual event platform that debuted in June 2020.

Outside of work, Calvin spends time tinkering with new devices like the AWS DeepRacer, CircuitPython and Raspberry Pi. He is an avid distance runner and ran the 2014 NYC Marathon to support the Innocence Project. Before the pandemic, Calvin and his family enjoyed annual extended trips to France where his wife Gabrielle, the CEO of Six Feet Up, is from. Calvin lives in Fishers, IN and holds a Bachelor of Science from Purdue University.
