Apache Arrow and Substrait, the secret foundations of Data Engineering | July 17th-23rd 2023

Abstract

Apache Arrow and its Python library, PyArrow, are becoming the de facto standard for transferring data and for interoperability between libraries and languages. As more compute engines, storage systems and databases start to speak Arrow, you might be relying on it without even knowing. The same transformation is happening with Substrait, which is on track to become the standard representation of query plans themselves, allowing queries to be routed to different engines as long as they speak Substrait, or even decomposed and forwarded to different engines. This talk will provide a quick introduction to the Arrow ecosystem, showing Python developers how libraries like Pandas, Polars and PyArrow itself leverage Arrow, and how compute engines like Velox, DataFusion and Acero are embracing Arrow and Substrait. The talk will also show how a basic database system based on Arrow and Substrait can be built with a minimal amount of code thanks to the foundations they provide.

Talk | PyData: Data Engineering (2023)

The speaker

Alessandro Molina

A Python developer since 2001, Alessandro has relied on Python as his primary development language for more than 20 years.

He has worked as CTO and Director of Engineering with Python teams for the past 10 years and is currently Senior Director of Open Source Engineering.

Alessandro has been the core developer of the TurboGears2 web framework and maintainer of the Beaker caching/session framework. He is currently a contributor to the Apache Arrow project and has also authored many open source Python projects, such as the DEPOT file storage framework and the DukPy JavaScript interpreter for Python, and has collaborated with projects related to the Python web development ecosystem.

Alessandro is also the author of the books Crafting Test-Driven Software with Python and Modern Python Standard Library Cookbook.