Apache Arrow and Substrait, the secret foundations of Data Engineering
- 30 minutes
Apache Arrow and its Python library, PyArrow, are becoming the de facto standard for transferring data and for interoperability between libraries and languages. As more compute engines, storage systems and databases start to speak Arrow, you might be relying on it without even knowing. The same transformation is happening with Substrait, which is on track to become the standard representation of query plans themselves, allowing queries to be routed to any engine that speaks Substrait, or even decomposed and forwarded to several engines. This talk will provide a quick introduction to the Arrow ecosystem, showing Python developers how libraries like Pandas, Polars and PyArrow itself leverage Arrow, and how compute engines like Velox, DataFusion and Acero are embracing Arrow and Substrait. The talk will also show how a basic database system based on Arrow and Substrait can be built with a minimal amount of code, thanks to the foundations they provide.