Python interoperability: building a Python-first, petabyte-scale database | July 17th-23rd 2023

Abstract

How can you scale Python to run at petabyte scale, with the reliability needed to trade billions of dollars? With ArcticDB we have been doing exactly that for the last four years, by leveraging interoperability between Python and high-performance C++, with a detailed understanding of the data structures inside Python and a few extra tricks up our sleeves.

Come take a peek under Python's bonnet and learn how to hotwire a few things along the way.

Talk | Software Engineering & Architecture (2023)

Description

This talk introduces experienced Pythonistas to the workings of Python's C API and gives you the means to get started building Python modules, particularly in modern C++. Writing your first Python extension is now much easier than you might expect. C++ has a reputation for gnarly syntax and a lack of memory safety, but features added to the language over the last decade, such as tuple return types with structured bindings and automatic type deduction, mean that your C++ can be more Pythonic than you might think. Since one of the most common reasons for writing an extension module is to leverage optimized machine code and multi-threading to handle really big data, we'll look at how to avoid shooting ourselves in the performance foot at the boundary between native code and Python.

In this talk we will lay out the potential pitfalls of interacting with the Python C API and share some of the hard-won experience we gained in engineering ArcticDB and running it at petabyte scale for critical Python trading applications. We'll also let you in on some tricks for implementing zero-copy data interchange, managing memory ownership and lifetimes, getting the most out of multi-threading through careful handling of the Global Interpreter Lock, and interacting with Pandas.
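
As a small taste of the zero-copy and lifetime themes, here is a minimal sketch using only Python's standard library. It is not ArcticDB code: the `prices` array merely stands in for memory that a native extension might own, and all variable names are illustrative assumptions. It shows that a `memoryview` shares memory with its owner through the buffer protocol, and that the owner cannot resize its buffer while a view is still alive.

```python
from array import array

# A buffer owned by one object (a stand-in for memory allocated by native code).
prices = array("d", [101.5, 102.0, 99.75, 100.25])

# memoryview exposes the same memory through the buffer protocol: no copy is made.
view = memoryview(prices)
window = view[1:3]  # slicing a memoryview is also zero-copy

# A write through the view is visible to the owner, so the memory is shared.
window[0] = 0.0
shared = prices[1] == 0.0

# By contrast, slicing the array itself copies the data.
copied = prices[1:3]
copied[0] = 42.0
independent = prices[1] == 0.0  # the owner is unaffected by the copy

# Lifetime: while a view is exported, the owner may not resize its buffer.
try:
    prices.append(1.0)
    resize_blocked = False
except BufferError:
    resize_blocked = True

# Releasing the views hands full control back to the owner.
window.release()
view.release()
prices.append(1.0)
```

The same ownership question arises at the C API level: whoever exports a buffer must keep it valid for as long as any consumer holds a view of it.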

Of course, nothing stays the same forever. The Python data landscape is changing rapidly, and with the advent of Pandas 2.0, the old workhorse NumPy is gradually being supplanted by Apache Arrow, a project that has interoperability at the core of its design. Some things will get easier; some problems will remain. We'll examine the landscape ahead and share vital information to help you future-proof your development both inside and alongside Python.

The speaker

William Dealtry

William Dealtry has been working in both Python and C++ for many years, and has been a member of the C++ standardization committee for more than a decade. Having previously worked with financial data at places like the New York Stock Exchange and Goldman Sachs, he is currently the Architect of a new open-source Dataframe database, ArcticDB, which is backed by long-time Python enthusiasts Man Group and Bloomberg.