Currently, SQL and cloud data warehouses (DWH) are extremely popular, and for good reason: their ease of use makes them a great fit for dashboarding and business intelligence (BI) use cases. However, this combination might not be the best choice for every problem. More precisely, business-critical data pipelines with high complexity might be better served by frameworks such as Apache Spark, which greatly benefit from tight integration with general-purpose languages like Python (e.g., PySpark).
Expect an opinionated comparison between Apache Spark and seemingly easier-to-use cloud native SQL engines. By the end of this talk, you will be challenged to think about why they are complementary and when each has its justification.
This talk is intended for data engineers, solution architects, and the like who deal with distributed computation engines for analytical batch workloads in their day-to-day work.
While SQL is the lingua franca for data analysis, it might not be the best choice for every problem. Business-critical data pipelines with high complexity demand appropriate abstractions, proper testability, and enhanced debugging tooling, which are hard to achieve with SQL and its surrounding frameworks. In contrast, using the power of Python in conjunction with frameworks like Apache Spark is a promising solution for the following reasons:
- Apache Spark’s DataFrame API is embedded in a general-purpose programming language like Python, which allows data engineers to leverage software engineering concepts such as inheritance, composition, introspection, and higher-order functions. This provides a toolbox of abstractions for properly handling complexity. More precisely, it makes it possible to structure, modularize, reuse, and dynamically generate the building blocks of data pipelines.
- To ensure semantic correctness, data engineers can develop unit tests for individual transformations or integration tests for entire pipelines in classical software development tradition, which is hardly possible with cloud-native SQL engines.
- Apache Spark is future-proof due to its wide adoption, runtime flexibility, and vendor independence.
The talk will guide you through each of the claims stated above using a real-world IoT use case while comparing the strengths and weaknesses of both Apache Spark and cloud-native SQL engines.