An open-source implementation of a Data Lake with DuckDB and AWS Lambdas. In this post we will show how to build a simple end-to-end application in the cloud on a serverless infrastructure. The purpose is simple: we want to show that we can develop directly against the cloud while minimizing the cognitive overhead of designing and building infrastructure. By Ciro Greco.
DuckDB is an open-source, in-process SQL OLAP database built specifically for analytical queries. It is still somewhat unclear how widely DuckDB is used in production, but for us today the killer feature is the ability to query Parquet files directly in S3 with SQL syntax. As data practitioners we want (and love) to build applications on top of our data as seamlessly as possible. Whether you work in BI, Data Science, or ML, all that matters is the final application and how fast you can see it working end-to-end. The infrastructure often gets in the way, though.
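As a minimal sketch of that feature, the snippet below builds a SQL query over a Parquet file in S3 and runs it with DuckDB's Python client. The bucket, key, and region are hypothetical placeholders, and the helper names `build_parquet_query` and `run_query` are ours for illustration. This is not the code from the post's repository. It assumes the `duckdb` package is installed and that S3 credentials are configured separately.

```python
def build_parquet_query(bucket: str, key: str, limit: int = 10) -> str:
    """Build a SQL query that reads a Parquet file straight from S3."""
    # read_parquet accepts an s3:// URI once the httpfs extension is loaded
    return f"SELECT * FROM read_parquet('s3://{bucket}/{key}') LIMIT {limit}"


def run_query(sql: str, region: str = "us-east-1") -> list:
    """Run the query in an in-memory DuckDB database and return the rows."""
    import duckdb  # imported lazily so the query builder works without it

    con = duckdb.connect()  # in-process, in-memory database
    con.execute("INSTALL httpfs")  # extension that adds S3/HTTP file access
    con.execute("LOAD httpfs")
    con.execute(f"SET s3_region='{region}'")  # assumed region
    # Assumption: S3 credentials are already available to DuckDB
    return con.execute(sql).fetchall()


if __name__ == "__main__":
    # Hypothetical bucket and key, for illustration only
    sql = build_parquet_query("my-data-lake", "events/2023/data.parquet")
    print(sql)
```

Calling `run_query(sql)` would then fetch the rows, provided the bucket exists and is readable with the configured credentials.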
The tutorial covers:
- Architecture
- Your first query engine + data lake from spare parts
- (Almost) free analytics
- A few remarks on the “Reasonable Scale”
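To make the "query engine from spare parts" idea a little more concrete, here is a rough sketch of what a Lambda function wrapping DuckDB could look like. The event shape (a `query` field), the response format, and the helper names `make_handler` and `duckdb_execute` are assumptions made for illustration, not the design used in the post's repository. The query executor is passed in as a parameter so the handler logic can be exercised without DuckDB or AWS.

```python
import json
from typing import Any, Callable, Dict, List


def make_handler(execute: Callable[[str], List[tuple]]):
    """Return a Lambda-style handler that runs the SQL found in event['query']."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        query = event.get("query")  # hypothetical event field
        if not query:
            return {"statusCode": 400,
                    "body": json.dumps({"error": "missing 'query'"})}
        rows = execute(query)
        # default=str keeps dates and decimals serializable
        body = json.dumps({"rows": [list(r) for r in rows]}, default=str)
        return {"statusCode": 200, "body": body}

    return handler


def duckdb_execute(sql: str) -> List[tuple]:
    """Run SQL in a fresh in-memory DuckDB database (assumes duckdb is packaged)."""
    import duckdb  # imported lazily, only when the function is invoked

    con = duckdb.connect()
    return con.execute(sql).fetchall()


# Entry point a Lambda runtime would be configured to call
handler = make_handler(duckdb_execute)
```

Locally, `make_handler(lambda q: [(1, "a")])({"query": "SELECT 1"}, None)` returns a 200 response whose body contains the rows, which makes the handler easy to test before deploying.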
In this post, we showed that the combination of data-first storage formats, on-demand compute, and in-memory OLAP processing opens up new possibilities at Reasonable Scale. A repository with the relevant code and an explanation of the architecture is also provided. Interesting read!