Data partitioning: good practices in the design of Data Lakes.
Imagine that you are looking for the book “Crime and Punishment” by Fyodor Dostoevsky in the city library. All you need to do is go to the “International literature” section (assuming you are not in Russia) and look for the shelf with authors whose surnames start with “D”. This simple query would not be so easy, though, if for some strange reason the librarians decided to organize their sections by the date of acquisition of each book. In that case, you would not only have no idea where to look for the book, but copies of the same book acquired on different dates would also be spread over different shelves. You might have to dedicate a whole year of your life to finding a single book in this library!
The situation is no different from one of the most common problems that Data Lakes have to address: data tidiness. Typically, the output of long ETL pipelines is stored in a “folder” with a meaningful name, e.g., customers for the customer dataset. However, it is not unusual for each of these “folders” to store a massive amount of data, so a good partitioning of its content is essential for making any query on it efficient.
Let’s look at a concrete example. A data scientist is developing a machine learning model to predict the purchasing behavior of customers who live in Vienna, like cats, and hate football. If the customer dataset is partitioned by city, favorite animal, and most-hated sport, then the data scientist knows exactly where to find the data they need. On the other hand, if this dataset is partitioned by, let’s say, gender, the application written by the data scientist has no option but to scan every single row of the customer dataset to find the relevant data.
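As an illustrative sketch, partition pruning amounts to selecting files by the directory they live in. The paths and the helper function below are hypothetical, mimicking Hive-style `column=value` partition directories:

```python
from pathlib import PurePosixPath

# Hypothetical Hive-style partition layout for the customers dataset:
# each combination of partition values maps to its own directory.
files = [
    "customers/city=Vienna/animal=cat/sport=football/part-0000.parquet",
    "customers/city=Vienna/animal=dog/sport=tennis/part-0000.parquet",
    "customers/city=Berlin/animal=cat/sport=football/part-0000.parquet",
]

def prune(paths, **filters):
    """Keep only files whose partition directories match every filter."""
    selected = []
    for p in paths:
        parts = dict(part.split("=") for part in PurePosixPath(p).parts
                     if "=" in part)
        if all(parts.get(k) == v for k, v in filters.items()):
            selected.append(p)
    return selected

print(prune(files, city="Vienna", animal="cat"))
# only the first file survives; the other files are never even opened
```

With a filter on gender instead, none of the directory names help, and every file would have to be opened and scanned.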
Clearly, partitioned datasets help with querying the data. However, a well-partitioned dataset brings advantages far beyond that.
Avoid data shuffling with a good partitioning strategy
A second, more realistic example. Imagine that an organization has a stream of article price updates, and a dataset is created for monitoring which articles had price updates yesterday. An obvious choice here would be to partition the input dataset by update_date. In that case, the monitoring job has to read only the files in a single partition.
What if, a few months later, a new job is created to find the last price of each article available in the store? As the dataset is partitioned by update_date, the updates for each article are spread over the date partitions, and the job has to read a lot more data. Moreover, since any partition might hold the latest price update for a given article, the rows in each of these partitions have to be shuffled across the nodes of the cluster executing the query in order to determine which partition holds the latest update. Needless to say, this data shuffling massively slows down the runtime of the job.
The second job would clearly benefit if the input data were partitioned by article-id. In that case, all updates to each article are isolated in a single partition, so the job only has to find the row with the latest update timestamp for each article, with no data shuffling at all. Even better, if the dataset were partitioned by article-id/update_date, all the job needs to do is a file listing to figure out when the latest update for each article happened.
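The difference between the two strategies can be sketched in plain Python. The records below are a hypothetical stand-in for a distributed dataset; grouping by a key plays the role of partitioning:

```python
from collections import defaultdict

# Hypothetical price-update records: (article_id, update_date, price)
updates = [
    ("A1", "2024-01-01", 10.0),
    ("A1", "2024-02-01", 12.0),
    ("A2", "2024-01-15", 5.0),
]

# Partitioned by update_date: the rows for one article span many
# partitions, so finding its latest price needs data from all of them
# (in a cluster, this is the shuffle).
by_date = defaultdict(list)
for row in updates:
    by_date[row[1]].append(row)

# Partitioned by article_id: all rows for an article sit in one
# partition, so each partition can be reduced independently, no shuffle.
by_article = defaultdict(list)
for row in updates:
    by_article[row[0]].append(row)

last_price = {aid: max(rows, key=lambda r: r[1])[2]
              for aid, rows in by_article.items()}
print(last_price)  # {'A1': 12.0, 'A2': 5.0}
```

In the date-partitioned layout, `by_date` holds a piece of article A1 in two different partitions, which is exactly why the real job has to move data between nodes before it can answer the question.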
Unfortunately, there is no silver-bullet partitioning strategy, as future jobs might need to query the data in wildly different ways. Most probably, the engineering team maintaining the Data Lake will have to revisit its data partitioning strategies often.
Downsides of data partitioning
Clearly, the more partitions you have in the Data Lake, the larger the number of files created. Listing, opening, and closing files with data processing frameworks like Spark is notoriously slow, and if there is an excessive number of files, the performance of the job consuming the partitioned dataset might degrade. A common mitigation in this case is to compact all files of a single partition into fewer, larger files.
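A minimal sketch of compaction within one partition. The file names and row contents are made up; in Spark, this step is typically performed by rewriting the partition with `repartition()` or `coalesce()`:

```python
# Hypothetical small files within one partition, each holding a few rows.
partition_files = {
    "part-0000": [("A1", 10.0)],
    "part-0001": [("A2", 5.0)],
    "part-0002": [("A1", 12.0)],
}

def compact(files):
    """Merge the contents of many small files into a single larger file."""
    merged = []
    for name in sorted(files):
        merged.extend(files[name])
    return {"part-merged-0000": merged}

compacted = compact(partition_files)
print(len(compacted))  # 1 file now holds all the rows
```

The data is unchanged; only the number of files the consuming job must list and open goes down.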
Recently, Databricks has developed Delta Lake, which essentially creates JSON files containing metadata about the dataset, e.g., its schema, a transaction log entry for each write operation, partitions, etc. The whole idea behind Delta Lake is that a job consuming the data can read this metadata to figure out which files it needs to read, thus avoiding expensive listing operations. Most interestingly, to avoid an endless nesting of partitions for big datasets, Delta Lake on Databricks supports recording basic statistics for the columns in each data file. Consequently, instead of repartitioning the input dataset in a way that is performant for all jobs consuming it, we can simply record in the metadata basic statistics for the columns that are relevant to the new job. Even though the new job would not have lightning-fast performance, it would still partially benefit from the column statistics without having to re-engineer the partitioning of the input dataset.
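A hedged sketch of this data-skipping idea, with made-up file names and statistics: if the metadata records a min/max range per file for a column, a job can skip every file whose range cannot possibly match its filter:

```python
# Hypothetical per-file statistics, in the spirit of the min/max column
# stats that Delta Lake keeps in its transaction log.
file_stats = {
    "part-0000.parquet": {"price": (1.0, 9.9)},
    "part-0001.parquet": {"price": (10.0, 49.9)},
    "part-0002.parquet": {"price": (50.0, 99.0)},
}

def files_to_read(stats, column, value):
    """Skip any file whose [min, max] range cannot contain the value."""
    return [f for f, cols in stats.items()
            if cols[column][0] <= value <= cols[column][1]]

print(files_to_read(file_stats, "price", 12.0))
# → ['part-0001.parquet']
```

Note that `price` is not a partition column here: the skipping comes purely from the recorded statistics, which is why the dataset does not have to be repartitioned for the new job.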
Good data-partitioning strategies can turn ETL pipelines from raging beasts that consume hours and hours of computing power into compact jobs that require little or no communication among the processing units. When developing a new Data Platform, or revisiting the architecture of an existing one, engineers should define good strategies for partitioning all datasets in the Data Lake. The most recent versions of Spark and the new Delta Lake ship awesome features that support data partitioning, and engineers should use (and abuse) them!