Feb 20, 2023 · 9 min read

Data Hub vs Data Lake vs Data Warehouse: how to choose the right solution?


Talking to data specialists is like spending an evening with doctors. You hear a lot of complicated words, so convoluted that their meaning can escape you completely.

Because Wikipedia tends to present things exhaustively but not always accessibly, here are simplified definitions of the main terms used in Data, what they imply, and how they can serve you and your business. With a little luck, you will also become a connoisseur and be able to laugh naturally at your Data team's jokes...

To better explain the concepts, let's take a classic situation as an example, that of Alice: a young, dynamic entrepreneur full of enthusiasm. She recently launched an international video streaming platform. Her users can upload videos, watch them, comment on them... Her success is now worldwide, and she is seriously thinking about investing in Data, that much-talked-about black gold...

What is a Data Lake?

A Data Lake is nothing more than a massive (in theory, unbounded), inexpensive storage space for data (whatever its format!) with a "free" organization. To understand it simply: your computer's hard disk is literally a data lake. It stores a potentially very large amount of raw data, and you can explore its structure with your file explorer. Its query performance is usually not extraordinary and depends on how you divide the data into directories (also called partitions). Thanks to the great diversity of the data it holds, it is also the working base for Data Scientists, who will, for example, explore the data, train Machine Learning models, experiment with transformations, ...

Data from the Data Lake can be in any format: CSV, Excel, binary, Word, PDF, video or audio (MP4/MP3), JPEG, PNG...

To return to our initial example: a data lake can be a good solution for storing videos. Since users upload in large quantities, Alice will need elastic storage whose cost does not explode with that quantity. Because the videos are indexed by user and Alice would like at least decent performance, she will organize her data lake with one directory per user, each directory storing all of that user's videos.

Figure 1. A data lake is able to host any type of data.
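To make that layout concrete, here is a minimal sketch of the one-directory-per-user organization, assuming AWS S3 accessed with boto3; the bucket name and key layout are hypothetical.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "alice-video-lake"  # hypothetical bucket name

def upload_video(user_id: str, video_id: str, local_path: str) -> str:
    """Store a raw video under one 'directory' (key prefix) per user."""
    key = f"videos/user={user_id}/{video_id}.mp4"
    s3.upload_file(local_path, BUCKET, key)
    return key

# The object lands at e.g. s3://alice-video-lake/videos/user=42/intro.mp4
# upload_video("42", "intro", "/tmp/intro.mp4")
```

The `user=` prefix is the "partition" mentioned above: listing one user's videos only scans one prefix instead of the whole lake.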

Here are the three key words to remember: massive storage, diverse data, "free" organization.

Anecdotally, since the Data Lake imposes no structure by nature, it can turn into a Data Swamp: a mess of data in which order is only a distant memory...

Examples of data lake solutions: AWS S3, Azure Blob Storage, GCP Cloud Storage, Apache Hadoop, …

Reminder and definition of ETL: Extract-Transform-Load

ETL stands for Extract-Transform-Load. It describes the process of retrieving data from a stream (real-time or not), applying transformations, and then loading the result somewhere else. These operations can correspond to any data-processing function: an aggregation over a time window, an enrichment of records, an extraction of a subset of the data, ...

Let's take Alice's situation again: her platform is heavily used, and she would like to extract some key information to improve it. For example, she would like to know, in real time, how many users are watching a video on her platform each minute.

Figure 2. The ETL retrieves a flow from any source, transforms it, and then loads it to a new destination.

To do this, she will need an ETL chain: her mobile application constantly sends events whenever a user launches a video, and the chain aggregates those events over a one-minute time window. Very simply, it is nothing more than a computer program that retrieves one minute's worth of events, counts them, and then transmits the result. That's it! A first simple metric that she will be able to reuse in her sublime dashboards.
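Here is a minimal sketch of that one-minute aggregation in plain Python; the event shape is an assumption, and a production pipeline would run the same logic in a stream processor (Spark, Dataflow, ...).

```python
from collections import Counter
from datetime import datetime, timezone

def views_per_minute(events):
    """Count 'video started' events per one-minute tumbling window.

    `events` is an iterable of dicts with a Unix `timestamp` field
    (a hypothetical shape, for illustration only).
    """
    counts = Counter()
    for event in events:
        ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        window = ts.replace(second=0, microsecond=0)  # truncate to the minute
        counts[window] += 1
    return counts

events = [
    {"timestamp": 1_700_000_000, "user": "42", "video": "intro"},
    {"timestamp": 1_700_000_030, "user": "7", "video": "cats"},
    {"timestamp": 1_700_000_075, "user": "42", "video": "cats"},
]
for window, n in sorted(views_per_minute(events).items()):
    print(window.isoformat(), n)  # two events in the first minute, one in the next
```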

Here are the keywords to remember: extraction, transformation, aggregation, time window.

Examples of ETL solutions: AWS Glue, Azure Data Factory, GCP Dataflow, Apache Spark, …

Understanding the importance of the Data Warehouse

The data warehouse is a form of enhanced, or specialized, data lake. Everything is well organized and easily accessible, at a very high volume (though comparatively lower than a data lake's). You can retrieve very large amounts of data at any time, and the information it contains accumulates continuously. Its main distinction lies in the temporal structuring of its storage and the particular form of its data.

More practically, a data warehouse is actually a database that has been enhanced to easily handle queries requiring the analysis of potentially gigabytes of data. To do this, it distributes the information over several disks working in parallel, allowing it to read several storage sectors at the same time for any given request.

In addition to its hardware structure, the denormalized form of its data also optimizes its performance. Even if it takes up more space, the data warehouse stores records in which part of the information may be repeated, to avoid the costly join operations that normalization uses to save space.

It's simple: rather than structuring everything perfectly to avoid redundancy, it accepts redundancy in exchange for a large gain in performance.
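To make the trade-off concrete, here is the same view event represented both ways; the field names are illustrative.

```python
# Normalized: three separate tables, joins needed at query time.
users = {"42": {"age_range": "25-34"}}
videos = {"cats": {"category": "animals"}}
view_event = {"user_id": "42", "video_id": "cats", "watched_at": "2023-02-20"}

# Denormalized: one self-contained record, redundant but join-free.
warehouse_record = {
    "user_id": "42",
    "age_range": "25-34",         # copied from the users table
    "video_id": "cats",
    "video_category": "animals",  # copied from the videos table
    "watched_at": "2023-02-20",
}
```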

Returning to our example: since Alice likes nice synthetic visualizations, she would like to see the distribution of video views over the year, by content category. To do this, each time one of her users finishes watching a video, an event is sent to her data architecture. An ETL pipeline enriches each record with information about the user and the video (age range, video category, ...). All of this amounts to a massive quantity of records, so they are loaded into a data warehouse. Alice can then simply ask her warehouse for the whole year's records and build her visualization.

Figure 3. The Data Warehouse is literally a data warehouse: massive storage, structured and organized data.

To summarize, a data warehouse is simply a huge database that can read structured data from many SSDs concurrently and is optimized for mass querying.

So here are the keywords to remember: massive storage, optimized data, massive querying.

Examples of Data Warehousing solutions: AWS Redshift, Azure Synapse Analytics, GCP BigQuery, Apache Hive, Snowflake, …

Detailed analysis: the fundamental role of the Data Hub

The Data Hub is a virtual space in which you can reference and query all of your data sources. Its objective is simply to centralize the information your systems can provide. Its advantage lies in knowing the structure (or lack of one) of each source's data, and in the ability to query those sources to obtain information quickly.

Let's go back to Alice's company, as lively as its concept is original: its data lives in many places. The video data lake, the business database (user and video information), the real-time event data warehouse, but also, for example, each country's rules on what may or may not be broadcast, the CRM referencing interested advertisers, the Google Analytics of her application's usage, ... All of this represents a great diversity of sources, which the Data Hub is there to catalog. From a customer's CRM information, Alice can now automatically know how many videos match that advertiser's expectations, and how many users are likely to be interested in them.
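As a deliberately simplified sketch of what a data hub catalogs, here is a toy source registry; the entries and fields are invented for illustration, and real hubs (Azure Purview, DataHub, ...) add connectors, search, and lineage on top.

```python
# Each entry records where a source lives and whether its structure is known.
catalog = {
    "video_lake":  {"kind": "data lake",      "location": "s3://alice-video-lake",
                    "schema": None},          # raw files, no enforced schema
    "business_db": {"kind": "database",       "location": "postgres://users-videos",
                    "schema": "known"},
    "events_dwh":  {"kind": "data warehouse", "location": "redshift://views",
                    "schema": "known"},
    "crm":         {"kind": "SaaS",           "location": "crm://advertisers",
                    "schema": "known"},
}

def sources_with_schema(catalog):
    """List the sources a hub could query directly, i.e. with a known schema."""
    return [name for name, meta in catalog.items() if meta["schema"] == "known"]

print(sources_with_schema(catalog))  # ['business_db', 'events_dwh', 'crm']
```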

Here are the keywords to remember: centralization, referencing, querying, set of data sources.

Examples of Data Hub solutions: Azure Purview, DataHub, …

How can all this be useful to you?

Every organization necessarily generates data, whether from its business domain, from interactions with terminals, or from the Internet of Things. All of this data is, in essence, a source of value. In our video streaming example, knowing more about user habits, as well as the platform's main metrics in general, yields precise indicators that are essential for the future of the business. Whatever the revenue model, knowing more about your business can only be beneficial in a data-driven approach to continuous improvement.

How should you get started?

One thing at a time! You don't need to stand up a huge data warehouse and hundreds of ETL jobs right away.

A Data Lake to start with

The most important thing is to collect the data you generate so you can run an exploratory study. The objective is, for example, to create or use connectors that export your data to a Data Lake, without planning to work on it for the moment. This way, you will store a history of knowledge at low cost, ready to be used later.

An exploratory study

Once enough data has been collected (that's up to you: three days, several weeks, a few months), it's up to you and your teams to look at what's going on. Using tools like Power BI, Apache Superset, or Qlik (business intelligence tools), or simply a Python/SQL notebook, you will be able to piece information together, transform it on the fly, and look for relationships that make sense for your business. The idea is simply to draw up the blueprint that will let you add value to the data you generate. Don't force yourself to use everything you could use; focus instead on data that will really benefit you on a daily basis.
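Here is a sketch of what such an exploration might look like in a Python notebook, assuming the lake exports land as CSV files; the file name and column names are hypothetical.

```python
import pandas as pd

# One exported batch of user interactions (hypothetical file and columns).
df = pd.read_csv("exports/interactions-2023-02.csv")

# First look: which video categories keep users watching the longest?
print(
    df.groupby("video_category")["watch_seconds"]
      .agg(["count", "mean"])
      .sort_values("mean", ascending=False)
)
```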

The objective is to move you towards a data-driven vision.

Let's take an example: you can, starting today, export all your users' interactions daily to S3, AWS's data lake storage. After a month, connect a Business Intelligence tool (Power BI, QuickSight) to this data lake so you can query it live and create new business from your data without delay!
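A minimal sketch of that daily export, again assuming boto3 and a hypothetical bucket: one date-partitioned JSON-lines object per day, a layout BI tools can filter on.

```python
import json
from datetime import date

import boto3

s3 = boto3.client("s3")
BUCKET = "alice-interactions-lake"  # hypothetical bucket name

def export_daily_interactions(interactions, day: date) -> str:
    """Write one JSON-lines object per day, under a date partition."""
    key = f"interactions/date={day.isoformat()}/events.jsonl"
    body = "\n".join(json.dumps(e) for e in interactions)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return key

# export_daily_interactions([{"user": "42", "action": "play"}], date.today())
```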

What's next?

If your business generates small amounts of data, you have likely already reached the right balance between the value extracted from your data and infrastructure costs. That said, for more comfort, or if you want to go further, you can continue by creating ETL jobs that export the data you have selected (structuring it along the way) to your brand-new data warehouse. The same Business Intelligence tools will prefer connecting to it rather than to the Data Lake, and will benefit from much better performance.

Once your data pipelines are well defined, it may be time to think about using your data for something other than Business Intelligence: for example, making it available to your users as Open Data, or building your own Artificial Intelligence and Machine Learning models. In short, the world is open to you...


Thank you for reading this article.
