Building a Successful Modern Data Analytics Platform in the Cloud
If you are planning to attend the upcoming AWS Summit in San Francisco, you are invited to attend my session on data analytics platforms, and you can skip reading this article. Otherwise, I urge you to examine your beliefs and plans for a data analytics platform: based on my experience from dozens of similar projects, what seems simple at the beginning often turns out to be complex over time.
Why do you need multiple data tiers?
The main assumption I want you to examine is building the platform around a single tool. The tool can be conceptual, such as a data warehouse or a data lake, or it can be a specific product such as Snowflake, Databricks, Amazon Redshift, Azure Synapse, or SAP HANA. The vendors of these tools claim that their product can solve all or most data use cases: ETL, SQL engine, BI tool connectivity, and even machine learning. It sounds great: you only need to pay for a single license, train your people on a single tool, and build a single security mechanism, and all your data analytics problems are solved. Sadly, it is not so easy, and as with many other kinds of problems, breaking the big problem into smaller ones yields a better solution.
The data in your data analytics platform has three main forms: (1) raw data from multiple sources, (2) enriched data for different analyses, and (3) processed data for users' consumption. The data in each of these forms differs in size, structure, access requirements, and so on; therefore, you should store it in different tiers and process it with different tools.
What are the three data tiers?
The data in Tier 1 (L1 in the diagram above) is big, as it comes from multiple data sources (transactions in relational databases, events from your users, IoT metrics, system logs, etc.) and in raw formats such as JSON, XML, and plain text. It is almost impossible to harmonize all that data and fit it into a single schema that describes everything about it. Even a few years ago, when data volumes were smaller, most data warehouse projects failed because of this complexity; today the problem is even harder.
The data in Tier 2 (L2 above) is needed to build an analytical model or system for a specific set of business problems. It doesn't need to touch all the data in Tier 1, as most of it is not relevant to the business problem at hand. However, the data that is relevant to a specific analysis must be processed and enriched significantly to produce a meaningful output (a machine learning model, business intelligence analyses and graphs, etc.) that will benefit the business people. Therefore, Tier 2 holds multiple copies of parts of the data from Tier 1, processed intelligently by the data engineers, scientists, and analysts.
The data in Tier 3 is meant only for consumption by human users and is therefore mostly small data (compared to the big data in the lower tiers). It is already aggregated to the level the business people care about (daily or monthly, compared to seconds or even milliseconds in Tier 1 or Tier 2), and it has been enriched and analyzed by the data experts who processed it in Tier 2.
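To make the flow between the three tiers concrete, here is a minimal sketch in Python. It uses in-memory stand-ins for the actual storage services, and the event fields and granularity are illustrative assumptions, not from the original post:

```python
from collections import defaultdict
from datetime import datetime

# Tier 1: raw events as they arrive from sources (JSON-like records, mixed shapes).
tier1_raw = [
    {"ts": "2023-05-01T10:15:03", "user": "a", "event": "click", "ms": 120},
    {"ts": "2023-05-01T11:42:57", "user": "b", "event": "click", "ms": 95},
    {"ts": "2023-05-02T09:01:11", "user": "a", "event": "error"},  # no latency field
]

def enrich(raw_events):
    """Tier 2: keep only the events relevant to one analysis, parsed and enriched."""
    enriched = []
    for e in raw_events:
        if e["event"] != "click":  # most raw data is irrelevant to this analysis
            continue
        ts = datetime.fromisoformat(e["ts"])
        enriched.append({"day": ts.date().isoformat(),
                         "user": e["user"],
                         "latency_ms": e["ms"]})
    return enriched

def aggregate(enriched_events):
    """Tier 3: small, pre-aggregated data at the granularity business users care about."""
    daily = defaultdict(lambda: {"clicks": 0, "total_ms": 0})
    for e in enriched_events:
        d = daily[e["day"]]
        d["clicks"] += 1
        d["total_ms"] += e["latency_ms"]
    return {day: {"clicks": v["clicks"],
                  "avg_latency_ms": v["total_ms"] / v["clicks"]}
            for day, v in daily.items()}

tier2 = enrich(tier1_raw)
tier3 = aggregate(tier2)
print(tier3)  # -> {'2023-05-01': {'clicks': 2, 'avg_latency_ms': 107.5}}
```

Note how each step shrinks and reshapes the data: the raw events keep every field from every source, the Tier 2 records keep only what one analysis needs, and the Tier 3 output is a small daily summary ready for a dashboard.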
What do you get from multiple data tiers?
There are many benefits to the multiple tiers approach:
- Security - The best security is isolation. If I have access to an environment that holds data, I might be able to access that data, whether through a permissions mistake or a malicious attack. Most security breaches are caused by a user who had access to the data and made a mistake that allowed others to access it. With the multi-tier approach, very few administrators can access the data in Tier 1, and the many users who can access Tier 3 have very little data to reach. The same goes for the data experts, who access only the pieces of data they need for their analyses. It is much easier to control access to an environment than to the data within it.
- Cost & Performance - With a one-size-fits-all approach that puts all the data in a single data store, you find it hard to balance the cost of storage and compute for the different use cases against the performance that is expected. A few data experts querying big data have very different needs from thousands of business users who each need only small pieces of data. It is true that the big data vendors try to solve this problem with separation of storage and compute, different workgroups and queues, and so on. Nevertheless, they can't make it as cost effective as choosing the right tool for each task. For example, Amazon S3 can be a good storage solution for Tier 1, less so for Tier 3. Conversely, Amazon RDS can be a good solution for queries in Tier 3, but not so much for Tier 1 or Tier 2.
- Responsibility - In a large organization with many data sources and systems, there are also multiple IT teams focused on different parts of the IT platform. A common split is between Data & BI systems and Business Applications. The multi-tier approach makes it easy to give each team responsibility for its part of the platform and coordinate only on the flows between the tiers. Another aspect of this split is the ability to bring in external data experts (such as data scientists) to work independently on some of the analyses, by carving out for them the relevant pieces of data along with the cloud environment and resources they need to work on it.
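The isolation principle behind the first benefit above can be sketched as a toy access model. The roles and tier names here are purely illustrative assumptions; a real platform would express this with IAM policies per environment rather than application code:

```python
# Hypothetical sketch of "security by isolation": each role is granted access
# to entire tiers (environments), not to individual datasets within them.
TIER_ACCESS = {
    "platform_admin": {"tier1", "tier2", "tier3"},  # very few people
    "data_scientist": {"tier2"},                    # only their working environment
    "business_user":  {"tier3"},                    # many users, very little data
}

def can_access(role: str, tier: str) -> bool:
    """Access is decided at the environment boundary, not per data item."""
    return tier in TIER_ACCESS.get(role, set())

print(can_access("business_user", "tier1"))  # -> False: raw data is off limits
print(can_access("business_user", "tier3"))  # -> True
```

The point of the sketch is the shape of the policy: controlling who can enter each tier is one small table, whereas controlling access to every dataset inside a shared store is a far larger and more error-prone surface.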
Summary and further action
In this short post, we described the main concepts behind, and reasons for, a multi-tier approach to data analytics platforms in the cloud. If you want to read more about these topics, you are welcome to read the following articles: Building a Successful Modern Data Analytics Platform in the Cloud, Architecting a Successful Modern Data Analytics Platform in the Cloud, and Securing a Successful Modern Data Analytics Platform in the Cloud.
You are also most welcome to join our team at aiOla or to consult with us about your data analytics journey. We would love to help you navigate the many confusing and rapidly changing tools and technologies.