Many of our customers want to understand the advantages of migrating a data lake or data warehouse to the cloud. Such a migration is the right approach for creating an environment that supports application and analytics needs without constant concern about adding storage and compute resources as the company grows and its data needs increase.
However, realizing savings and performance benefits from migrating to the cloud typically takes more than a simple “lift and shift” of your as-is architecture. It takes planning and proper design to migrate your data ecosystem to the cloud, and it is critical to understand the available services. In this article, I’ll describe some key misunderstandings we’ve seen during data lake implementations so that you can plan around them.
In most on-premises data lakes, we’ve seen a single implementation used to support every aspect of the data ecosystem – ETL/ELT, storage, analytics, etc. Typically, this was done using technologies like Hadoop that made use of compute and storage in the data center. In Hadoop-based on-premises data lakes, that hardware was a sunk capital cost, and it was not a huge financial detriment if some of the horsepower periodically went unused. Leveraging these on-premises tactics in the cloud, however, has significant downsides.
One of the major issues we see is customers using data lake technology that relies on high-performance storage. In the data center, it is common to have this type of storage, connected over fiber. As noted above, these costs are capitalized and, eventually, fully depreciated. In the cloud, provisioning the highest tier of storage performance in the large quantities typical of a data lake can create costs that become debilitating to the business. Therefore, we recommend moving data to significantly cheaper storage such as Amazon Simple Storage Service (Amazon S3) and moving the processing to an in-memory approach using Apache Spark. Amazon Elastic MapReduce (Amazon EMR) supports this through the EMR File System (EMRFS), an implementation of the Hadoop Distributed File System (HDFS) that lets clusters read and write data directly in Amazon S3.
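To make the storage argument concrete, here is a minimal sketch of the arithmetic. The rates below are placeholders chosen for illustration, not actual AWS prices, and the names `StorageOption` and `monthly_storage_cost` are hypothetical. The one non-price assumption is that HDFS keeps three copies of each block by default, so block storage has to be provisioned for roughly three times the logical data size, whereas object storage is billed on the data you store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageOption:
    name: str
    price_per_gb_month: float  # placeholder rate, NOT a real AWS price
    copies: int                # copies of the data you provision and pay for


def monthly_storage_cost(data_gb: float, option: StorageOption) -> float:
    """Monthly cost of holding `data_gb` of logical data on `option`."""
    return data_gb * option.copies * option.price_per_gb_month


# Hypothetical inputs; substitute your provider's current rates.
block_hdfs = StorageOption("HDFS on block storage", price_per_gb_month=0.10, copies=3)
object_store = StorageOption("Object storage", price_per_gb_month=0.02, copies=1)

data_gb = 100_000  # roughly 100 TB of logical data
for option in (block_hdfs, object_store):
    cost = monthly_storage_cost(data_gb, option)
    print(f"{option.name}: {cost:,.0f} per month")
```

With real rates plugged in, the gap grows linearly with data volume, which is why the storage tier matters far more for a data lake than for a small application database.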
Separating transient computing needs from day-to-day compute requirements may seem like common sense. Yet we’ve seen many cloud environments where the data lake is sized both to handle the ETL/ELT processing and to serve as the data warehouse. This situation ties up large amounts of compute and perpetuates the use of expensive block storage as described above. Once the data lives in Amazon S3, the processing can be re-architected around transient Hadoop clusters on Amazon EMR and/or around AWS Glue. AWS Glue offers additional benefits in that it is a fully managed serverless solution with a greater reduction in the total cost of ownership (TCO), but that does come with some loss of control and flexibility. The main point, however, is that identifying transient compute needs is critical to leveraging the cloud properly, so that you continue to deliver for the business while keeping costs to a minimum.
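As a rough illustration of what "transient" means in practice, the sketch below builds the request for an Amazon EMR cluster that runs one Spark step and then shuts itself down. It only constructs the keyword arguments for boto3's `run_job_flow` call. The function name, release label, instance types, step name, and S3 paths are assumptions for the example, and the two role names are the EMR defaults, which may differ in your account.

```python
def transient_cluster_request(name, script_s3_uri, log_s3_uri, worker_count=2):
    """Build keyword arguments for boto3's EMR `run_job_flow` call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumption: use a release available to you
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_s3_uri,
        "Instances": {
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "workers", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": worker_count},
            ],
            # The key setting: terminate the cluster once its steps are done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "nightly-etl",  # hypothetical step name
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default role names; yours may differ
        "ServiceRole": "EMR_DefaultRole",
    }


request = transient_cluster_request(
    "nightly-etl",
    "s3://example-bucket/jobs/etl.py",  # hypothetical locations
    "s3://example-bucket/emr-logs/",
)
# To launch it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Because both input and output live in Amazon S3, nothing of value is lost when the cluster terminates, and you pay for compute only while the step is running.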
In a typical Hadoop-based data lake/data warehouse, all the data – both raw and aggregated – tends to sit in the lake on high-performance storage. This configuration may stem from the perception that any data being consumed (whether by data scientists or in typical business intelligence reports and dashboards) must sit in a high-performance warehouse.
In a typical on-premises environment, this leads to a major undertaking to load all the raw data into Hadoop or an enterprise data warehouse (EDW) like Teradata, Exadata, etc. In the cloud, this may result in a large amount of raw data being loaded into a long-running Amazon EMR cluster on Hadoop Distributed File System (HDFS) or a large Amazon Redshift or Snowflake Data Warehouse implementation. While there are instances where this may be appropriate, most consumption use cases only require aggregated data, and raw data is needed in the rarer drill-down scenarios. Also, in cases where raw data is necessary for end-user consumption, there is typically:
These assumptions are not guaranteed to hold for every use case, but they are valid more often than not. When they are valid, we recommend that the aggregations be done as part of the transient computing and loaded into a cheaper relational database management system (RDBMS); Amazon Relational Database Service (Amazon RDS) can satisfy this nicely. For raw data consumption, we recommend leveraging the AWS Glue Data Catalog, Amazon Athena, and raw data stored as Apache Parquet in Amazon S3. Because Amazon Athena bills by the amount of data each query scans, queries can become expensive if the partitioning strategy used for Amazon S3 does not align with the types of queries being executed. Partitioning strategy is important for the data lake overall, but it is probably most critical for cost avoidance when using Amazon Athena.
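The effect of partitioning on scanned data can be sketched in a few lines. The example assumes a Hive-style key layout in Amazon S3 (`key=value` folders), which is one common convention that Athena can prune on when the partitions are registered in the catalog and the query filters on the partition columns. The function names, bucket, and partition sizes are hypothetical.

```python
from datetime import date
from typing import Mapping


def partition_prefix(table_root: str, day: date) -> str:
    """Hive-style S3 prefix for one daily partition."""
    return f"{table_root}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


def bytes_scanned(partition_sizes: Mapping[date, int], start: date, end: date,
                  filters_on_partition_key: bool) -> int:
    """Estimate the bytes a query over [start, end] has to read.

    If the query filters on the partition columns, only matching partitions
    are read; otherwise every partition of the table is scanned.
    """
    if not filters_on_partition_key:
        return sum(partition_sizes.values())
    return sum(size for day, size in partition_sizes.items() if start <= day <= end)


# Hypothetical table: 30 daily partitions of 1 GB each.
GB = 1024 ** 3
sizes = {date(2024, 1, d): GB for d in range(1, 31)}
one_day = date(2024, 1, 15)

print(partition_prefix("s3://example-bucket/events", one_day))
print(bytes_scanned(sizes, one_day, one_day, filters_on_partition_key=True) // GB, "GB")
print(bytes_scanned(sizes, one_day, one_day, filters_on_partition_key=False) // GB, "GB")
```

If most queries filter by customer rather than by date, a date-only layout gives no pruning for them, so the partition keys should be chosen from the filters your users actually apply.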
It is possible to significantly reduce your TCO by moving your data ecosystem to the cloud. However, you need to do it with deliberate planning – consult your cloud provider and engage a partner (like nClouds) that specializes in that cloud provider’s services. The cloud is moving fast, and providers like AWS are continuously adding new services while also improving existing ones. In some cases, services may appear to overlap, and it may be unclear which one to use when (e.g., AWS Glue vs. transient EMR).
nClouds is eager to help you with your data lake initiative and your overall data and analytics strategy. We’ve got the experience, AWS data and analytics how-to knowledge, plus our own research initiatives, to help you plan and execute your strategy.