Here at nClouds, we often work with fast-growth clients that require analytics of real-time streaming data. They need a modern data and analytics architecture that can analyze streaming data in real time, at a reasonable operational cost, without sacrificing performance.
Sound familiar?
Following is an example of how I implemented such a solution with one of our clients, cutting operational costs by more than 50% (an impressive 72%, in this case) and improving the efficiency and performance of their data and analytics system.
Our client, a fast-growth, AI-powered mediatech company that analyzes streaming social media data, wanted to reduce their data analytics costs without sacrificing performance. They selected nClouds based on the findings of an AWS Well-Architected Review we conducted with them and on our AWS technical expertise in data and analytics.
The client’s original architecture was a traditional Cloudera data and analytics solution using Hadoop and block storage. We collaborated with them to implement a modern, simplified, cloud-native Apache Spark data and analytics architecture — using Amazon Athena for access and Amazon Simple Storage Service (Amazon S3) object storage — designed to reduce operational costs and improve efficiency and performance based on best practices.
Why Apache Spark instead of Hadoop? And why Amazon S3 object storage instead of block storage?
The client was interested in moving from an Infrastructure-as-a-Service (IaaS) model to a fully managed Platform-as-a-Service (PaaS) model, and in going serverless (i.e., a Function-as-a-Service, or FaaS, model).
Note: FaaS is a simple, event-based architecture that triggers the execution of a function, where the cost is based on usage (you’re charged only for the resources consumed when the code runs). When FaaS is connected to PaaS, as in this case, we are combining functions with microservices.
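To make the FaaS model concrete, here is a minimal sketch of an event-triggered AWS Lambda handler, billed only for the time each invocation runs. The Kinesis-style event shape and field names are assumptions for illustration, not details from the client's implementation.

```python
# A minimal sketch of the FaaS model: an AWS Lambda handler invoked per
# event and billed only for the duration of each invocation.
# Assumes a Kinesis-style event payload (an assumption for illustration).
import base64
import json

def handler(event, context):
    results = []
    # Each invocation is triggered by an event source (here, a stream batch);
    # you pay only for the compute time this function actually consumes.
    for record in event.get("Records", []):
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        results.append(payload)
    return {"processed": len(results)}
```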
The table below summarizes the pay-as-you-go pricing model of each AWS service used in the solution:

| AWS service | Pricing model |
| --- | --- |
| Amazon API Gateway | Pay only when your APIs are in use. |
| Amazon Athena | Pay only for the queries you run, based on the amount of data scanned by each query. Significant cost savings and performance gains can be attained by compressing, partitioning, or converting your data to a columnar format, which reduces the amount of data Athena needs to scan to execute a query (see the sketch after this table). |
| Amazon Comprehend | Pay only for what you use, based on the amount of text processed monthly. |
| Amazon DynamoDB | Pay for reading, writing, and storing data in your DynamoDB tables, along with any optional features you choose to enable. There are two modes: on-demand capacity (pay per request) and provisioned capacity (pay for the read/write throughput you provision). |
| Amazon ElastiCache | Get started with this managed caching service for free; the AWS Free Usage Tier includes 750 hours per month of a t1.micro or t2.micro node. After that, pay only for what you use: on-demand nodes billed by the hour with no long-term commitment, or reserved nodes at a lower hourly rate in exchange for a one- or three-year term. |
| Amazon Elasticsearch Service | Pay only for what you use: instance hours, Amazon EBS storage (if you choose that option), and data transfer. |
| Amazon Kinesis Data Firehose | Pay only for the volume of data you ingest into the service and, if applicable, for data format conversion. It can save on storage costs by batching, compressing, and transforming data before loading it, minimizing the amount of storage used at the destination. |
| Amazon Simple Storage Service (Amazon S3) | Pay only for what you use. AWS charges less where its costs are lower, based on the location of your S3 bucket. |
| AWS Fargate | Pay only for the vCPU and memory resources that your containerized application uses. |
| AWS Glue | Pay an hourly rate, billed by the second, for crawlers (discovering data) and ETL jobs (processing and loading data). If you provision a development endpoint to interactively develop your ETL code, pay an hourly rate, billed per second. For the AWS Glue Data Catalog, pay a monthly fee for storing and accessing the metadata; the first million objects stored and the first million accesses are free. |
| AWS Lambda | Pay only for what you use, based on the number of requests for your functions and the time it takes your code to execute. |
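As a concrete illustration of the Athena cost lever noted in the table, here is a hedged sketch that runs a CTAS (CREATE TABLE AS SELECT) query via boto3 to convert raw data into partitioned, Snappy-compressed Parquet, so subsequent queries scan less data. The database, table, column, and bucket names are illustrative placeholders, not the client's actual resources.

```python
# A sketch of reducing Athena scan costs by converting raw data to
# partitioned, compressed Parquet with a CTAS query.
# All resource names below are hypothetical placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ctas = """
CREATE TABLE analytics_db.posts_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://example-enriched-bucket/posts_parquet/',
    partitioned_by = ARRAY['dt']
) AS
SELECT user_id, text, sentiment, dt  -- partition column must come last
FROM analytics_db.posts_raw;
"""

response = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```

Queries against the Parquet table then read only the columns and partitions they need, which is what drives down the per-query scan charge.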
Below is the architecture I created to perform real-time data analytics on streaming social media feeds.
In our architecture for this client, when AWS Lambda kicks in, it calls Amazon Comprehend – a machine learning-powered service that uses natural language processing (NLP) to find insights and relationships in unstructured data – and other processes to enrich the data. To power the User Interface (UI), it writes to the Amazon S3 Enriched bucket.
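Here is a minimal sketch of that enrichment step, assuming the Lambda function receives a batch of social posts and showing only sentiment detection; the bucket name, event shape, and key layout are illustrative assumptions, not the client's actual configuration.

```python
# A sketch of the enrichment Lambda: call Amazon Comprehend on each post,
# then write the enriched batch to the S3 "Enriched" bucket that backs the UI.
# Bucket name and event shape are hypothetical placeholders.
import json
import boto3

comprehend = boto3.client("comprehend")
s3 = boto3.client("s3")

ENRICHED_BUCKET = "example-enriched-bucket"  # placeholder name

def handler(event, context):
    enriched = []
    for post in event.get("posts", []):
        # Amazon Comprehend NLP call; billed per unit of text processed.
        sentiment = comprehend.detect_sentiment(
            Text=post["text"], LanguageCode="en"
        )
        post["sentiment"] = sentiment["Sentiment"]
        enriched.append(post)

    # Persist the enriched records for the UI to query.
    s3.put_object(
        Bucket=ENRICHED_BUCKET,
        Key=f"enriched/{context.aws_request_id}.json",
        Body=json.dumps(enriched).encode("utf-8"),
    )
    return {"enriched": len(enriched)}
```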
If you need to perform real-time analytics of streaming data, want to reduce your data analytics costs without sacrificing performance, and are currently using a Cloudera-based architecture, consider shifting to an architecture based on AWS-native services. The client described above is just one of many customers nClouds has partnered with who have realized the benefits of such a modern, simplified data analytics architecture.
If you want to read more about the related customer story, check out the case study with this AI innovator and mediatech.
Need help with data analytics on AWS? The nClouds team is here to help with that and all your AWS infrastructure requirements.