Analytics
Overview
This document is a reference for Amazon Web Services (AWS) analytics services: Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (MSK), AWS Glue, Amazon EMR, Amazon SageMaker, Amazon Athena, S3/Glacier Select, Amazon QuickSight, and Amazon Managed Grafana.
Amazon Kinesis
General Information
- Platform: Amazon Kinesis
- Description: A platform for streaming data that makes it easy to load and analyze streams and to build custom streaming applications for specialized needs.
- Key Features:
  - Real-time, streaming data processing
  - Classic or enhanced fan-out consumers
  - Access via Virtual Private Cloud (VPC)
  - Identity and Access Management (IAM) access for users/groups
- Types:
  - Kinesis Data Streams
  - Kinesis Data Firehose
  - Kinesis Data Analytics
  - Kinesis Video Streams (for capturing, processing, and storing video streams)
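The difference between classic and enhanced fan-out consumers can be sketched numerically. The example below assumes the commonly documented read throughput of 2 MB/s per shard, shared by all classic consumers but dedicated to each enhanced fan-out consumer; the function name and constant are hypothetical and exist only for illustration.

```python
# Assumed per-shard read throughput in MB/s (not stated in this document).
SHARD_READ_MB_PER_SEC = 2.0


def per_consumer_read_mb_per_sec(consumers: int, enhanced_fan_out: bool) -> float:
    """Estimate the read throughput each consumer gets from one shard."""
    if consumers < 1:
        raise ValueError("consumers must be at least 1")
    if enhanced_fan_out:
        # Each enhanced fan-out consumer gets its own dedicated throughput.
        return SHARD_READ_MB_PER_SEC
    # Classic consumers share the shard's read throughput.
    return SHARD_READ_MB_PER_SEC / consumers


if __name__ == "__main__":
    for n in (1, 2, 4):
        classic = per_consumer_read_mb_per_sec(n, enhanced_fan_out=False)
        enhanced = per_consumer_read_mb_per_sec(n, enhanced_fan_out=True)
        print(f"{n} consumer(s): classic={classic} MB/s, enhanced={enhanced} MB/s")
```

Under these assumptions, classic consumers slow each other down as more are added, while enhanced fan-out keeps per-consumer throughput constant.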
Kinesis Data Streams
- Modes:
  - On-demand capacity or Provisioned mode
- Parallel Consumers: Up to 5
- Replication: Synchronous replication across 3 Availability Zones (AZs) in a single Region
- Storage Duration: Records are retained in shards for between 24 hours and 365 days
- Message Size Limit: 1 MB
- Encryption: TLS in flight and KMS at rest
- Output to:
  - Kinesis Data Firehose
  - Kinesis Data Analytics
  - Containers
  - AWS Lambda
  - AWS Glue
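For Provisioned mode, a rough shard count can be estimated from the expected write load. The sketch below uses the 1 MB message size limit and the 24-hour to 365-day retention range listed above; the per-shard write capacity of 1 MB/s or 1,000 records/s is an assumption based on commonly documented quotas, and both function names are hypothetical.

```python
import math

MAX_RECORD_BYTES = 1024 * 1024        # 1 MB message size limit (from the list above)
MIN_RETENTION_HOURS = 24              # from the list above
MAX_RETENTION_HOURS = 365 * 24        # 365 days, from the list above
SHARD_WRITE_BYTES_PER_SEC = 1024 * 1024   # assumed per-shard write capacity
SHARD_WRITE_RECORDS_PER_SEC = 1000        # assumed per-shard write capacity


def is_valid_retention(hours: int) -> bool:
    """Check that a retention period falls within the supported range."""
    return MIN_RETENTION_HOURS <= hours <= MAX_RETENTION_HOURS


def estimate_shards(avg_record_bytes: int, records_per_sec: int) -> int:
    """Estimate shards needed for writes in Provisioned mode."""
    if avg_record_bytes > MAX_RECORD_BYTES:
        raise ValueError("record exceeds the 1 MB message size limit")
    bytes_per_sec = avg_record_bytes * records_per_sec
    by_bytes = math.ceil(bytes_per_sec / SHARD_WRITE_BYTES_PER_SEC)
    by_count = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    # The tighter of the two limits decides; always at least one shard.
    return max(1, by_bytes, by_count)


if __name__ == "__main__":
    print(estimate_shards(avg_record_bytes=2048, records_per_sec=3000))
    print(is_valid_retention(7 * 24))
```

On-demand mode removes the need for this estimate, since capacity is managed automatically.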
Amazon Kinesis Data Firehose
- Type: Fully managed (serverless) service
- Scalability: Automatic scaling, no administration required
- Near Real-time Processing: Batches that do not fill the buffer are delivered after a buffer interval of at least 60 seconds
- Buffer Size: Minimum 1 MB
- Subscription: A delivery stream can be subscribed to an Amazon Simple Notification Service (SNS) topic
- Destinations:
  - S3
  - Amazon Redshift
  - Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
  - Custom destinations (HTTP/S endpoint)
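The buffering behaviour described above (deliver when the buffer fills, or when the interval elapses for a non-full batch) can be illustrated with a small simulation. This is a conceptual sketch only, not Firehose's implementation; the class `BufferSketch` and its methods are hypothetical, and time is passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BufferSketch:
    """Toy model of size-or-interval buffering (illustrative only)."""
    size_limit_bytes: int = 1024 * 1024   # 1 MB minimum buffer size
    interval_seconds: float = 60.0        # 60 second minimum interval
    _records: List[bytes] = field(default_factory=list, init=False)
    _bytes: int = field(default=0, init=False)
    _opened_at: Optional[float] = field(default=None, init=False)

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer a record; return a batch if a flush condition is met."""
        if self._opened_at is None:
            self._opened_at = now  # the interval starts with the first record
        self._records.append(record)
        self._bytes += len(record)
        return self._flush_if_due(now)

    def tick(self, now: float) -> Optional[List[bytes]]:
        """Check the interval without adding data."""
        return self._flush_if_due(now)

    def _flush_if_due(self, now: float) -> Optional[List[bytes]]:
        if not self._records:
            return None
        full = self._bytes >= self.size_limit_bytes
        expired = now - self._opened_at >= self.interval_seconds
        if not (full or expired):
            return None
        batch = self._records
        self._records, self._bytes, self._opened_at = [], 0, None
        return batch


if __name__ == "__main__":
    buf = BufferSketch(size_limit_bytes=10, interval_seconds=60)
    print(buf.add(b"abcd", now=0))   # not full, interval not elapsed
    print(buf.tick(now=60))          # interval elapsed, batch is delivered
```

The same two conditions explain why low-volume streams see delivery latency close to the buffer interval, while high-volume streams flush sooner.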
Amazon Kinesis Data Analytics
- Type: Fully managed (serverless)
- Compatibility: Reads from Kinesis Data Streams or Kinesis Data Firehose
- For SQL Applications:
  - Input/Output: Kinesis Data Streams or Kinesis Data Firehose
- For Apache Flink:
  - Input: Kinesis Data Streams or Amazon MSK
  - Output: Sink (S3/Kinesis Data Firehose)
Amazon Managed Streaming for Apache Kafka (MSK)
- Alternative to: Amazon Kinesis
- Type: Fully managed Apache Kafka on AWS
- Features:
  - Multi-AZ deployment in a VPC (up to 3 AZs for high availability)
  - Automatic recovery from common Apache Kafka failures
  - Serverless option
  - Default message size: 1 MB (configurable)
  - Supports plaintext or TLS in flight, and KMS encryption at rest
- Consumers:
  - Kinesis Data Analytics for Apache Flink
  - AWS Glue streaming ETL jobs (Apache Spark Streaming)
  - AWS Lambda, EC2/ECS/EKS
AWS Glue
- Type: Managed ETL service (fully serverless)
- Event Driven: Jobs can be started by a Lambda function triggered on S3 PutObject events
- Components:
  - Glue Data Catalog (crawlers scan databases and S3 data to collect metadata)
  - Glue Job Bookmarks (prevent reprocessing of old data)
  - Glue DataBrew (clean/normalize data using pre-built transformations)
  - Glue Studio (GUI for creating, running, and monitoring ETL jobs)
  - Streaming ETL (compatible with Kinesis Data Streams, Kafka, MSK)
  - Glue Elastic Views (combine and replicate data across multiple data stores using SQL)
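The idea behind Job Bookmarks, skipping data that an earlier run already processed, can be shown with a high-water-mark sketch. This is a conceptual illustration only and does not reflect how Glue stores or applies bookmarks; the function `run_incremental` and the `(key, modified_at)` tuples are hypothetical.

```python
from typing import Iterable, List, Tuple


def run_incremental(
    objects: Iterable[Tuple[str, int]], bookmark: int
) -> Tuple[List[str], int]:
    """Return keys modified after the bookmark, plus the advanced bookmark.

    Each object is a (key, modified_at) pair, where modified_at is any
    monotonically increasing timestamp (hypothetical input shape).
    """
    new_keys: List[str] = []
    new_bookmark = bookmark
    for key, modified_at in objects:
        if modified_at > bookmark:
            new_keys.append(key)  # not seen by a previous run
            new_bookmark = max(new_bookmark, modified_at)
    return new_keys, new_bookmark


if __name__ == "__main__":
    data = [("a.csv", 1), ("b.csv", 5), ("c.csv", 9)]
    keys, mark = run_incremental(data, bookmark=4)
    print(keys, mark)
    # A second run with the saved bookmark finds nothing new to process.
    print(run_incremental(data, bookmark=mark))
```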
Amazon EMR
- Type: Service to create Hadoop clusters for big data analysis
- Supported Technologies: Apache Spark, HBase, Presto, Flink, etc.
- Cluster Types: Long-running or transient (temporary)
- Node Types:
  - Master Node (cluster management)
  - Core Node (tasks and data storage)
  - Task Node (optional, for running tasks)
- Purchasing Options:
  - On-demand
  - Reserved
  - Spot Instances
Amazon SageMaker
- Type: Fully managed service for ML model development/data science
- Process Includes:
  - Labeling data
  - Training and tuning models
  - Serving API traffic against the models
Amazon Athena
- Type: Serverless query service for analyzing and querying data in S3 using standard SQL
- Optimizations: Compress data so less is scanned; use larger files (> 128 MB) to minimize overhead
- Cost: $5.00 per TB of data scanned
- Integration: Commonly used with Amazon QuickSight
- Federated Query: Allows SQL queries across various data sources
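Because Athena is billed by data scanned, the effect of compression on cost is simple arithmetic. The sketch below uses the $5.00 per TB figure above; it assumes 1 TB = 1024^4 bytes and ignores any per-query minimums or rounding, and the function name and the compression ratio in the usage example are hypothetical.

```python
PRICE_PER_TB_USD = 5.00      # from the list above
BYTES_PER_TB = 1024 ** 4     # assumption: binary terabyte


def athena_query_cost_usd(bytes_scanned: int) -> float:
    """Estimate query cost from bytes scanned (no minimums or rounding)."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    raw = 2 * BYTES_PER_TB
    compressed = raw // 4    # hypothetical 4:1 compression ratio
    print(f"uncompressed: ${athena_query_cost_usd(raw):.2f}")
    print(f"compressed:   ${athena_query_cost_usd(compressed):.2f}")
```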
S3/Glacier Select
- Queries: Simple SQL queries (no joins)
- Input: Glacier Select takes a CSV file as input and queries it with an S3 Select-style SQL statement
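A request for a simple S3 Select query over a CSV object might be assembled as follows. The sketch only builds the keyword arguments for boto3's `select_object_content` call and does not contact AWS; the bucket name, object key, column names, and the helper `build_select_request` are hypothetical, and the join check is a rough guard that mirrors the "no joins" restriction above.

```python
import re
from typing import Any, Dict


def build_select_request(bucket: str, key: str, expression: str) -> Dict[str, Any]:
    """Build keyword arguments for an S3 Select query over a CSV object."""
    if re.search(r"\bjoin\b", expression, re.IGNORECASE):
        raise ValueError("S3 Select supports simple queries only (no joins)")
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Treat the first CSV row as column names.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


if __name__ == "__main__":
    request = build_select_request(
        bucket="example-bucket",          # hypothetical bucket
        key="data/customers.csv",         # hypothetical object key
        expression="SELECT s.name FROM S3Object s WHERE s.city = 'Paris'",
    )
    print(request["Expression"])
    # With credentials configured, the request could be sent with:
    #   boto3.client("s3").select_object_content(**request)
```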
Amazon QuickSight
- Type: Serverless, ML-powered BI/analytics service for interactive visualizations and ad-hoc analysis
- Integration: Integrates with Amazon RDS
- Features:
  - In-memory computation using the SPICE engine
  - Column-Level Security (CLS)
  - Sharing of analyses or dashboards with users/groups
Amazon Managed Grafana
- Type: Managed service for data visualizations and analysis
- Features:
  - Analyzing, monitoring, and setting alarms on metrics, logs, and traces
  - Shareable dashboards that combine multiple data sources