Analytics

Overview

This document is a reference for Amazon Web Services (AWS) analytics services. It covers Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (MSK), AWS Glue, Amazon EMR, Amazon SageMaker, Amazon Athena, S3/Glacier Select, Amazon QuickSight, and Amazon Managed Grafana.


Amazon Kinesis

General Information

  • Platform: Amazon Kinesis
  • Description: A platform for streaming data on AWS that makes it easy to load and analyze streams and to build custom streaming applications.
  • Key Features:
    • Real-time, streaming data processing
    • Classic or enhanced fan-out consumers
    • Access via Virtual Private Cloud (VPC)
    • Identity and Access Management (IAM) access for users/groups
  • Types:
    • Kinesis Data Streams
    • Kinesis Data Firehose
    • Kinesis Data Analytics
    • Kinesis Video Streams (for capturing, processing, and storing video streams)

Kinesis Data Streams

  • Modes:
    • On-demand capacity or Provisioned mode
  • Parallel Consumers: Up to 5
  • Replication: Synchronous replication across 3 Availability Zones (AZ) in a single Region
  • Storage Duration: Between 24 hours and 365 days in shards
  • Message Size Limit: 1 MB
  • Encryption: TLS in flight and KMS encryption at rest
  • Output to:
    • Kinesis Data Firehose
    • Kinesis Data Analytics
    • Containers
    • AWS Lambda
    • AWS Glue
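
The size and retention limits listed above can be checked on the producer side before a record is sent. The following is a minimal sketch, not part of any AWS SDK: the function names are hypothetical, and treating 1 MB as 1,048,576 bytes is an assumption.

```python
# Limits taken from the list above; the byte interpretation of "1 MB" is an assumption.
MAX_RECORD_BYTES = 1024 * 1024      # 1 MB message size limit
MIN_RETENTION_HOURS = 24            # minimum storage duration
MAX_RETENTION_HOURS = 365 * 24      # maximum storage duration (365 days)


def record_fits(payload: bytes) -> bool:
    """Return True if the payload is within the 1 MB message size limit."""
    return len(payload) <= MAX_RECORD_BYTES


def clamp_retention_hours(requested_hours: int) -> int:
    """Clamp a requested retention period to the 24 hours to 365 days window."""
    return max(MIN_RETENTION_HOURS, min(requested_hours, MAX_RETENTION_HOURS))
```

A producer could call `record_fits` before each write and split or reject oversized payloads, rather than relying on the service to refuse them.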

Amazon Kinesis Data Firehose

  • Type: Fully Managed (serverless) service
  • Scalability: Automatic scaling, no administration required
  • Near-Real-time Processing: Delivery is buffered; a batch that has not filled is flushed after a buffer interval of at least 60 seconds
  • Buffer Size: Minimum 1 MB
  • Subscription: Can subscribe to Simple Notification Service (SNS)
  • Destinations:
    • S3
    • Amazon Redshift
    • Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
    • Custom destinations (HTTP/S endpoint)
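
The 60-second and 1 MB figures above describe size-or-time buffering: a batch is delivered when it fills up or when it has waited long enough, whichever comes first. The toy model below illustrates that rule only. It is not how Firehose is implemented, and the class name, field names, and default thresholds are assumptions chosen to match the numbers in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BufferSketch:
    """Toy model of size-or-time buffering; not the Firehose implementation."""
    max_bytes: int = 1024 * 1024      # assumed buffer size threshold (1 MB)
    max_age_seconds: float = 60.0     # assumed buffer interval (60 seconds)
    _records: List[bytes] = field(default_factory=list)
    _size: int = 0
    _opened_at: Optional[float] = None

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Add a record; return a flushed batch if a threshold is reached."""
        if self._opened_at is None:
            self._opened_at = now     # first record starts the interval clock
        self._records.append(record)
        self._size += len(record)
        return self.maybe_flush(now)

    def maybe_flush(self, now: float) -> Optional[List[bytes]]:
        """Flush when the buffer is full or its oldest record is too old."""
        if self._opened_at is None:
            return None               # nothing buffered
        full = self._size >= self.max_bytes
        stale = now - self._opened_at >= self.max_age_seconds
        if not (full or stale):
            return None
        batch = self._records
        self._records, self._size, self._opened_at = [], 0, None
        return batch
```

Passing `now` explicitly keeps the sketch deterministic: a small record added at `now=0.0` stays buffered until `maybe_flush(now=60.0)`, while a single 1 MB record is flushed immediately.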

Amazon Kinesis Data Analytics

  • Type: Fully Managed (serverless)
  • Compatibility: Utilizes Kinesis Data Streams or Kinesis Data Firehose
  • For SQL Applications:
    • Input/Output: Kinesis Data Streams or Kinesis Data Firehose
  • For Apache Flink:
    • Input: Kinesis Data Streams or Amazon MSK
    • Output: Sink (S3/Kinesis Data Firehose)

Amazon Managed Streaming for Apache Kafka (MSK)

  • Alternative to: Amazon Kinesis
  • Type: Fully managed Apache Kafka on AWS
  • Features:
    • Multi-AZ deployment in a VPC (up to 3 AZs for high availability)
    • Automatic recovery from common Apache Kafka failures
    • Serverless option
    • Default message size: 1 MB (configurable)
    • Supports plaintext, TLS in-flight, or KMS at-rest encryption
  • Consumers:
    • Kinesis Data Analytics for Apache Flink
    • AWS Glue streaming ETL jobs (Apache Spark Streaming)
    • AWS Lambda, EC2/ECS/EKS

AWS Glue

  • Type: Managed ETL service (fully serverless)
  • Event Driven: Jobs can be started by a Lambda function triggered on an S3 PutObject event
  • Components:
    • Glue Data Catalog (crawls DBs/S3/data for metadata)
    • Glue Job Bookmarks (prevents reprocessing old data)
    • Glue DataBrew (clean/normalize data using pre-built transformations)
    • Glue Studio (GUI for creating, running, and monitoring ETL jobs)
    • Streaming ETL (compatible with Kinesis Data Streams, Apache Kafka, and MSK)
    • Glue Elastic Views (combine and replicate data across multiple data stores using SQL)
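
The idea behind job bookmarks (process only what is new since the last run) can be sketched in a few lines. This is a conceptual illustration only: Glue tracks bookmark state itself, and the function name, the `last_processed` key, and the timestamp-based comparison are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple


def select_new_objects(
    objects: Iterable[Tuple[str, float]],
    bookmark: Dict[str, float],
) -> List[str]:
    """Return keys modified after the bookmark and advance the bookmark.

    `objects` is an iterable of (key, last_modified_timestamp) pairs.
    `bookmark` holds a single hypothetical "last_processed" timestamp.
    """
    last = bookmark.get("last_processed", float("-inf"))
    # Keep only objects newer than the bookmark, oldest first.
    fresh = sorted((ts, key) for key, ts in objects if ts > last)
    if fresh:
        bookmark["last_processed"] = fresh[-1][0]
    return [key for _, key in fresh]
```

On a first run with an empty bookmark every object is returned; on the next run only objects with a later timestamp are, so old data is not reprocessed.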

Amazon EMR

  • Type: Service to create Hadoop clusters for big data analysis
  • Supported Technologies: Apache Spark, HBase, Presto, Flink, etc.
  • Cluster Types: Long-running or transient (temporary)
  • Node Types:
    • Master Node (cluster management)
    • Core Node (tasks and data storage)
    • Task Node (optional, for running tasks)
  • Purchasing Options:
    • On-demand
    • Reserved
    • Spot Instances

Amazon SageMaker

  • Type: Fully managed service for ML model development/data science
  • Process Includes:
    • Labeling data
    • Training and tuning models
    • Serving API traffic against the models

Amazon Athena

  • Type: Serverless query service for analyzing and querying data in S3 using standard SQL
  • Optimizations: Compress data so that less is scanned, and use larger files (> 128 MB) to reduce per-file overhead
  • Cost: $5.00 per TB scanned
  • Integration: Commonly used with Amazon QuickSight
  • Federated Query: Allows SQL queries across various data sources
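
Because cost scales with the data scanned, the effect of compression can be estimated with simple arithmetic. The sketch below uses the $5.00 per TB rate quoted above; treating 1 TB as 2^40 bytes and the 4:1 compression ratio are assumptions for illustration, and actual pricing may differ.

```python
PRICE_PER_TB_USD = 5.00       # rate quoted above; actual pricing may vary
BYTES_PER_TB = 1024 ** 4      # assumption: 1 TB treated as 2**40 bytes


def athena_scan_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


raw_cost = athena_scan_cost(BYTES_PER_TB)               # 1 TB uncompressed: 5.0
compressed_cost = athena_scan_cost(BYTES_PER_TB // 4)   # hypothetical 4:1 ratio: 1.25
```

Under these assumptions the same query costs a quarter as much once the data is compressed.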

S3/Glacier Select

  • Queries: Simple SQL queries (no joins)
  • Input: Glacier Select input is a CSV file with an S3 Select Statement
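
A simple single-object query of the kind described above might be assembled as follows. This is a sketch under stated assumptions: the helper name, bucket, key, and column names are hypothetical, and the dictionary mirrors the keyword arguments of boto3's `select_object_content` call without sending anything.

```python
from typing import Any, Dict


def build_select_request(bucket: str, key: str, expression: str) -> Dict[str, Any]:
    """Build keyword arguments for an S3 Select query over a CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Treat the first CSV row as column names so they can be used in the query.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


request = build_select_request(
    bucket="example-bucket",              # hypothetical bucket name
    key="logs/events.csv",                # hypothetical object key
    expression="SELECT s.user_id FROM S3Object s WHERE s.status = 'error'",
)
# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   response = boto3.client("s3").select_object_content(**request)
```

The expression filters rows and columns within one object, which matches the restriction to simple queries without joins.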

Amazon QuickSight

  • Type: Serverless, ML-powered BI/analytics service for interactive visualizations and ad-hoc analysis
  • Integration: Integrates with Amazon RDS
  • Features:
    • In-memory computation using the SPICE engine (Super-fast, Parallel, In-memory Calculation Engine)
    • Column-Level Security (CLS)
    • Sharing of analysis or dashboards with users/groups

Amazon Managed Grafana

  • Type: Managed service for data visualizations and analysis
  • Features:
    • Analyzing, monitoring, setting alarms on metrics, logs, and traces
    • Integration into shareable dashboards across multiple data sources