Gaming Analytics: Leveraging AWS Glue and Amazon Redshift Spectrum for Player Insights

Introduction

In the dynamic landscape of game development, efficient data management and analysis are pivotal for optimizing player experiences and driving business growth. Game developers and analysts often encounter the challenge of aggregating data from diverse sources, ranging from real-time operational metrics to historical analytical records. To address these challenges, AWS provides a robust suite of analytics services, including AWS Glue for data preparation and transformation and Amazon Redshift Spectrum for seamless querying of data across data warehouses and data lakes. This article explores the integration of AWS Glue and Amazon Redshift Spectrum to streamline the process of joining operational and analytical data for gaming analytics. By leveraging these services, game developers can extract valuable insights from disparate data sources while minimizing development effort and operational costs.

A cloud computing architecture diagram showing Amazon RDS for MySQL connected with an arrow to AWS Glue, which is connected with an arrow to Amazon S3 (Parquet data), which is connected to Amazon Redshift Spectrum (external tables). The Redshift Spectrum icon has an arrow to the Amazon Redshift cluster (tables) icon and another arrow to View (reports, dashboards, etc.).

Redshift Spectrum and AWS Glue setup requirements

To illustrate this integration, you’ll use Amazon Aurora MySQL-Compatible Edition for operational data and Amazon Redshift for analytical data storage. This scenario involves joining player data from Amazon Aurora MySQL with player statistics stored in Amazon Redshift. Before diving into the implementation, you’ll step through the prerequisite setup, including creating Amazon Virtual Private Cloud (Amazon VPC) endpoints and the appropriate AWS Identity and Access Management (IAM) roles, and downloading a database driver for connectivity.

Security prerequisites

  1. Add a self-referential rule to the Amazon Aurora MySQL security group.

A screenshot of the EC2 security group rules showing a self-referential rule. The Source of the rule is highlighted to show the security group ID being added for the referenced security group.

  2. Create an AWS Glue role to call other AWS services.

A screenshot of the MyGlueServiceRole and the AWSGlueServiceRole highlighted to show inclusion in the custom role.

  3. Create a Redshift Spectrum role to allow Amazon Redshift to call other AWS services. The Amazon Redshift CREATE EXTERNAL SCHEMA command uses this role.

A screenshot of the MyRedshiftSpectrumRole and the AWSGlueServiceRole highlighted to show inclusion in the custom role. Also, an arrow indicated selecting the 'Create inline policy' option.

  4. Add an inline policy to MyRedshiftSpectrumRole to allow actions for Amazon Simple Storage Service (Amazon S3), AWS Glue, and AWS Lake Formation.
    a. Choose the Permissions tab, Add permissions and Create inline policy.
    b. Under Specify permissions, toggle JSON and paste the below policy in Policy editor.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:DeleteDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:UpdateDatabase",
        "glue:CreateTable",
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:UpdateTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:BatchCreatePartition",
        "glue:CreatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition",
        "glue:UpdatePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition",
        "lakeformation:GetDataAccess"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
    c. Select Next and Create policy.

Endpoint prerequisites

  1. To allow AWS Glue access to Amazon S3 from within an Amazon VPC, create an Amazon VPC Gateway Endpoint for Amazon S3.

AWS console screenshot of VPC s3 endpoint showing the name of the endpoint and highlighting the type of endpoint: gateway.

  2. AWS Glue requires an Amazon VPC interface endpoint to use a JDBC connection; this endpoint enables networking between AWS Glue and Amazon Aurora MySQL.

AWS console screenshot of VPC Glue endpoint showing the name of the endpoint and highlighting the type of endpoint: interface.

JDBC driver prerequisite

1. For the final prerequisite step, download and store the latest MySQL driver in Amazon S3. The JDBC connection requires the driver for AWS Glue to crawl the Amazon Aurora MySQL table.

2. To download the connector, choose mysql-connector-j-8.3.0.zip

3. Create an Amazon S3 bucket to host the preceding driver and upload the driver jar to the bucket. No bucket policy is needed for access because the AWS Glue service role created above has the necessary read permissions.

Operational and analytics data stores

  1. Queries use an Amazon Aurora MySQL table:

The schema of the player table: id INT(11), player_id BIGINT(20), community_id INT(11), created_ts DATETIME, updated_ts DATETIME

  2. And an Amazon Redshift provisioned cluster table:

Redshift provisioned cluster table schema: player_id BIGINT, total_seconds_played DOUBLE PRECISION, total_session_cnt BIGINT, total_payment_cnt BIGINT, total_payment_amount DOUBLE PRECISION
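
For reference, the two source tables can be approximated with DDL similar to the following minimal sketch. The Aurora MySQL key definition and the Amazon Redshift table name (playerdb_player, taken from the queries later in this article) are assumptions, and distribution and sort keys are omitted.

-- Amazon Aurora MySQL operational table in the playerDB database (key definition assumed)
CREATE TABLE player (
  id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
  player_id BIGINT(20),
  community_id INT(11),
  created_ts DATETIME,
  updated_ts DATETIME
);

-- Amazon Redshift provisioned cluster table (table name assumed from the later queries)
CREATE TABLE playerdb_player (
  player_id BIGINT,
  total_seconds_played DOUBLE PRECISION,
  total_session_cnt BIGINT,
  total_payment_cnt BIGINT,
  total_payment_amount DOUBLE PRECISION
);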

With prerequisites in place and data stores defined, let’s revisit our objectives: extracting operational data from Amazon Aurora MySQL, transforming it by eliminating unnecessary columns, and loading it into Amazon S3 with date-format partitions for seamless querying via Redshift Spectrum. The extract, transform, and load (ETL) operations must be repeatable to accommodate recurring reporting needs. AWS Glue provides built-in Apache Spark and Python environments for executing transformation jobs, along with data connectors and workflow orchestration capabilities. The strategy is to deploy connections, crawlers, jobs, and a workflow to prepare the data for integration with Amazon Redshift tables.

Data extraction and transformation with AWS Glue

  1. First, define data sources in AWS Glue by creating crawlers. These crawlers will scan the Amazon Aurora MySQL instance and the data stored in Amazon S3, updating the AWS Glue Data Catalog with schema and partition information.
  2. Create AWS Glue connections for Amazon Aurora MySQL and Amazon S3 to allow AWS Glue crawlers to connect to the database instance for data extraction and to write data to Amazon S3. Add the Amazon S3 bucket hosting the MySQL connector jar file and use com.mysql.cj.jdbc.Driver for ‘Driver class name’.

Screenshot of AWS Glue connection for aurora mysql with name of connection, Aurora MySQL Connection, highlighted. Also highlighted: Connection URL Driver class name: com.mysql.cj.jdbc.Driver Driver path to s3 bucket with connector jar file

  3. To read objects in Amazon S3 and update the AWS Glue Data Catalog with the transformed schema and new partitions, create an AWS Glue connection of the Network type.

Screenshot of AWS Glue connection for S3 with connector type : network highlighted

Configure AWS Glue data crawlers

  1. Define two crawlers: one to extract data from Amazon Aurora MySQL and one to catalog the data written to Amazon S3. The rds_player_db_crawler uses the JDBC connection and identifies the playerDB database and player table as the data source.

Screenshot of AWS Glue console crawler configuration with crawler name "rds_player_crawler" highlighted, data source tab pointed out with arrow and the data source type and datasource highlighted: Type: JDBC Data Source: playerDB/player

  2. The s3_player_db_crawler uses the previously created network connection to support crawling of Amazon S3 objects and updating the AWS Glue Data Catalog with table and partition metadata.

Screenshot of AWS Glue console crawler configuration with crawler name "s3_player_db_crawler" highlighted, data source tab pointed out with arrow and the data source type and datasource highlighted: Type: S3 Data Source: s3 bucket (obscured) slash player-db/player Parameters: Recrawl all

Transform data with AWS Glue jobs

1. Next, create an AWS Glue job to transform the operational data extracted from Amazon Aurora MySQL. The job involves dropping redundant columns, formatting data, and writing the transformed data to Amazon S3 in compressed Apache Parquet format. Additionally, you’ll generate a timestamp parameter to facilitate partitioning for optimized query performance and cost efficiency.

2. Create a transform_player_table job using the Apache Spark runtime environment and the Aurora MySQL connection.  Portions of transform_player_table were generated using Amazon Q data integration in AWS Glue.

import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from datetime import datetime

# Convert job runtime to timestamp format, format for partition query
job_run_timestamp = datetime.now().strftime("year=%Y/month=%m/day=%d/hour=%H/minute=%M")
runtime_params_string = job_run_timestamp.replace("/", ",")

# Set s3 partition path
output_path = f"s3://EXAMPLE-BUCKET/player-db/player/{job_run_timestamp}/"

# Create glue session
glueContext = GlueContext(SparkContext.getOrCreate())

# Get runtime parameters
glue_client = boto3.client("glue")
args = getResolvedOptions(sys.argv, ["JOB_NAME", "WORKFLOW_NAME", "WORKFLOW_RUN_ID"])
workflow_name = args["WORKFLOW_NAME"]
workflow_run_id = args["WORKFLOW_RUN_ID"]
workflow_params = glue_client.get_workflow_run_properties(
    Name=workflow_name, RunId=workflow_run_id
)["RunProperties"]

# Set runtime parameters with the formatted timestamp
workflow_params["job_run_timestamp"] = runtime_params_string
glue_client.put_workflow_run_properties(
    Name=workflow_name, RunId=workflow_run_id, RunProperties=workflow_params
)

# Create a frame for player table
playerdb_player = glueContext.create_dynamic_frame.from_catalog(
    database="default", table_name="playerdb_player"
)

# Drop the unnecessary id column
playerdb_player = playerdb_player.drop_fields(paths=["id"])

# Write frame to s3 in compressed parquet with partitioned path
glueContext.write_dynamic_frame.from_options(
    frame=playerdb_player,
    connection_type="s3",
    connection_options={"path": output_path, "region": "us-west-2"},
    format="parquet",
    format_options={"compression": "SNAPPY"},
)

Screenshot of AWS Glue job details with name: transform_player_table highlighted. Following are selected as configuration choices: IAM role: MyGlueServiceRole Type: Spark Glue Version: Glue 4.0 Language: Python 3 Worker type: G 1X
  3. The job writes the operational data to Amazon S3 with timestamped prefixes in compressed Parquet format. The Amazon S3 crawler run then updates the AWS Glue Data Catalog with the new partition metadata. This partitioning scheme supports the reporting requirements and optimizes queries for cost and performance. When querying the external Redshift Spectrum table, constraining by partition reduces the amount of data scanned and the costs associated with Redshift Spectrum usage, as illustrated by the sketch below.
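
For illustration only, the external table that the s3_player_db_crawler registers in the AWS Glue Data Catalog looks roughly like the DDL below; the crawler creates and maintains it automatically, so you do not need to run this yourself. Column types, partition-key types, and the bucket name are assumptions based on the schemas and the S3 path used by the job.

CREATE EXTERNAL TABLE player_stats_s.playerdb_player (
  player_id BIGINT,
  community_id INT,
  created_ts TIMESTAMP,
  updated_ts TIMESTAMP
)
PARTITIONED BY (year VARCHAR(8), month VARCHAR(8), day VARCHAR(8), hour VARCHAR(8), minute VARCHAR(8))
STORED AS PARQUET
LOCATION 's3://EXAMPLE-BUCKET/player-db/player/';

-- If a new S3 prefix is written outside of a crawler run, the partition can be registered manually;
-- the partition values mirror the year=/month=/day=/hour=/minute= prefixes written by the Glue job
ALTER TABLE player_stats_s.playerdb_player
ADD IF NOT EXISTS PARTITION (year='2024', month='03', day='14', hour='14', minute='00')
LOCATION 's3://EXAMPLE-BUCKET/player-db/player/year=2024/month=03/day=14/hour=14/minute=00/';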

AWS Glue workflows for data pipeline orchestration

  1. With the data extraction and transformation processes defined, orchestrate the pipeline using AWS Glue workflows. By creating a workflow, you automate the execution of the AWS Glue crawlers and the transformation job, ensuring a seamless and repeatable process for preparing data for analysis.

Screenshot of AWS Glue console Workflows page for player_db_to_datawarehouse workflow. Key for default run properties is highlighted with value of job_run_timestamp. The graph section of the page shows a partial view of the workflow including: Start: player_db_snapshot crawler Next step: rds_player_db_crawler Next step: crawler completed Next step: transform_player_tables

Redshift Spectrum setup

Before querying the data, set up Redshift Spectrum to access data stored in Amazon S3. This involves creating an external schema in Amazon Redshift that mirrors the schema of the transformed data stored in Amazon S3.

1. The CREATE EXTERNAL SCHEMA command requires a schema name, a Data Catalog database name, a Region, and the MyRedshiftSpectrumRole created in the prerequisite steps.

CREATE EXTERNAL SCHEMA IF NOT EXISTS player_stats_s
FROM DATA CATALOG DATABASE 'default'
IAM_ROLE 'arn:aws:iam::111111111111:role/MyRedshiftSpectrumRole'
REGION 'us-west-2'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

  2. For the last Redshift Spectrum setup step, GRANT USAGE on the external schema to the required users.

GRANT USAGE ON SCHEMA player_stats_s TO PUBLIC;
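
As an optional sanity check, you can confirm that the external schema and the crawler-created table are visible to Amazon Redshift by querying the SVV_EXTERNAL_SCHEMAS and SVV_EXTERNAL_TABLES system views:

SELECT schemaname, databasename FROM svv_external_schemas WHERE schemaname = 'player_stats_s';

SELECT schemaname, tablename, location FROM svv_external_tables WHERE schemaname = 'player_stats_s';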

Query data with Redshift Spectrum

Once the integration is complete, query the transformed dataset using Amazon Redshift’s SQL capabilities. By leveraging Redshift Spectrum, we can query data stored in Amazon S3 alongside data in an Amazon Redshift cluster, supporting powerful analytics and reporting capabilities.

Select all available partitions:

SELECT schemaname, tablename, VALUES, location FROM svv_external_partitions;

Select all rows from a specific partition:

SELECT * FROM player_stats_s.playerdb_player
WHERE (year = '2024' and month = '3' and day = '14' and hour = '14' and minute = '0');

Select all rows joining the Amazon Redshift provisioned cluster table and the Redshift Spectrum external table for a specific partition:

SELECT COUNT(*)
FROM playerdb_player
JOIN player_stats_s.playerdb_player on playerdb_player.player_id = player_stats_s.playerdb_player.player_id
WHERE (year = '2024' AND month = '3' AND day = '14' AND hour = '14' AND minute = '0');

With data available from both data sources, we can join Amazon Redshift provisioned cluster tables and Redshift Spectrum external tables. Let’s query to find players and communities with the highest number of seconds played:

SELECT DISTINCT playerdb_player.player_id, player_stats_s.playerdb_player.community_id, playerdb_player.total_seconds_played
FROM playerdb_player
JOIN player_stats_s.playerdb_player on playerdb_player.player_id = player_stats_s.playerdb_player.player_id
WHERE (year = '2024' AND month = '3' AND day = '14' AND hour = '14' AND minute = '0')
ORDER BY playerdb_player.total_seconds_played DESC;

Lastly, communities with the highest payment amounts:

SELECT player_stats_s.playerdb_player.community_id, SUM(playerdb_player.total_payment_amount) AS total_payment_amount
FROM playerdb_player
JOIN player_stats_s.playerdb_player on playerdb_player.player_id = player_stats_s.playerdb_player.player_id
WHERE (year = '2024' AND month = '3' AND day = '14' AND hour = '14' AND minute = '0')
GROUP BY player_stats_s.playerdb_player.community_id
ORDER BY total_payment_amount DESC;
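
Because Redshift Spectrum pricing is based on the amount of Amazon S3 data scanned, it is worth confirming that the partition predicates are actually pruning data. One optional diagnostic is the SVL_S3QUERY_SUMMARY system view, which records the rows and bytes scanned in Amazon S3 for recent Spectrum queries:

SELECT query, external_table_name, s3_scanned_rows, s3_scanned_bytes
FROM svl_s3query_summary
ORDER BY query DESC
LIMIT 10;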

Cleaning up

You’ve now successfully created an AWS Glue workflow to join operational and analytics data. To avoid ongoing charges for resources you created following the steps detailed in this article, you should delete:

  1. The Amazon Aurora MySQL database and table used as the operational data source.
  2. The Amazon Simple Storage Service (Amazon S3) buckets used for the MySQL JDBC driver and the AWS Glue job output location.
  3. The Amazon VPC interface and gateway endpoints.
  4. The AWS Glue crawlers, job, and workflow.
  5. The Amazon Redshift provisioned cluster table and the Redshift Spectrum external table (see the example statements below).
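
For step 5, the Amazon Redshift objects can be removed with SQL similar to the following sketch. The provisioned cluster table name is assumed from the earlier queries, and dropping the external table also removes it from the AWS Glue Data Catalog; DROP TABLE on an external table can't run inside a transaction block.

-- Remove the Redshift Spectrum external table and external schema
DROP TABLE player_stats_s.playerdb_player;
DROP SCHEMA player_stats_s;

-- Remove the provisioned cluster table (name assumed)
DROP TABLE playerdb_player;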

Conclusion

AWS Glue and Redshift Spectrum provide game developers and analysts with a robust platform for combining, transforming, and analyzing data from disparate sources. By automating the extract, transform, and load (ETL) processes with AWS Glue, organizations can optimize costs and operational efficiency. Leveraging the querying capabilities of Amazon Redshift Spectrum, they can also derive actionable insights from their data.

In the fast-paced world of game development, where data-driven decisions are paramount, the integration of AWS Glue and Redshift Spectrum offers a scalable and cost-effective solution. This integration unlocks the full potential of gaming analytics by providing a powerful combination of data processing and querying capabilities. By harnessing the power of these AWS services, game developers can gain deeper insights into player behavior, drive engagement, and ultimately, deliver exceptional gaming experiences.

Author: Steve Phillips