
Simplify data streaming ingestion for analytics using Amazon MSK and Amazon Redshift


Towards the end of 2022, AWS announced the general availability of real-time streaming ingestion to Amazon Redshift for Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), eliminating the need to stage streaming data in Amazon Simple Storage Service (Amazon S3) before ingesting it into Amazon Redshift.

Streaming ingestion from Amazon MSK into Amazon Redshift represents a cutting-edge approach to real-time data processing and analysis. Amazon MSK serves as a highly scalable, fully managed service for Apache Kafka, allowing for seamless collection and processing of vast streams of data. Integrating streaming data into Amazon Redshift brings immense value by enabling organizations to harness the potential of real-time analytics and data-driven decision-making.

This integration enables you to achieve low latency, measured in seconds, while ingesting hundreds of megabytes of streaming data per second into Amazon Redshift. At the same time, the integration helps ensure that the most up-to-date information is available for analysis. Because the integration doesn't require staging data in Amazon S3, Amazon Redshift can ingest streaming data at lower latency and without intermediary storage cost.

You can configure Amazon Redshift streaming ingestion on a Redshift cluster using SQL statements to authenticate and connect to an MSK topic. This solution is an excellent option for data engineers who want to simplify data pipelines and reduce their operational cost.

In this post, we provide a complete overview of how to configure Amazon Redshift streaming ingestion from Amazon MSK.

Solution overview

The following architecture diagram describes the AWS services and features you will be using.


The workflow includes the following steps:

  1. You start by configuring an Amazon MSK Connect source connector, to create an MSK topic, generate mock data, and write it to the MSK topic. For this post, we work with mock customer data.
  2. The next step is to connect to a Redshift cluster using the Query Editor v2.
  3. Finally, you configure an external schema and create a materialized view in Amazon Redshift, to consume the data from the MSK topic. This solution does not rely on an MSK Connect sink connector to export the data from Amazon MSK to Amazon Redshift.

The following solution architecture diagram describes in more detail the configuration and integration of the AWS services you will be using.

The workflow includes the following steps:

  1. You deploy an MSK Connect source connector, an MSK cluster, and a Redshift cluster within the private subnets of a VPC.
  2. The MSK Connect source connector uses granular permissions defined in an AWS Identity and Access Management (IAM) in-line policy attached to an IAM role, which allows the source connector to perform actions on the MSK cluster.
  3. The MSK Connect source connector logs are captured and sent to an Amazon CloudWatch log group.
  4. The MSK cluster uses a custom MSK cluster configuration, allowing the MSK Connect connector to create topics on the MSK cluster.
  5. The MSK cluster logs are captured and sent to an Amazon CloudWatch log group.
  6. The Redshift cluster uses granular permissions defined in an IAM in-line policy attached to an IAM role, which allows the Redshift cluster to perform actions on the MSK cluster.
  7. You can use the Query Editor v2 to connect to the Redshift cluster.

Prerequisites

To simplify the provisioning and configuration of the prerequisite resources, you can use the following AWS CloudFormation template:

Complete the following steps when launching the stack:

  1. For Stack name, enter a meaningful name for the stack, for example, prerequisites.
  2. Choose Next.
  3. Choose Next.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  5. Choose Submit.

The CloudFormation stack creates the following resources:

  • A VPC custom-vpc, created across three Availability Zones, with three public subnets and three private subnets:
    • The public subnets are associated with a public route table, and outbound traffic is directed to an internet gateway.
    • The private subnets are associated with a private route table, and outbound traffic is sent to a NAT gateway.
  • An internet gateway attached to the Amazon VPC.
  • A NAT gateway that is associated with an elastic IP and is deployed in one of the public subnets.
  • Three security groups:
    • msk-connect-sg, which will later be associated with the MSK Connect connector.
    • redshift-sg, which will later be associated with the Redshift cluster.
    • msk-cluster-sg, which will later be associated with the MSK cluster. It allows inbound traffic from msk-connect-sg and redshift-sg.
  • Two CloudWatch log groups:
    • msk-connect-logs, to be used for the MSK Connect logs.
    • msk-cluster-logs, to be used for the MSK cluster logs.
  • Two IAM roles:
    • msk-connect-role, which includes granular IAM permissions for MSK Connect.
    • redshift-role, which includes granular IAM permissions for Amazon Redshift.
  • A custom MSK cluster configuration, allowing the MSK Connect connector to create topics on the MSK cluster.
  • An MSK cluster, with three brokers deployed across the three private subnets of custom-vpc. The msk-cluster-sg security group and the custom-msk-cluster-configuration configuration are applied to the MSK cluster. The broker logs are delivered to the msk-cluster-logs CloudWatch log group.
  • A Redshift cluster subnet group, which uses the three private subnets of custom-vpc.
  • A Redshift cluster, with a single node deployed in a private subnet within the Redshift cluster subnet group. The redshift-sg security group and the redshift-role IAM role are applied to the Redshift cluster.

Create an MSK Connect custom plugin

For this post, we use an Amazon MSK data generator deployed in MSK Connect, to generate mock customer data and write it to an MSK topic.

Complete the following steps:

  1. Download the Amazon MSK data generator JAR file with dependencies from GitHub.
  2. Upload the JAR file into an S3 bucket in your AWS account.
  3. On the Amazon MSK console, choose Custom plugins under MSK Connect in the navigation pane.
  4. Choose Create custom plugin.
  5. Choose Browse S3, search for the Amazon MSK data generator JAR file you uploaded to Amazon S3, then choose Choose.
  6. For Custom plugin name, enter msk-datagen-plugin.
  7. Choose Create custom plugin.

When the custom plugin is created, you will see that its status is Active, and you can move to the next step.

Create an MSK Connect connector

Complete the following steps to create your connector:

  1. On the Amazon MSK console, choose Connectors under MSK Connect in the navigation pane.
  2. Choose Create connector.
  3. For Custom plugin type, choose Use existing plugin.
  4. Select msk-datagen-plugin, then choose Next.
  5. For Connector name, enter msk-datagen-connector.
  6. For Cluster type, choose Self-managed Apache Kafka cluster.
  7. For VPC, choose custom-vpc.
  8. For Subnet 1, choose the private subnet within your first Availability Zone.

For the custom-vpc created by the CloudFormation template, we are using odd CIDR ranges for the public subnets, and even CIDR ranges for the private subnets:

    • The CIDRs for the public subnets are 10.10.1.0/24, 10.10.3.0/24, and 10.10.5.0/24
    • The CIDRs for the private subnets are 10.10.2.0/24, 10.10.4.0/24, and 10.10.6.0/24
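This odd/even naming convention makes it easy to tell the subnet tiers apart at a glance. As a quick illustration (a minimal sketch using only the CIDRs listed above), the third octet of each subnet's network address identifies its tier:

```python
import ipaddress

# Subnet CIDRs from the CloudFormation template described above.
public_cidrs = ["10.10.1.0/24", "10.10.3.0/24", "10.10.5.0/24"]
private_cidrs = ["10.10.2.0/24", "10.10.4.0/24", "10.10.6.0/24"]

def third_octet(cidr: str) -> int:
    """Return the third octet of a subnet's network address."""
    return int(ipaddress.ip_network(cidr).network_address.packed[2])

# Public subnets use odd third octets; private subnets use even ones.
assert all(third_octet(c) % 2 == 1 for c in public_cidrs)
assert all(third_octet(c) % 2 == 0 for c in private_cidrs)
```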
  9. For Subnet 2, choose the private subnet within your second Availability Zone.
  10. For Subnet 3, choose the private subnet within your third Availability Zone.
  11. For Bootstrap servers, enter the list of bootstrap servers for TLS authentication of your MSK cluster.

To retrieve the bootstrap servers for your MSK cluster, navigate to the Amazon MSK console, choose Clusters, choose msk-cluster, then choose View client information. Copy the TLS values for the bootstrap servers.
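If you prefer the AWS SDK over the console, the TLS bootstrap string can also be fetched programmatically. The sketch below assumes boto3 is installed and AWS credentials are configured; the helper names are ours, not part of any AWS API:

```python
def get_tls_bootstrap_brokers(cluster_arn: str) -> list[str]:
    """Fetch the TLS bootstrap broker string for an MSK cluster and
    split it into individual host:port entries. Assumes boto3 is
    installed and AWS credentials are configured."""
    import boto3  # imported here so split_broker_string stays usable without AWS
    client = boto3.client("kafka")
    response = client.get_bootstrap_brokers(ClusterArn=cluster_arn)
    return split_broker_string(response["BootstrapBrokerStringTls"])

def split_broker_string(brokers: str) -> list[str]:
    """The API returns a single comma-separated string; MSK Connect
    expects that same format, but a list is handier for inspection."""
    return [b.strip() for b in brokers.split(",") if b.strip()]
```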

  12. For Security groups, choose Use specific security groups with access to this cluster, and choose msk-connect-sg.
  13. For Connector configuration, replace the default settings with the following:
connector.class=com.amazonaws.mskdatagen.GeneratorSourceConnector
tasks.max=2
genkp.customer.with=#{Code.isbn10}
genv.customer.name.with=#{Name.full_name}
genv.customer.gender.with=#{Demographic.sex}
genv.customer.favorite_beer.with=#{Beer.name}
genv.customer.state.with=#{Address.state}
genkp.order.with=#{Code.isbn10}
genv.order.product_id.with=#{number.number_between '101','109'}
genv.order.quantity.with=#{number.number_between '1','5'}
genv.order.customer_id.matching=customer.key
global.throttle.ms=2000
global.history.records.max=1000
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false

  14. For Connector capacity, choose Provisioned.
  15. For MCU count per worker, choose 1.
  16. For Number of workers, choose 1.
  17. For Worker configuration, choose Use the MSK default configuration.
  18. For Access permissions, choose msk-connect-role.
  19. Choose Next.
  20. For Encryption, select TLS encrypted traffic.
  21. Choose Next.
  22. For Log delivery, choose Deliver to Amazon CloudWatch Logs.
  23. Choose Browse, select msk-connect-logs, and choose Choose.
  24. Choose Next.
  25. Review and choose Create connector.
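With the connector configuration shown earlier, the data generator writes JSON-encoded values to the customer and order topics. The following is a hypothetical sketch of the payload shape (field names mirror the genv.customer.* settings above; real values are randomly generated at runtime):

```python
import json

# Hypothetical customer record shaped by the genv.customer.* settings above;
# the actual values are randomly generated by the MSK data generator.
sample_customer_value = {
    "name": "Jane Doe",
    "gender": "Female",
    "favorite_beer": "Pale Ale",
    "state": "Oregon",
}

# value.converter.schemas.enable=false means the payload is plain JSON,
# with no Kafka Connect schema envelope wrapped around it.
encoded = json.dumps(sample_customer_value)
decoded = json.loads(encoded)
assert decoded == sample_customer_value
assert "schema" not in decoded
```

Keeping the payload as plain JSON is what lets Amazon Redshift parse it directly with JSON_PARSE later in this post.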

After the custom connector is created, you will see that its status is Running, and you can move to the next step.

Configure Amazon Redshift streaming ingestion for Amazon MSK

Complete the following steps to set up streaming ingestion:

  1. Connect to your Redshift cluster using Query Editor v2, and authenticate with the database user name awsuser, and password Awsuser123.
  2. Create an external schema from Amazon MSK using the following SQL statement.

In the following code, enter the values for the redshift-role IAM role and the msk-cluster cluster ARN.

CREATE EXTERNAL SCHEMA msk_external_schema
FROM MSK
IAM_ROLE '<insert your redshift-role arn>'
AUTHENTICATION iam
CLUSTER_ARN '<insert your msk-cluster arn>';

  3. Choose Run to run the SQL statement.

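If you script your DDL instead of pasting it into the editor, the statement can be templated. The helper below is a minimal sketch that renders the same CREATE EXTERNAL SCHEMA statement from the two ARNs (the helper name and the example ARNs are ours, for illustration only):

```python
def render_external_schema_ddl(schema: str, iam_role_arn: str, cluster_arn: str) -> str:
    """Render the CREATE EXTERNAL SCHEMA statement shown above,
    substituting the caller's own ARNs."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        "FROM MSK\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "AUTHENTICATION iam\n"
        f"CLUSTER_ARN '{cluster_arn}';"
    )

# Example ARNs for illustration; replace with your own.
ddl = render_external_schema_ddl(
    "msk_external_schema",
    "arn:aws:iam::123456789012:role/redshift-role",
    "arn:aws:kafka:us-east-1:123456789012:cluster/msk-cluster/abc",
)
print(ddl)
```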

  4. Create a materialized view using the following SQL statement:
CREATE MATERIALIZED VIEW msk_mview AUTO REFRESH YES AS
SELECT
    "kafka_partition",
    "kafka_offset",
    "kafka_timestamp_type",
    "kafka_timestamp",
    "kafka_key",
    JSON_PARSE(kafka_value) as Data,
    "kafka_headers"
FROM
    "dev"."msk_external_schema"."customer";

  5. Choose Run to run the SQL statement.

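Because JSON_PARSE stores the message payload in a SUPER column, you can drill into individual JSON fields with dot notation in your queries. The sketch below only builds such query strings; the field names assume the customer topic configured earlier, and the helper name is ours:

```python
def build_field_query(view: str, column: str, fields: list[str], limit: int = 10) -> str:
    """Build a query string that projects individual JSON fields
    out of a SUPER column using dot notation, e.g. data.name."""
    projection = ", ".join(f"{column}.{f}" for f in fields)
    return f"SELECT {projection} FROM {view} LIMIT {limit};"

# Fields follow the genv.customer.* generator settings from earlier.
sql = build_field_query("msk_mview", "data", ["name", "state", "favorite_beer"])
print(sql)  # SELECT data.name, data.state, data.favorite_beer FROM msk_mview LIMIT 10;
```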

  6. You can now query the materialized view using the following SQL statement:
select * from msk_mview LIMIT 100;

  7. Choose Run to run the SQL statement.


  8. To monitor the progress of records loaded via streaming ingestion, you can take advantage of the SYS_STREAM_SCAN_STATES monitoring view using the following SQL statement:
select * from SYS_STREAM_SCAN_STATES;

  9. Choose Run to run the SQL statement.


  10. To monitor errors encountered on records loaded via streaming ingestion, you can take advantage of the SYS_STREAM_SCAN_ERRORS monitoring view using the following SQL statement:
select * from SYS_STREAM_SCAN_ERRORS;

  11. Choose Run to run the SQL statement.
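Both monitoring views can also be polled on a schedule outside the query editor, for example via the Redshift Data API. The sketch below assumes boto3 is installed and AWS credentials are configured; the helper name and labels are ours:

```python
# Monitoring queries from the steps above.
MONITORING_QUERIES = {
    "progress": "SELECT * FROM SYS_STREAM_SCAN_STATES;",
    "errors": "SELECT * FROM SYS_STREAM_SCAN_ERRORS;",
}

def submit_monitoring_queries(cluster_id: str, database: str, db_user: str) -> dict:
    """Submit both monitoring queries via the Redshift Data API and
    return a mapping of label -> statement id; results can then be
    polled with describe_statement/get_statement_result."""
    import boto3  # imported here so the module stays importable without AWS
    client = boto3.client("redshift-data")
    return {
        label: client.execute_statement(
            ClusterIdentifier=cluster_id,
            Database=database,
            DbUser=db_user,
            Sql=sql,
        )["Id"]
        for label, sql in MONITORING_QUERIES.items()
    }
```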

Clean up

After following along, if you no longer need the resources you created, delete them in the following order to prevent incurring additional charges:

  1. Delete the MSK Connect connector msk-datagen-connector.
  2. Delete the MSK Connect plugin msk-datagen-plugin.
  3. Delete the Amazon MSK data generator JAR file you downloaded, and delete the S3 bucket you created.
  4. After you delete your MSK Connect connector, you can delete the CloudFormation template. All the resources created by the CloudFormation template will be automatically deleted from your AWS account.

Conclusion

In this post, we demonstrated how to configure Amazon Redshift streaming ingestion from Amazon MSK, with a focus on privacy and security.

The combination of Amazon MSK's ability to handle high-throughput data streams with the robust analytical capabilities of Amazon Redshift empowers businesses to derive actionable insights promptly. This real-time data integration enhances the agility and responsiveness of organizations in understanding changing data trends, customer behaviors, and operational patterns. It allows for timely and informed decision-making, thereby gaining a competitive edge in today's dynamic business landscape.

This solution is also applicable for customers who want to use Amazon MSK Serverless and Amazon Redshift Serverless.

We hope this post was a good opportunity to learn more about AWS service integration and configuration. Let us know your feedback in the comments section.


About the authors

Sebastian Vlad is a Senior Partner Solutions Architect with Amazon Web Services, with a passion for data and analytics solutions and customer success. Sebastian works with enterprise customers to help them design and build modern, secure, and scalable solutions to achieve their business outcomes.

Sharad Pai is a Lead Technical Consultant at AWS. He specializes in streaming analytics and helps customers build scalable solutions using Amazon MSK and Amazon Kinesis. He has over 16 years of industry experience and is currently working with media customers who are hosting live streaming platforms on AWS, managing peak concurrency of over 50 million. Prior to joining AWS, Sharad's career as a lead software developer included 9 years of coding, working with open source technologies like JavaScript, Python, and PHP.
