AWS MSK Kafka To Process Event/Streaming Data
As we know, Apache Kafka is a distributed event streaming platform capable of handling large volumes of data from multiple sources.
It works in near real time, which makes it ideal for real-time data processing, where the processed data can be used for auditing, tracking, or immediate push notifications.
If you run open source Apache Kafka on your own, you have to do all the setup yourself: you need servers configured as a Kafka cluster, and even then scaling remains a challenge. To build a scalable Kafka solution, it is usually best to use a managed cloud offering, and AWS provides MSK, a managed Kafka cluster where all the infrastructure is handled by AWS so you can focus on your code and business logic.
Today I ran an experiment to capture logs from multiple EC2 instances: each instance produces events/messages that go to AWS MSK (Kafka), and from there AWS Glue processes the events/messages and sends the processed output to Redshift.
Here is the architecture diagram I wanted to achieve; below you will find the high-level steps if you want to try this solution yourself.
Demo Steps:
- Create an MSK cluster and, once it is successfully created, copy the bootstrap server and ZooKeeper connection strings to use in the steps below.
- Bootstrap servers: needed to produce and consume messages on a topic and store them in the Kafka cluster.
- ZooKeeper connection: needed to set up/create topics.
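- You can also fetch both strings from the AWS CLI instead of the console. A minimal sketch, assuming the CLI is configured and <ClusterArn> is your cluster's ARN:
# Bootstrap broker string(s)
aws kafka get-bootstrap-brokers --cluster-arn <ClusterArn>
# Cluster details, including ZookeeperConnectString
aws kafka describe-cluster --cluster-arn <ClusterArn>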
- Create a Linux EC2 instance as the client (event/message producer) and connect to it over SSH.
- Note: Make sure the client EC2 instance has an IAM role attached with the "AmazonMSKFullAccess" permission.
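- Optional sanity check (assuming the AWS CLI is installed on the instance): confirm the instance role credentials are being picked up.
# Should return the account and the assumed-role ARN of the attached instance role
aws sts get-caller-identity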
- Run the commands below on your EC2 instance:
- Install Java: sudo yum install java-1.8.0
- Download Kafka: wget https://archive.apache.org/dist/kafka/2.6.2/kafka_2.12-2.6.2.tgz
- Extract the archive: tar -xzf kafka_2.12-2.6.2.tgz
- Go into the Kafka bin directory: cd kafka_2.12-2.6.2/bin/
- Now run the command below to create the topic MSKTutorialTopic:
./kafka-topics.sh --create --zookeeper <ZookeeperConnectString> --replication-factor 3 --partitions 1 --topic MSKTutorialTopic
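- The producer and consumer commands below reference a client.properties file for TLS. A minimal sketch, assuming the default OpenJDK trust store path on Amazon Linux (your JVM path may differ):
# Copy the JVM trust store to use as the Kafka client trust store
cp /usr/lib/jvm/java-1.8.0-openjdk*/jre/lib/security/cacerts /tmp/kafka.client.truststore.jks
# Create client.properties next to the Kafka CLI scripts
cat > client.properties <<'EOF'
security.protocol=SSL
ssl.truststore.location=/tmp/kafka.client.truststore.jks
EOF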
- Run the command below to connect to the Kafka cluster as a producer, then type messages to publish them to the topic:
./kafka-console-producer.sh --broker-list <BootstrapBrokerString> --producer.config client.properties --topic MSKTutorialTopic
- Run the consumer below in a new tab to read messages in real time:
./kafka-console-consumer.sh --bootstrap-server <BootstrapBrokerStringTls> --consumer.config client.properties --topic MSKTutorialTopic --from-beginning
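- To feed real log events from the EC2 instance instead of typing messages by hand, you can pipe a log file into the console producer. A sketch only; the log path and the need for sudo are assumptions about your instance:
# Stream new lines from the system log into the topic
sudo tail -F /var/log/messages | ./kafka-console-producer.sh --broker-list <BootstrapBrokerString> --producer.config client.properties --topic MSKTutorialTopic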
Now let's see how I pushed the same messages to Redshift using AWS Glue.
- Redshift:
- Create a cluster/database
- Once the cluster status is active, create a table using the query editor, for example: create table msk_message(message VARCHAR(MAX))
- Glue:
- Add a connection for MSK Kafka
- Add a connection for Redshift
- Create a database
- Create a table manually, choose Kafka as the source, enter the topic name on which messages will be published, and select the MSK Kafka connection you created
- Create a crawler job - this is basically where we tell Glue where the data will be put, so select the Redshift connection created above as the source
- From the ETL section, create a job (I used the legacy option here)
- Select as the source the database table you created for the MSK Kafka connection above
- Select as the target the Redshift table you created for the Redshift connection above
- On the last step it will generate a script (code) that is stored in your S3 location; based on this code the job pulls data from Kafka and pushes it to Redshift
- Once it is created, you can run it manually or on a schedule (or from the CLI, as shown below), and you can check the status, metrics, and logs from the different tabs at the bottom.
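- If you would rather script the run than use the console, here is a sketch with the AWS CLI (the job name placeholder is whatever you named the Glue job above):
# Start the job; the response contains a JobRunId
aws glue start-job-run --job-name <GlueJobName>
# Check the status of that run
aws glue get-job-run --job-name <GlueJobName> --run-id <JobRunId>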
- Run the Glue job by selecting it; after the Glue job executes successfully, I can see in Redshift the same messages I had published to Kafka.
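- To verify from the command line instead of the Redshift query editor, the Redshift Data API can run the check. A sketch; the cluster identifier, database, and user are placeholders for your own values:
# Submit the query; note the Id returned in the response
aws redshift-data execute-statement --cluster-identifier <ClusterId> --database <DatabaseName> --db-user <DbUser> --sql "select count(*) from msk_message"
# Fetch the result once the statement has finished
aws redshift-data get-statement-result --id <StatementId>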
Categories/Tags: kafka~aws msk