
AWS MSK Kafka To Process Event/Streaming Data

As we know, Apache Kafka is a distributed event streaming platform capable of handling large volumes of data from multiple sources.

It works very close to real time, which makes it ideal for real-time data processing, where the processed data can be used for auditing, tracking, or immediate push notifications.

If you use open-source Apache Kafka on its own, you have to do all the setup yourself: you need servers configured as a Kafka cluster, and even then scaling remains a challenge. To build a scalable Kafka solution, it is usually best to use a managed cloud offering, and AWS provides MSK, a managed Kafka cluster where all the infrastructure is handled by AWS so you can focus on your code and business logic.

Today I ran an experiment to see how I can capture logs from multiple EC2 instances, which produce events/messages that go to AWS MSK (Kafka); from there, AWS Glue processes the events/messages and sends the processed output to Redshift.

 

Here is the architecture diagram I wanted to achieve, and below you will find the high-level steps if you want to try this solution yourself.

 

[Architecture diagram: EC2 instances (clients) → Amazon MSK → AWS Glue (crawler/job) → Redshift]

 

Demo Steps:

  • Create an MSK cluster and, once it is successfully created, copy the Bootstrap servers and Apache ZooKeeper connection strings to use in the steps below (a CLI sketch for fetching them follows this list).
    • Bootstrap servers: needed to produce and consume messages on a topic and store them in the Kafka cluster.
    • ZooKeeper connection: needed to set up/create the topic.
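
If you prefer the CLI to the console, the same connection strings can be fetched with the AWS CLI. This is a minimal sketch; <ClusterArn> is a placeholder for the ARN of the cluster you just created.

# Bootstrap broker string, used by producers and consumers
aws kafka get-bootstrap-brokers --cluster-arn <ClusterArn>

# Cluster details, including the Apache ZooKeeper connection string
aws kafka describe-cluster --cluster-arn <ClusterArn> --query 'ClusterInfo.ZookeeperConnectString'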

[Screenshot: Amazon MSK console > Clusters > demo-cluster-1 > View client information, showing the Bootstrap servers (port 9092) and the Apache ZooKeeper connection string]

 

  • Create a Linux EC2 instance as the client (event/message producer) and connect to the instance over SSH.
    • Note: make sure the client EC2 instance has an IAM role attached that grants the "AmazonMSKFullAccess" permission.
  • Run the setup commands on your EC2 instance (a sketch follows this list).
  • Then create the topic MSKTutorialTopic with the kafka-topics.sh command shown below the setup sketch.
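
The screenshots skip the actual client setup, so here is a minimal sketch of the usual steps, assuming Amazon Linux 2, the Kafka 2.12-2.6.2 build visible in the screenshots, and plaintext access on port 9092 (adjust the version and security settings to your cluster):

sudo yum install -y java-1.8.0
wget https://archive.apache.org/dist/kafka/2.6.2/kafka_2.12-2.6.2.tgz
tar -xzf kafka_2.12-2.6.2.tgz
cd kafka_2.12-2.6.2/bin
# client.properties tells the Kafka CLI tools how to connect to the brokers
echo "security.protocol=PLAINTEXT" > client.properties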

./kafka-topics.sh --create --zookeeper <ZookeeperConnectString> --replication-factor 3 --partitions 1 --topic MSKTutorialTopic
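
As an optional sanity check, you can list the topics against the same ZooKeeper connection to confirm the topic was created:

./kafka-topics.sh --list --zookeeper <ZookeeperConnectString>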

  • Run the command below to connect to the Kafka cluster as a producer, publish messages to the topic, and type a few messages:

./kafka-console-producer.sh --broker-list <BootstrapBrokerString> --producer.config client.properties --topic MSKTutorialTopic
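
Since the goal is to capture logs from the EC2 instances, one option is to pipe a log file straight into the console producer instead of typing messages by hand. This is just a sketch; /var/log/messages is an example path, not something from the original setup.

# Stream new log lines to the topic as they are written
sudo tail -F /var/log/messages | ./kafka-console-producer.sh --broker-list <BootstrapBrokerString> --producer.config client.properties --topic MSKTutorialTopic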

[Screenshot: EC2 session running kafka-console-producer.sh against the MSK bootstrap brokers on port 9092 and publishing test messages (test 5, test 6, test 7)]

 

  • Run the consumer below to read the messages in real time; open this in a new tab:

./kafka-console-consumer.sh --bootstrap-server <BootstrapBrokerStringTls> --consumer.config client.properties --topic MSKTutorialTopic --from-beginning
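
If you want several consumers to share the load, you can run them in the same consumer group; the --group flag is a standard console-consumer option, and the group name here is just an example:

./kafka-console-consumer.sh --bootstrap-server <BootstrapBrokerStringTls> --consumer.config client.properties --topic MSKTutorialTopic --group msk-demo-group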

[Screenshot: kafka-console-consumer.sh reading the published test messages from MSKTutorialTopic from the beginning]

 

Now let's see how I pushed the same messages to Redshift using AWS Glue.

 

  • Redshift:
    • Create the Redshift cluster/database.
    • Once the status is active, create a table using the query editor, for example: create table msk_message(message VARCHAR(max)) (a CLI sketch for this follows the list).
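
If you prefer scripting it, the same table can be created through the Redshift Data API from the CLI. This is a sketch; the cluster identifier, database name, and user are placeholders for whatever you provisioned.

# Run the DDL against the cluster without opening the query editor
aws redshift-data execute-statement \
    --cluster-identifier <redshift-cluster-id> \
    --database dev --db-user awsuser \
    --sql "create table msk_message(message VARCHAR(max))"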

 

  • Glue:
    • Add a connection for MSK Kafka.
    • Add a connection for Redshift (a CLI sketch for both connections follows this list).
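
I created both connections from the console, but they can also be sketched with the AWS CLI. The property keys below follow the Glue connection API; the VPC, subnet, and security-group requirements are omitted for brevity, and the endpoint/credential values are placeholders.

# MSK connection
aws glue create-connection --connection-input '{
  "Name": "MSK Connection",
  "ConnectionType": "KAFKA",
  "ConnectionProperties": { "KAFKA_BOOTSTRAP_SERVERS": "<BootstrapBrokerString>" }
}'

# Redshift connection (Glue treats it as a JDBC connection)
aws glue create-connection --connection-input '{
  "Name": "Redshift Connection",
  "ConnectionType": "JDBC",
  "ConnectionProperties": {
    "JDBC_CONNECTION_URL": "jdbc:redshift://<cluster-endpoint>:5439/dev",
    "USERNAME": "<user>",
    "PASSWORD": "<password>"
  }
}'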

[Screenshot: AWS Glue > Data catalog > Connections, listing "MSK Connection" and "Redshift Connection"]

 

  • Create a database.
  • Create a table manually, choose Kafka as the source, enter the topic name the messages will be published on, and select the MSK Kafka connection you created (a CLI sketch for the database follows this list).
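
Creating the catalog database can also be scripted; this is a minimal sketch, and "msk_db" is just a placeholder name. The Kafka-sourced table, with its topic and connection settings, is easier to create through the console wizard as described above.

# Create the Glue Data Catalog database
aws glue create-database --database-input '{"Name": "msk_db"}'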

[Screenshot: AWS Glue > Data catalog > Tables, showing the table "msk_messages" with location MSKTutorialTopic]

 

  • Create a crawler job; this basically defines where the data will be put, so select the Redshift connection created above as the data store (a CLI sketch follows).
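
A CLI sketch of the same crawler is below; the crawler name, role, catalog database, and JDBC path are placeholders that should match your own Redshift database/schema/table.

# Crawl the Redshift table through the Glue connection created above
aws glue create-crawler --name msk-redshift-crawler \
    --role <GlueServiceRoleArn> --database-name msk_db \
    --targets '{"JdbcTargets":[{"ConnectionName":"Redshift Connection","Path":"dev/public/msk_message"}]}'

aws glue start-crawler --name msk-redshift-crawler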

[Screenshot: AWS Glue > Crawlers > Add crawler]

 

  • From the ETL section, create a job (I used the legacy option here):
    • Select as source the Data Catalog table you created for the MSK Kafka connection above.
    • Select as target the Redshift table you created for the Redshift connection above.
    • On the last step, it generates a script (code) that is stored in your S3 location; based on this code, the job pulls the data and pushes it to the target.
    • Once the job is created, you can run it manually or on a schedule, and you can check the status, metrics, and logs from the different tabs at the bottom (a CLI sketch for running and monitoring the job follows this list).
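
Running and checking the job can also be done from the CLI; a sketch, with the job name as a placeholder:

# Start the job; the response returns a JobRunId
aws glue start-job-run --job-name <GlueJobName>

# Check the status, timings, and any error message for that run
aws glue get-job-run --job-name <GlueJobName> --run-id <JobRunId>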

[Screenshot: AWS Glue > Jobs (legacy), showing the job run details (type: Spark, ETL language: python) with the Details, Script, Metrics, Output, Logs, and Error logs tabs]

 

  • Select and run the Glue job; after it executes successfully, I can see in Redshift the same messages I had published to Kafka (a query sketch follows).
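
The same check can be scripted with the Redshift Data API instead of the query editor; the identifiers below are placeholders, as before.

# Run the query; the response returns a statement Id
aws redshift-data execute-statement \
    --cluster-identifier <redshift-cluster-id> \
    --database dev --db-user awsuser \
    --sql "select * from msk_message"

# Fetch the rows once the statement has finished
aws redshift-data get-statement-result --id <StatementId>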

[Screenshot: Redshift query editor v2 running select * from msk_message and returning the published messages, e.g. test 5]

 

Categories/Tags: kafka, aws msk
