Streaming Data with Kafka and Microservices by Dean Wampler



A software architect, software design and patterns expert

Dean Wampler, Ph.D., is a member of the Office of the CTO and the Architect for Big Data Products and Services at Typesafe. He uses Scala and Functional Programming to build Big Data systems using Spark, Mesos, Hadoop, the Typesafe Reactive Platform, and other tools. Dean is the author or co-author of three O’Reilly books on Scala, Functional Programming, and Hive. He contributes to several open source projects (including Spark) and he co-organizes and speaks at many technology conferences and Chicago-based user groups.

Workshop: Streaming Data with Kafka and Microservices

When we think of modern data processing, we often think of batch-oriented ecosystems like Hadoop, including processing engines like Spark. However, the sooner we can extract useful information from our data, the better, which is driving an evolution towards stream processing or “fast data”. Many of the legacy tools, including Spark, provide various levels of support for stream processing, but deeper architectural changes are emerging.

In this hands-on workshop, we’ll start with a brief overview of the characteristics of streaming architectures:
Use cases driving this evolution to streaming architectures.
Kafka (or emerging alternatives) as the data backplane, to capture data streams as logs between producers and consumers.
When you should use the feature-rich and highly scalable processing engines, like Spark and Flink.
When you should use the more flexible and lower-latency data processing libraries, like Kafka Streams and Akka Streams, inside microservices.

Then we’ll work through code examples that use Akka Streams and Kafka Streams with Kafka to implement a machine-learning example. In this example, a model is updated periodically to simulate the problem of periodic retraining and serving of ML models in a streaming context. In particular, if you periodically retrain the model using one tool chain, for example once a day, how do you incorporate the updated model into a running pipeline for scoring without restarting the pipeline?
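To make the question concrete, here is a minimal, library-free sketch of one common answer: treat model updates as just another stream and swap the model in place as updates arrive. This is not the workshop's code. It uses plain Python instead of Akka Streams or Kafka Streams, and the names `ModelUpdate`, `Record`, and `score_stream` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple, Union


@dataclass(frozen=True)
class ModelUpdate:
    """A newly trained model arriving on a (hypothetical) model stream."""
    version: int
    score: Callable[[float], float]


@dataclass(frozen=True)
class Record:
    """A data record arriving on a (hypothetical) data stream."""
    value: float


def score_stream(
    events: Iterable[Union[ModelUpdate, Record]],
) -> Iterator[Tuple[int, float]]:
    """Score each record with whichever model arrived most recently."""
    current = None
    for event in events:
        if isinstance(event, ModelUpdate):
            # Swap the model in place; the loop itself never stops.
            current = event
        elif current is not None:
            yield (current.version, current.score(event.value))
        # Records that arrive before any model are dropped in this sketch.


# Interleaved events stand in for two merged streams (assumed ordering).
events = [
    Record(1.0),                          # no model yet, so it is dropped
    ModelUpdate(1, lambda x: 2.0 * x),
    Record(3.0),
    ModelUpdate(2, lambda x: x + 10.0),   # "retrained" model arrives
    Record(3.0),
]
results = list(score_stream(events))
```

The same record value is scored differently before and after the second update, while the scoring loop keeps running throughout. A real pipeline would also have to decide how to handle records that arrive before the first model and how to persist the current model across restarts.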

NOTE: BEFORE THE TUTORIAL, please set up your laptop by cloning the following GitHub repo: You can also download the latest release (2.2.3 at the time of this writing). Then follow the README’s setup instructions.

If you need help, post questions to the Gitter channel for the GitHub repo.

Sydney Nov 28 | Melbourne Dec 5