Back around the December to January time frame, I was trying to implement the Lambda Architecture as described by Nathan Marz. At that time, the early-release version of his upcoming Big Data book was only at chapter 5 or 6, but my goal was to tackle what seemed like the harder part: the real-time layer (Storm). The book chapters hadn’t yet caught up to it. A few slide decks mentioned fully thought-out, end-to-end Lambda Architecture implementations that included Storm, but no reliable, easy-to-deploy code was readily available from the interwebs.
While I was installing Storm, it quickly became apparent that running Kafka upstream of it was one way to support both real-time and batch processing of incoming data, and probably the path of least resistance. So I added installing Kafka to my to-do list.
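To make the "Kafka upstream of both layers" idea concrete, here is a minimal sketch in plain Python. It is not the Kafka API: `ToyLog`, `append`, and `poll` are hypothetical names invented for illustration. It only mimics the one property that matters here, namely that messages are kept in an append-only log and each consumer tracks its own read position, so a real-time reader and a batch reader can consume the same data independently.

```python
class ToyLog:
    """In-memory stand-in for a Kafka topic (hypothetical, not the Kafka API)."""

    def __init__(self):
        self.messages = []   # append-only list of messages
        self.offsets = {}    # consumer name -> index of the next message to read

    def append(self, message):
        self.messages.append(message)

    def poll(self, consumer, max_messages=None):
        # Each consumer resumes from its own offset, unaffected by the others.
        start = self.offsets.get(consumer, 0)
        end = len(self.messages)
        if max_messages is not None:
            end = min(end, start + max_messages)
        self.offsets[consumer] = end
        return self.messages[start:end]


log = ToyLog()
for tweet in ["a b", "b c", "c d"]:
    log.append(tweet)

realtime = log.poll("storm", max_messages=1)  # speed layer reads a little at a time
batch = log.poll("batch")                     # batch layer reads everything so far
rest = log.poll("storm")                      # speed layer picks up where it left off
```

Because the batch reader's position is separate from the Storm reader's, neither one steals messages from the other, which is roughly why a log in front of both layers is convenient.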
Cutting-edge technology means dealing with rough edges. I downloaded the latest versions of the relevant software components, but they didn’t work together. As I found out, the versions that finally did work with each other for me were not the most recent, but a version or two behind.
The code that I ended up with to get Kafka and Storm working together on a toy example using the Twitter Dev Stream is on GitHub here:
My twist on word count, the “hello world” of distributed computing since Hadoop, was to have the final bolt in Storm periodically emit the top 5 most recently seen words. Although I was following Packt’s Storm book, I switched over to Clojure for the code and deployment only because the Java + Maven combination was getting hard to read and use. To be fair, in hindsight, I may have inserted an extra space in the Maven command for deploying an early example, at a point where the command wrapped from one line to the next in the book.
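For readers who want to see the shape of that final step without any Storm machinery, here is a rough sketch in plain Python rather than the Clojure from the repo. It assumes "most recently seen" means "most frequent among the last N words", which is only one possible reading; the class name `RecentTopWords` and the `window` and `top_n` parameters are made up for illustration and are not taken from the actual code.

```python
from collections import Counter, deque


class RecentTopWords:
    """Counts words over a sliding window of the last `window` words (hypothetical helper)."""

    def __init__(self, window=1000, top_n=5):
        self.window = window
        self.top_n = top_n
        self.recent = deque()    # words in arrival order, oldest on the left
        self.counts = Counter()  # counts for words currently in the window

    def observe(self, word):
        # Roughly what a bolt would do for each incoming word.
        self.recent.append(word)
        self.counts[word] += 1
        if len(self.recent) > self.window:
            old = self.recent.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self):
        # Roughly what a bolt would emit on each periodic tick.
        return self.counts.most_common(self.top_n)


tracker = RecentTopWords(window=6, top_n=2)
for w in "storm kafka storm trident kafka storm".split():
    tracker.observe(w)
print(tracker.top())  # [('storm', 3), ('kafka', 2)]
```

In a real topology the `top()` call would be driven by some kind of timer rather than called by hand, and the window could just as well be defined by time instead of by word count.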
In the meantime, you may come across more up-to-date or fleshed-out examples of Storm, Kafka, etc. implementing the Lambda Architecture. I haven’t caught up with all of the new chapters of Big Data, but it seems to use a Trident Kafka Spout. That is a good thing, since Trident seems to be for Storm what Cascading is for Hadoop, and Trident also has extra goodies like exactly-once processing semantics. For those who have the time, finding a Lambda Architecture example that uses later release versions of components that work together, and that also includes Trident, would be great, and the Trident Kafka Spout may be a good place to start.