Continuing on the log analysis journey, in this post I explore Apache Storm. Apache Storm is a framework for real time, distributed, fault tolerant computation. Storm gives you a set of abstractions to help build systems that can analyze a large volume of streaming data in real time. Here is an excellent talk on Storm by its creator, Nathan Marz, for anyone who wants to dive into the details.
If you don’t have time for an hour long presentation, following is a summary of how Storm works.
Storm does all the processing in memory and leaves the persistence layer implementation to user/developer. The core abstraction in Storm is the stream. A stream is an unbounded or continuous sequence of data, e.g. incoming logs. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way. For example, you may transform a stream of tweets into a stream of trending topics hence providing data aggregation. Storm’s computation model is relatively easy to grasp. A complex analysis engine will boil down to a topology of spouts and bolts which are all share nothing. Spouts send data into the system and bolts process them. Storm do not provides a persistent state but a lot of storage connectors modules are already built and can be used for persisting the data processed by Storm.
For experimentation with Storm, I used a slightly modified version of word count example provided by Storm starter kit. On my development machine, I was able to process 18 million words in less than a minute or 300K/sec. It is important to keep few points in mind here;
- 300K/sec is a big number for a start, a fine tuned cluster would be able to perform far better.
- Because Storm does all the processing in memory, unless you are doing some heavy operations in bolts, this model can process data really fast.
- Unlike ELK Stack, where you can do an end-to-end processing, Storm is just one part of the big puzzle. In order to effectively use Storm, you will have to combine many bits and pieces to get a fully functional pipeline. Lambda architecture is a famous pattern used for creating such pipelines.