Log Analysis Part 1


Big data processing in general, and log analysis in particular, is not new, but that doesn't mean it is easy. It is a hard problem to deal with. A lot of companies have written extensively about how they developed solutions around their needs, so the good news is that you can avoid mistakes by learning from the experiences of others.

There are a lot of tools and technologies for processing huge amounts of data. In this blog series I describe my experiments and findings on a few such systems. The machine used for these experiments is a MacBook Pro with a 2 GHz Intel Core i7 processor and 8 GB of 1600 MHz DDR3 RAM.

ELK Stack

ELK Stack, or Elastic Stack, is a combination of three technologies, Elasticsearch, Logstash, and Kibana, used to extract and display insights from large amounts of data in real time. Elasticsearch is a well-known search and analytics engine. Logstash is used for collecting and parsing logs. Kibana is a powerful UI layer that sits on top of Elasticsearch to visualize its data. Together, these three tools give you a powerful framework for processing and understanding huge amounts of data in real time.

The dataset used is a very simple set of log files. Each line in a log file represents exactly one log entry, consisting of a number field and a timestamp.

200010 1477304952234
200011 1477304952234
200012 1477304952235
200013 1477304952236
200014 1477304952237
200015 1477304952238
200016 1477304952239
200017 1477304952234

To read and process the log data, I created a Logstash config file containing one filter with a single grok block inside it. This filter extracts the number and the timestamp from each log entry.

filter {
  grok {
    match => { "message" => "%{GREEDYDATA:count}\t%{GREEDYDATA:timestamp}" }
  }
}
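
The filter above is only the middle stage of a Logstash pipeline; a complete config also needs an input and an output. Below is a minimal sketch of what that might look like, assuming a local file input and a local Elasticsearch node; the file path and index name are placeholders, not the exact settings used here.

input {
  file {
    path => "/path/to/sample.log"        # placeholder path to the log file
    start_position => "beginning"        # read the file from the start
    sincedb_path => "/dev/null"          # forget read positions, handy when re-running tests
  }
}

filter {
  grok {
    # split each line into the numeric field and the epoch timestamp
    match => { "message" => "%{GREEDYDATA:count}\t%{GREEDYDATA:timestamp}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]          # assumes a local Elasticsearch node
    index => "log-test"                  # placeholder index name
  }
  stdout { codec => dots }               # optional: print a dot per event to watch progress
}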

Running this configuration over the dataset gave the following results.

Log entries    Time
50K            Under 5 seconds
500K           1 minute
1M             1 minute 54 seconds
2M             3 minutes 48 seconds

That boils down to roughly 9K entries per second (2,000,000 entries / 228 seconds ≈ 8.8K/sec). A couple of things to keep in mind here:

  • The log entries are pretty simple in this case; more complex entries would lower the processing rate.
  • The machine used is an ordinary developer laptop; a fine-tuned server cluster would achieve a far better processing rate. In fact, there are systems that have achieved 25K, 50K, and even 100K events per second.

To put this in perspective, a production-ready Logstash deployment typically has the following pipeline (a configuration sketch of the filter layer follows the list):

  • An input layer consisting of Logstash instances with the appropriate input plugins (e.g. the Beats input receiving from Filebeat shippers) to consume data from its sources.
  • A message queue that buffers the ingested data and also serves as failover protection.
  • A filter layer that does the parsing and other processing of the data consumed from the message queue.
  • The final layer that moves the processed data into Elasticsearch.
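
As an illustration of the filter layer in such a setup, the sketch below shows a Logstash instance that consumes from a message queue (Kafka in this example) and ships parsed events to Elasticsearch. The broker address, topic name, hosts, and index pattern are placeholders, and the exact plugin options depend on your Logstash version.

input {
  kafka {
    bootstrap_servers => "kafka:9092"    # placeholder broker address
    topics => ["raw-logs"]               # placeholder topic holding ingested log lines
  }
}

filter {
  grok {
    match => { "message" => "%{GREEDYDATA:count}\t%{GREEDYDATA:timestamp}" }
  }
}

output {
  elasticsearch {
    hosts => ["es-node-1:9200"]          # placeholder Elasticsearch host
    index => "logs-%{+YYYY.MM.dd}"       # daily indices, a common convention
  }
}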

While 9K/sec in a dev environment and up to 100K/sec in production is not bad, there are real-world scenarios where the incoming data rate is easily higher than that. Given these benchmarks, I feel Elasticsearch would have a tough time coping with much larger rates, e.g. a million records per second. The reason I feel this way is that Elasticsearch is primarily a search engine: it has to index every incoming record quickly to enable accurate and fast searching.

Log analysis, on the other hand, is often not the kind of problem where you want to store each individual record. Rather, you are usually more interested in counts and aggregates over some time period or window. I explore this further in part two of this blog series.
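
As a quick taste of that idea, Logstash itself ships with a metrics filter that can emit rolling counts and rates instead of (or alongside) every raw record. The sketch below follows the plugin's documented meter example; treat the field names and output as illustrative rather than a drop-in config.

filter {
  metrics {
    meter => "events"                    # maintain a rolling meter named "events"
    add_tag => "metric"                  # tag the periodic metric events it emits
  }
}

output {
  if "metric" in [tags] {
    stdout {
      # print the running count and 1-minute rate carried by each metric event
      codec => line { format => "count: %{[events][count]} rate_1m: %{[events][rate_1m]}" }
    }
  }
}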
