Scalable architecture for real-time Twitter sentiment analysis


This post describes the design and implementation of a scalable architecture for monitoring and visualizing sentiment toward a Twitter hashtag in real time. The project streams live tweets matching a hashtag from Twitter, performs sentiment analysis on each tweet, and calculates a rolling mean of the sentiment scores. This rolling mean is continuously pushed to connected browser clients and displayed in a sparkline graph.

The complete source code for this project is available on GitHub.

System design

The diagram below illustrates the different components and the flow of information (from left to right).

[System design diagram]

Project breakdown

The project has three parts:

1. Web server

The web server is a Python Flask server. It fetches data from Twitter using Tweepy and pushes the tweets into Kafka. A sentiment analyzer picks the tweets up from Kafka, performs sentiment analysis using NLTK, and pushes the results back into Kafka. The sentiment scores are read by the Spark Streaming job (part 3), which calculates the rolling average and writes it back into Kafka. In the final step, the web server reads the rolling mean from Kafka and sends it to connected clients via Socket.IO. An HTML/JS client displays the live sentiment in a sparkline graph using Google annotation charts.
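As an illustration, here is a minimal sketch of the per-tweet scoring step. The post only names NLTK, so the choice of the VADER analyzer and the example tweet are assumptions, not necessarily what the project's code does.

    # Sketch: score a single tweet with NLTK's VADER analyzer.
    # Requires the lexicon once: nltk.download('vader_lexicon')
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    def score_tweet(text):
        # 'compound' is a normalized score in [-1, 1]; values above 0 indicate
        # positive sentiment, values below 0 indicate negative sentiment.
        return analyzer.polarity_scores(text)["compound"]

    print(score_tweet("Loving the new camera on this phone!"))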

The web server runs each independent task in a separate thread:

Thread 1: fetches tweets from Twitter
Thread 2: performs sentiment analysis on each tweet
Thread 3: reads the rolling mean produced by Spark Streaming

Each of these threads can also run as an independent service, which keeps the system scalable and fault tolerant.
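A minimal sketch of this threading layout, using hypothetical worker functions (the actual module and function names in the repository may differ):

    # Sketch: run each pipeline stage in its own daemon thread.
    import threading

    def ingest_tweets():
        pass  # thread 1: stream tweets via Tweepy and push them into Kafka

    def analyze_sentiment():
        pass  # thread 2: read tweets from Kafka, score them with NLTK, push scores back

    def publish_rolling_mean():
        pass  # thread 3: read the rolling mean from Kafka and emit it over Socket.IO

    for worker in (ingest_tweets, analyze_sentiment, publish_rolling_mean):
        thread = threading.Thread(target=worker)
        thread.daemon = True  # don't block the Flask process from exiting
        thread.start()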

2. Kafka

Kafka acts as a message broker between the different modules running within the web server, as well as between the web server and the Spark Streaming job. It provides a scalable and fault-tolerant communication mechanism between independently running services.
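For illustration, here is a minimal sketch of this producer/consumer handoff using the kafka-python client; the topic name and broker address are assumptions, not necessarily the ones the project uses.

    # Sketch: one module publishes raw tweets, another consumes them.
    from kafka import KafkaConsumer, KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("tweets", b"Loving the new camera on this phone!")
    producer.flush()

    # auto_offset_reset="earliest" so the sketch also sees messages sent
    # before the consumer connected.
    consumer = KafkaConsumer("tweets",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)  # hand the tweet to the next pipeline stage here
        break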

3. Calculating the rolling mean of sentiments

A separate Java program reads the sentiment scores from Kafka using Spark Streaming, calculates the rolling average using Spark window operations, and writes the results back to Kafka.
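The project's rolling-average job is written in Java; as an illustration of the same windowed-mean idea, here is a sketch using Spark Streaming's Python API. The socket source, window sizes, and checkpoint path are assumptions for the sketch, not the project's actual configuration.

    # Sketch: rolling mean of sentiment scores with Spark Streaming window operations.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="RollingSentimentMean")
    ssc = StreamingContext(sc, 5)          # 5-second micro-batches
    ssc.checkpoint("/tmp/sentiment_ckpt")  # required for inverse-reduce windows

    # One sentiment score per line; the real job reads these from Kafka instead.
    scores = ssc.socketTextStream("localhost", 9999).map(float)

    # Maintain (sum, count) over a sliding window, then derive the mean.
    sums_counts = scores.map(lambda s: (s, 1)).reduceByWindow(
        lambda a, b: (a[0] + b[0], a[1] + b[1]),   # add batches entering the window
        lambda a, b: (a[0] - b[0], a[1] - b[1]),   # remove batches leaving the window
        60,                                        # window length: 60 seconds
        5)                                         # slide interval: 5 seconds
    sums_counts.map(lambda p: p[0] / p[1] if p[1] else 0.0).pprint()

    ssc.start()
    ssc.awaitTermination()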

How to run

To run the project:

  1. Download, set up, and run Apache Kafka. On OS X, I use the following commands from Kafka's bin directory:

        sh zookeeper-server-start.sh ../config/zookeeper.properties
        sh kafka-server-start.sh ../config/server.properties

  2. Install the complete NLTK package.
  3. Create a Twitter app and set your keys in

        live_twitter_sentiment_analysis/webapp/tweet_ingestion/config.py

  4. Install the Python packages:

        pip install -r live_twitter_sentiment_analysis/webapp/requirements.txt

  5. Run the web server:

        python live_twitter_sentiment_analysis/webapp/main.py

  6. Run the Maven Java project (rolling_average) after installing the dependencies specified in

        live_twitter_sentiment_analysis/rolling_average/pom.xml

     Don't forget to set the checkpoint directory in Main.java.

  7. Open the URL:

        localhost:8001/index.html


Output

Here is what the final output looks like in the browser:

[Screenshot: live sentiment sparkline in the browser]

The complete source code for this project is available on GitHub.

Note: Tested on Python 2.7
