You've deployed and set up a private cloud platform, but now what? You need an application!
I've been experimenting with a number of technologies to generate workloads and give some demos to prospective Eucalyptus customers. A NoSQL database seems like a great use-case to demo as the technology benefits from being designed for scale-out workloads and this happens to be exactly what an IaaS Cloud does best.
There is an abundance of NoSQL implementations (Cassandra, MongoDB, Couchbase, Neo4j...), written in different programming languages and with slightly different takes on which two of the three CAP theorem guarantees they prioritise and on how they store and present data.
For this post I'm going to use MongoDB, which is in the "CP" camp: it favours Consistency and Partition Tolerance over Availability (not every request is guaranteed a response), although MongoDB still provides some great availability options.
MongoDB is supported by 10gen, seems fairly mature and has a large community of users with modules for a ton of different programming languages. Cassandra also interests me and I'll tackle that in a later post.
We also need a bunch of data. There are large datasets available on the internet, but last week I read a post on using the Twitter streaming API with Ruby and storing that data in MongoDB, and thought it would be cool to do the same, albeit with Python instead of Ruby.
Creating an ssh keypair and application security group
To start, let's set up a keypair and security group for MongoDB so that we can ensure it is not going to be accessed by anyone else:
# Ensure we have our Eucalyptus or Amazon credentials in the environment
source ~/eucarc
# Create an ssh keypair
euca-add-keypair mongodb > ~/mongodb.key
chmod 400 ~/mongodb.key
# Add SSH, MongoDB and MongoDB admin interface ports to mongodb security group
euca-create-group mongodb -d "MongoDB databases"
# Replace 0.0.0.0/0 with your IP e.g. 126.96.36.199/32 to restrict it to just your system
euca-authorize -P tcp -p 22 -s 0.0.0.0/0 mongodb
euca-authorize -P tcp -p 27017 -s 0.0.0.0/0 mongodb
euca-authorize -P tcp -p 28017 -s 0.0.0.0/0 mongodb
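If you do want to lock the group down to a single address, it is easy to mistype a CIDR three times in a row. Here is a minimal sketch that validates the source address and prints the three euca-authorize lines used above. It assumes Python 3 (for the standard ipaddress module); the helper name authorize_commands and the example address 192.0.2.10/32 are mine, purely for illustration.

```python
import ipaddress

# Ports taken from the commands above: SSH, MongoDB, MongoDB admin interface.
MONGODB_PORTS = (22, 27017, 28017)


def authorize_commands(source_cidr, group="mongodb", ports=MONGODB_PORTS):
    """Return one euca-authorize command line per port for the given source CIDR."""
    # ip_network raises ValueError on malformed input such as "10.0.0.300/32".
    network = ipaddress.ip_network(source_cidr)
    return [
        "euca-authorize -P tcp -p {0} -s {1} {2}".format(port, network, group)
        for port in ports
    ]


if __name__ == "__main__":
    # 192.0.2.10/32 is a documentation-only address; use your own IP here.
    for line in authorize_commands("192.0.2.10/32"):
        print(line)
```

The script only prints the commands, so you can review them before pasting them into a shell that has your credentials sourced.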
Run an instance
We can now spin up an instance running Ubuntu 12.04 LTS x86_64 and install MongoDB on our private cloud:
euca-run-instances -k mongodb -g mongodb -t c1.xlarge emi-87F63CE5
If you are using AWS or your own cloud, you'll need to substitute the EMI ID I've used with the ID of an Ubuntu AMI or your own image. You will also need to use your own keypair.
After a few moments our instance should show as 'running':
$ euca-describe-instances
RESERVATION r-AB3F4645 985725263417 mongodb
INSTANCE i-D89D40E2 emi-87F63CE5 188.8.131.52 184.108.40.206 running mongodb 0 c1.xlarge 2013-02-03T22:40:26.743Z cluster1 eki-222540D6 eri-A5753DBE monitoring-disabled 220.127.116.11 18.104.22.168 instance-store
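If you would rather check the state from a script than by eye, the output is simple enough to parse. The following is a rough sketch that assumes the column order shown in the output above (instance ID in the second column, state in the sixth); the function name instance_states is hypothetical. An instance that has not been assigned addresses yet may shift the columns when splitting on whitespace, so treat this as a starting point only.

```python
def instance_states(describe_output):
    """Map instance ID to state from euca-describe-instances text output."""
    states = {}
    for line in describe_output.splitlines():
        fields = line.split()
        # INSTANCE rows: record type, instance ID, image ID, two addresses, state, ...
        if len(fields) > 5 and fields[0] == "INSTANCE":
            states[fields[1]] = fields[5]
    return states


if __name__ == "__main__":
    # Shortened copy of the output shown above.
    sample = (
        "RESERVATION r-AB3F4645 985725263417 mongodb\n"
        "INSTANCE i-D89D40E2 emi-87F63CE5 188.8.131.52 184.108.40.206 "
        "running mongodb 0 c1.xlarge\n"
    )
    print(instance_states(sample))
```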
Let's connect to the instance and install MongoDB.
The MongoDB documentation goes into the installation of MongoDB in more detail.
Ubuntu 12.04 LTS has version 2.0.4 of MongoDB in its repositories, but 2.2.3 is the current stable version upstream, so we'll use the repository from 10gen to install the latest package:
# Replace 22.214.171.124 with your instance IP! 'ubuntu' is assumed as the login user (the default on stock Ubuntu cloud images)
ssh -i ~/mongodb.key ubuntu@22.214.171.124
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
echo "deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen" | sudo tee -a /etc/apt/sources.list.d/10gen.list
sudo apt-get update
sudo apt-get install -y mongodb-10gen
At this point we have a running instance with MongoDB installed and started. You should be able to navigate to the MongoDB admin interface (port 28017 on the instance) in your web browser.
Now that we have MongoDB running, we need to import some Twitter data. Twitter has a streaming API that is publicly accessible (as long as you have a Twitter account!) and there are a number of modules for the programming language of your choice.
Tweetstream isn't packaged for Ubuntu, so I'll use the source:
sudo apt-get install -y python-setuptools
wget -c http://pypi.python.org/packages/source/t/tweetstream/tweetstream-1.1.1.tar.gz
tar -zxvf tweetstream-1.1.1.tar.gz
cd tweetstream-1.1.1 && sudo python setup.py install
PyMongo is the official MongoDB Python driver and is available from the Ubuntu archive.
sudo apt-get install -y python-pymongo
Writing a python script to save tweets into MongoDB
The following script is based on some of those examples. It connects to MongoDB and stores tweets in the 'tweets' collection of a database called 'twitterstream'. It stores the whole tweet, which includes a lot of metadata; this metadata might be useful later to sort tweets or to index particular fields we are interested in querying. It's important to note that the streaming API does not give us all tweets on Twitter, merely a small percentage, as the "Firehose" API that contains all tweets is not public.
import tweetstream
import pymongo

username = "TWITTER_USERNAME"
password = "TWITTER_PASSWORD"
mongohost = "localhost"

# Connect to MongoDB and select the 'twitterstream' database
connection = pymongo.Connection(mongohost, 27017)
db = connection.twitterstream

with tweetstream.TweetStream(username, password) as stream:
    for tweet in stream:
        try:
            # Save the whole tweet but only show certain fields on screen
            db.tweets.save(tweet)
            print tweet['created_at'], tweet['id'], "Username: ", tweet['user']['screen_name'], ':', tweet['text'].encode('utf-8')
        except Exception:
            # Skip messages that fail to save or lack these fields; Ctrl-C still stops the script
            pass
If we run this, we should see a stream of tweets printed out and the whole tweets stored within MongoDB.
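The script leans on try/except to get past stream messages that don't carry the fields it prints. If you'd prefer an explicit check, here is a small sketch of one way to do it. The field names are the ones the script already uses; the function name summarize_tweet and the sample values are made up for illustration.

```python
def summarize_tweet(tweet):
    """Return the fields the script prints, or None if the message is not a tweet."""
    user = tweet.get("user") or {}
    if "text" not in tweet or "screen_name" not in user:
        return None
    return {
        "created_at": tweet.get("created_at"),
        "id": tweet.get("id"),
        "screen_name": user["screen_name"],
        "text": tweet["text"],
    }


if __name__ == "__main__":
    # Hypothetical message, shaped like the fields the script reads.
    sample = {
        "created_at": "Sun Feb 03 22:40:26 +0000 2013",
        "id": 1,
        "user": {"screen_name": "example_user"},
        "text": "Hello from the streaming API",
    }
    print(summarize_tweet(sample))
```

Inside the loop you could call this on each message and only print when it returns something, while still saving the whole message as before.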
Use the mongo shell to see if there are entries in the database:
$ mongo
MongoDB shell version: 2.2.3
connecting to: test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
    http://docs.mongodb.org/
Questions? Try the support group
    http://groups.google.com/group/mongodb-user
>
> show dbs
admin (empty)
local (empty)
twitterstream 0.203125GB
> use twitterstream
switched to db twitterstream
> show collections
system.indexes
tweets
> db.tweets.find()
The final command should output a portion of the tweets in the JSON document format that MongoDB uses to display query results.
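Since the whole tweet is stored, you can also query on the nested metadata from Python. The sketch below assumes you pass in the db.tweets collection object from the script; the function name tweets_by_user is hypothetical. It uses MongoDB's dot notation to match on the embedded user document and PyMongo's cursor sort and limit methods.

```python
def tweets_by_user(collection, screen_name, limit=5):
    """Return up to `limit` stored tweets from one user, highest tweet ID first."""
    # Dot notation reaches into the embedded 'user' document of each tweet.
    cursor = collection.find({"user.screen_name": screen_name})
    # -1 is the value of pymongo.DESCENDING.
    return list(cursor.sort("id", -1).limit(limit))
```

With the script's connection this would be called as tweets_by_user(db.tweets, "some_screen_name"). Without an index on user.screen_name the query scans the whole collection, which is fine for a demo but worth revisiting as the data grows.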
That's it, we're now streaming tweets into MongoDB via Python tweetstream!
In part 2, I'll investigate scaling out the MongoDB database by spinning up new Eucalyptus instances and configuring replication and sharding.