Python – Dump Twitter tweets from MongoDB to COSMOS


I was wondering what the best way is to dump a large number of tweets from the Twitter Streaming API into COSMOS in order to run a very simple MapReduce job.

I’m thinking about converting the collection’s documents to CSV, one document per line, and then scp-ing the file to COSMOS. But I’m not sure whether I need Hive to run the MR job there, or whether I can run the job in a more manual way. I’d like to use Python for this; I’d rather not have to use Java.
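The CSV conversion step described above can be sketched in Python. The field names below (`id_str`, `created_at`, `user.screen_name`, `text`) are assumptions based on the usual Twitter API document shape; adjust them to match your actual collection.

```python
import csv
import io

# Hypothetical field list; adjust to match your tweet documents.
FIELDS = ["id_str", "created_at", "user_screen_name", "text"]

def tweet_to_csv_line(doc):
    """Flatten one MongoDB tweet document into a single CSV line.

    The nested user.screen_name field is flattened first; the csv
    module takes care of quoting commas inside the tweet text, and
    embedded newlines are replaced so each document stays on one line.
    """
    flat = dict(doc)
    if "user" in doc:
        flat["user_screen_name"] = doc["user"].get("screen_name", "")
    buf = io.StringIO()
    csv.writer(buf).writerow(
        [str(flat.get(f, "")).replace("\n", " ") for f in FIELDS]
    )
    return buf.getvalue().rstrip("\r\n")

# With pymongo, the dump itself would then be a simple loop, e.g.:
# for doc in db.tweets.find():
#     out_file.write(tweet_to_csv_line(doc) + "\n")
```

One CSV line per document keeps the file directly usable by Hadoop Streaming, which splits its input on newlines.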

Thanks

Solution

I don’t think it’s necessary to dump the data: the MongoDB Connector for Hadoop can be used instead. AFAIK, such connectors fetch the data only when it is about to be processed, reading records from data splits as they are needed by Hadoop’s map processes. I mean, instead of using the default FileInputFormat, you use MongoInputFormat, which implements the InputFormat interface and thus provides a method to get the list of splits (each split being some kind of constant-sized partition of the data in MongoDB, such as a block of a collection) and a way to get the records within a split (such as the JSON documents in a block of a collection).
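To illustrate the split idea without Java, here is a small Python sketch of how an InputFormat-style partitioning of a collection might look, expressed as plain skip/limit ranges. This is an illustration only; the real MongoInputFormat computes splits inside Hadoop in Java.

```python
def compute_splits(total_docs, split_size):
    """Partition a collection of total_docs documents into
    constant-sized (skip, limit) ranges, one per map task."""
    splits = []
    skip = 0
    while skip < total_docs:
        splits.append((skip, min(split_size, total_docs - skip)))
        skip += split_size
    return splits

# Each map task reads only its own range, e.g. with pymongo:
# for doc in collection.find().skip(skip).limit(limit):
#     process(doc)
```

This is exactly what lets the records be fetched lazily: no single process ever has to materialize the whole collection.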

Such connectors have to be installed on all the nodes of the cluster. Doing so is on our roadmap, together with our own connector for CKAN, and is expected by the end of September.

That being said, if for any reason you still want to dump the data to HDFS, the best thing to do is to create a script that takes care of reading the MongoDB data and converting it into NGSI-like notifications that Cygnus understands. Then Cygnus will do the rest.
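As a rough sketch of that script, the snippet below wraps one tweet document in an NGSI-v1-style notification body. The exact payload Cygnus expects depends on your Cygnus version and sink configuration, so treat this structure (and the `Tweet` entity type chosen here) as an assumption to verify against the Cygnus documentation.

```python
import json

def tweet_to_ngsi_notification(doc, subscription_id="sub-1"):
    """Wrap one tweet document in an assumed NGSI-v1-style
    notification body, ready to be POSTed to Cygnus as JSON."""
    attributes = [
        {"name": "text", "type": "string", "value": doc.get("text", "")},
        {"name": "created_at", "type": "string",
         "value": doc.get("created_at", "")},
    ]
    return {
        "subscriptionId": subscription_id,
        "contextResponses": [{
            "contextElement": {
                "type": "Tweet",          # hypothetical entity type
                "isPattern": "false",
                "id": str(doc.get("id_str", "")),
                "attributes": attributes,
            },
            "statusCode": {"code": "200", "reasonPhrase": "OK"},
        }],
    }

# The script would read documents with pymongo and POST
# json.dumps(notification) to Cygnus's HTTP listener.
```

Cygnus then handles the HDFS writes itself, so no scp or manual file management is needed.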
