Unless you’ve been passed out at a bar, you’ve probably heard about Map/Reduce, Hadoop, or just the “cloud”. After all, it’s the “next big thing”, if you’re into that sort of stuff.
At work, the “cloud”, Map/Reduce, and analytics are all the hotness. I figured I’d try out the Hadoop Streaming API by piping data in and out with Ruby. I started out with a very small text file. Next I wrote the mapper, which reads lines of text from STDIN and emits results through STDOUT.
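The post doesn’t show the actual script, so here’s a minimal sketch of what such a mapper looks like. The word-count task, the file name `mapper.rb`, and the `map_line` helper are my own assumptions, not the original code:

```ruby
#!/usr/bin/env ruby
# mapper.rb -- sketch of a streaming mapper (word count assumed).
# Hadoop Streaming feeds raw input lines on STDIN and expects
# tab-separated "key\tvalue" pairs back on STDOUT.

def map_line(line)
  # Emit each word as a key with a count of 1.
  line.split.map { |word| [word.downcase, 1] }
end

if __FILE__ == $PROGRAM_NAME
  STDIN.each_line do |line|
    map_line(line).each { |key, count| puts "#{key}\t#{count}" }
  end
end
```

Keeping the logic in a method makes the script trivially testable outside of Hadoop, which matters later.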
The reducer would be nearly as trivial as the mapper, except that it needs to keep track of when the key changes. Unlike the standard Java reducer API, which hands you a key together with all of its values, the streaming API just feeds you sorted key/value lines one at a time. Beyond that, the reducer (at least in my trivial example) is rather simple.
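That key-change bookkeeping is the only interesting part, so here’s a sketch of it. Again, the word-count totals, the file name `reducer.rb`, and the `reduce_lines` helper are my assumptions:

```ruby
#!/usr/bin/env ruby
# reducer.rb -- sketch of a streaming reducer. Input arrives as
# sorted "key\tvalue" lines, so we sum a running count and flush
# it whenever the key changes (and once more at the end).

def reduce_lines(lines)
  totals = []
  current_key = nil
  count = 0
  lines.each do |line|
    key, value = line.chomp.split("\t")
    if key != current_key
      # Key changed: emit the finished total for the previous key.
      totals << [current_key, count] unless current_key.nil?
      current_key = key
      count = 0
    end
    count += value.to_i
  end
  totals << [current_key, count] unless current_key.nil?
  totals
end

if __FILE__ == $PROGRAM_NAME
  reduce_lines(STDIN).each { |key, count| puts "#{key}\t#{count}" }
end
```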
The best part is you can do all of this without ever setting up or using Hadoop! Now wait a minute, I thought I was trying out the Hadoop Streaming API. Well, you can do all the work and even test your mapper and reducer without using Hadoop. By simply piping data into and out of the mapper and reducer you can simulate what Hadoop will do. If you want to get real fancy pants you can throw in an intermediate sort before piping data to your reducer.
Check out my gist, which does the minimal thing before getting Hadoop involved.
Pretty cool stuff, I think. Now I just need to run this on a real cloudy Hadoop setup and everything will be real groovy!