Hadoop Java vs. C/C++ for CPU-intensive tasks
I’m new to Hadoop. I want to use Hierarchical Clustering to cluster about 150 million items, each with about 30 attributes. The total number of dimensions/attributes is approximately 5000.
I designed a multi-level solution that partitioned the entire data and clustered each partition, then merged each cluster until the desired number of clusters was retrieved.
- Clustering is performed in each map task, so each map task will be CPU-intensive.
- I am stuck deciding between the following options:
  - MapReduce in native Java.
  - MapReduce using Hadoop Streaming with C (because each task is CPU-intensive).

Which option should I go with? Is there any other way I could achieve my goal?
In many cases, well-written Java will deliver C-like performance unless the C code is carefully optimized. In surprisingly many cases, well-written Java code is even faster, because C code is optimized at compile time while the Java HotSpot compiler optimizes at runtime, using statistics on how often each code path is actually taken.
If you collect similar statistics yourself, and they do not change with your data, you can sometimes give the C compiler equivalent hints, for example via `__builtin_expect()`, which is available in some C compilers such as GCC and Clang. But it's really hard to do well.
Keep in mind, however, that some parts of Java are quite expensive:
- Never use `ArrayList<Double>` etc. for calculations, because of the boxing cost. Boxed types are really expensive in hot loops.
- Consider using faster I/O than `BufferedReader`. Hadoop uses `Text` instead of `String` for a reason: buffer recycling reduces I/O costs.
- Start-up costs. Your application should run for a long time and not be restarted frequently.
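To illustrate the boxing point, here is a minimal sketch in plain Java (no Hadoop involved; class and method names are my own for illustration). Both methods compute the same sum, but the boxed version allocates a heap `Double` per element and unboxes on every read, which is exactly what you want to avoid in a hot inner loop:

```java
import java.util.ArrayList;
import java.util.List;

public class BoxingCost {
    // Hot loop over primitives: no per-element object allocation.
    static double sumPrimitive(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    // Same loop over boxed Doubles: every element is a heap object,
    // and each iteration unboxes it. Avoid this in inner loops.
    static double sumBoxed(List<Double> values) {
        double sum = 0.0;
        for (Double v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] primitive = new double[1_000_000];
        List<Double> boxed = new ArrayList<>(primitive.length);
        for (int i = 0; i < primitive.length; i++) {
            primitive[i] = i * 0.5;
            boxed.add(i * 0.5); // autoboxing: allocates a Double per element
        }
        // Same result, very different allocation behavior.
        System.out.println(sumPrimitive(primitive) == sumBoxed(boxed));
    }
}
```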
Also remember that Hadoop Streaming is not free. In case you hadn't realized: Hadoop Streaming itself is implemented in Java, and all data passes through Java. The streaming wrapper is a Java mapper that launches your external program, writes data to it (i.e., serializes it!), and reads the output back (deserializes it!), then returns it to Hadoop. So you pay almost all of the Java costs on top of your actual program's costs. Benchmark something simple, like word count in C versus an optimized word count in Java, to see the difference.
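As a baseline for such a benchmark, a minimal in-process word count in Java might look like this (a standalone sketch, not a Hadoop job; the class and method names are my own):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    // Count whitespace-separated tokens; a stand-in for the Java side
    // of a streaming-vs-native comparison.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = count("the quick fox the fox");
        System.out.println(c); // counts: the=2, fox=2, quick=1
    }
}
```

Time this against the same logic behind the streaming wrapper (serialize to an external C program, read results back) on identical input, and the serialization overhead becomes visible directly.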
As for your actual task, HAC: first make sure you have a working similarity measure. There is nothing worse than building a large-scale clustering algorithm only to find out it doesn't work because you can't measure similarity in a meaningful way. First solve the problem on small samples, then scale up.
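For that small-sample validation step, even a naive single-linkage HAC is enough. The sketch below (my own toy code, 1-D points for brevity; real items would use your 30-attribute distance) is O(n³) and only suitable for prototyping the distance measure, never for 150 million items:

```java
import java.util.ArrayList;
import java.util.List;

public class TinyHAC {
    // Naive single-linkage agglomerative clustering down to k clusters.
    // O(n^3) overall: for sanity-checking a distance measure on small samples.
    static List<List<Double>> cluster(double[] points, int k) {
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : points) {
            List<Double> c = new ArrayList<>();
            c.add(p);
            clusters.add(c);
        }
        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double best = Double.POSITIVE_INFINITY;
            // Find the closest pair of clusters and merge them.
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = linkage(clusters.get(a), clusters.get(b));
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            }
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    // Single linkage: minimum pairwise distance between two clusters.
    static double linkage(List<Double> a, List<Double> b) {
        double min = Double.POSITIVE_INFINITY;
        for (double x : a)
            for (double y : b)
                min = Math.min(min, Math.abs(x - y));
        return min;
    }

    public static void main(String[] args) {
        double[] sample = {1.0, 1.2, 1.1, 9.0, 9.3, 20.0};
        // Nearby points should end up grouped together.
        for (List<Double> c : cluster(sample, 3)) System.out.println(c);
    }
}
```

If the groups this produces on a hand-picked sample don't match your intuition, fix the similarity measure before writing any MapReduce code.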