Iterative k-means based on MapReduce and Hadoop
I’ve written a simple k-means clustering code for Hadoop (two separate programs: a mapper and a reducer). It works on a small 2D point dataset on my local machine. It’s written in Python, and I plan to use the Streaming API.
Each run of the mapper and reducer produces a new set of centroids. These centroids are the input for the next iteration.
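The mapper/reducer pair described above can be sketched roughly as follows. This is a minimal local sketch, not the asker's actual code: the `x,y` line format, the centroid file layout, and all function names are assumptions.

```python
# Sketch of the two Streaming programs: the mapper assigns each point to
# its nearest centroid, the reducer averages the points per centroid.
def load_centroids(path):
    """Read one 'x,y' centroid per line (assumed file format)."""
    with open(path) as f:
        return [tuple(map(float, line.split(","))) for line in f if line.strip()]

def nearest(point, centroids):
    """Index of the closest centroid by squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def mapper(lines, centroids):
    """Emit (centroid_index, point) for each 'x,y' input line."""
    for line in lines:
        point = tuple(map(float, line.strip().split(",")))
        yield nearest(point, centroids), point

def reducer(pairs):
    """Average the points assigned to each centroid index -> new centroids."""
    sums, counts = {}, {}
    for idx, point in pairs:
        counts[idx] = counts.get(idx, 0) + 1
        acc = sums.setdefault(idx, [0.0] * len(point))
        for d, value in enumerate(point):
            acc[d] += value
    for idx in sorted(sums):
        yield idx, tuple(v / counts[idx] for v in sums[idx])
```

In a real Streaming job, `mapper` and `reducer` would each live in their own script, read from `sys.stdin`, and print tab-separated key/value lines instead of yielding tuples.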
As suggested, I am using mrjob, a Python MapReduce library that supports multi-step jobs.
This covers only a single iteration. Please tell me how to feed the output back to the mapper once the new centroids are built. As you can see, the last step (the reducer) produces new centroids, and they need to go back to the mapper (step one) so it can compute each point’s distance to the new centers, and so on until the clustering converges satisfactorily.
(Please don’t tell me about Mahout, Spark, or any other implementation; I know about them.)
To stop k-means, we typically define either a maximum number of iterations or a distance threshold. Here you can chain MapReduce jobs for the desired number of iterations: write the cluster centroids produced by each reducer to a temporary file and provide that file to the next mapper. Repeat until you reach your iteration limit (or the centroids move less than the threshold).
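The chaining logic in this answer can be sketched as a small driver loop. This is a hedged local sketch: `run_iteration` is a hypothetical stand-in for launching one Streaming (or mrjob) map/reduce pass, and the threshold value is an assumption; in the real pipeline each iteration's centroids would be written to a temporary file (or HDFS) and shipped to the next mapper.

```python
import math

def has_converged(old, new, threshold=1e-4):
    """Stop when no centroid moved farther than the threshold."""
    return all(math.dist(a, b) <= threshold for a, b in zip(old, new))

def drive(points, centroids, run_iteration, max_iters=20, threshold=1e-4):
    """Chain iterations; each pass consumes the previous centroids."""
    for _ in range(max_iters):
        # One full map/reduce pass; in Hadoop this is a job launch, and
        # the result would be written to a temp file for the next mapper.
        new_centroids = run_iteration(points, centroids)
        if has_converged(centroids, new_centroids, threshold):
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids
```

With mrjob specifically, an alternative is to run the same one-step job in a loop from a driver script, passing the centroid file of iteration *i* as a job argument to iteration *i+1*.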