Java – Serialize RandomAccessSparseVector in Mahout


I’m loading data into a RandomAccessSparseVector in Mahout 0.7, but I don’t know how to serialize it. If I were using VectorWritable, I could use SequenceFile.Writer like this:

writer = new SequenceFile.Writer(
    fs, conf, new Path("filename"), LongWritable.class,
    VectorWritable.class);

Unfortunately, there is no RandomAccessSparseVectorWritable.

One option is to forget about sparse vectors altogether, load the data into a VectorWritable, and serialize that. I want to avoid this because it is sloppy to enter a lot of zeros by hand, and the result would take up a lot of disk space when serialized. A RandomAccessSparseVector also cannot be cast to VectorWritable.

In case it is useful, this is how I set up the configuration:

Configuration conf = new Configuration();
conf.set("io.serializations",
    "org.apache.hadoop.io.serializer.WritableSerialization");

This is so that Hadoop knows how to serialize Writable keys and values.

Solution

The solution is very simple. After a period of fruitless digging through the API documentation, I stumbled upon a useful forum post. VectorWritable is not a vector type; it is a wrapper around a vector that handles serialization. Previously, I tried to write a RandomAccessSparseVector created like this:

RandomAccessSparseVector vect = new RandomAccessSparseVector(columns);

by calling

LongWritable key = new LongWritable(foo);
RandomAccessSparseVector vect = new RandomAccessSparseVector(columns);
writer.append(key, vect);  // fails: vect is not a Writable

Instead, I just have to wrap the vector and call

writer.append(key, new VectorWritable(vect));
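
To make the wrapper idea concrete, here is a minimal sketch that uses only the Java standard library. SparseVectorSketch, VectorWritableSketch, and WrapperSketch are hypothetical stand-ins I made up for illustration; they are not Mahout classes, and the byte layout below is an assumption, not Mahout's actual on-disk format. The sketch only shows the shape of the pattern: the vector holds the data, and a separate wrapper knows how to write and read it, emitting only the non-zero entries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a sparse vector: stores only non-zero entries.
final class SparseVectorSketch {
    private final int cardinality;
    private final Map<Integer, Double> entries = new TreeMap<>();

    SparseVectorSketch(int cardinality) {
        this.cardinality = cardinality;
    }

    void set(int index, double value) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        if (value == 0.0) {
            entries.remove(index);   // zeros are implicit, never stored
        } else {
            entries.put(index, value);
        }
    }

    double get(int index) {
        return entries.getOrDefault(index, 0.0);
    }

    int size() {
        return cardinality;
    }

    Map<Integer, Double> nonZeroes() {
        return entries;
    }
}

// Hypothetical stand-in for a serialization wrapper: it is not a vector itself,
// it only carries one and knows how to write and read it.
final class VectorWritableSketch {
    private SparseVectorSketch vector;

    VectorWritableSketch() { }

    VectorWritableSketch(SparseVectorSketch vector) {
        this.vector = vector;
    }

    SparseVectorSketch get() {
        return vector;
    }

    // Assumed layout: cardinality, entry count, then (index, value) pairs.
    void write(DataOutput out) throws IOException {
        out.writeInt(vector.size());
        out.writeInt(vector.nonZeroes().size());
        for (Map.Entry<Integer, Double> e : vector.nonZeroes().entrySet()) {
            out.writeInt(e.getKey());
            out.writeDouble(e.getValue());
        }
    }

    void readFields(DataInput in) throws IOException {
        SparseVectorSketch v = new SparseVectorSketch(in.readInt());
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            int index = in.readInt();
            v.set(index, in.readDouble());
        }
        vector = v;
    }
}

// Convenience helpers that wrap, serialize, and unwrap in memory.
final class WrapperSketch {
    static byte[] serialize(SparseVectorSketch v) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            new VectorWritableSketch(v).write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static SparseVectorSketch deserialize(byte[] data) {
        VectorWritableSketch w = new VectorWritableSketch();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            w.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return w.get();
    }
}
```

With this stand-in, a vector of cardinality 1,000,000 holding two non-zero values serializes to a few dozen bytes, because the zeros are never written. In Mahout itself the analogous steps are wrapping with new VectorWritable(vect) before appending, and calling get() on the VectorWritable after reading to recover the vector.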
