Serialize RandomAccessSparseVector in Mahout
I'm loading data into a RandomAccessSparseVector in Mahout 0.7, but I don't know how to serialize it. If I were using VectorWritable, I could use SequenceFile.Writer like this:
writer = new SequenceFile.Writer(
fs, conf, new Path("filename"), LongWritable.class,
VectorWritable.class);
Unfortunately, there is no RandomAccessSparseVectorWritable.
One option is to forget about sparse vectors altogether, load the data into a VectorWritable, and serialize that. I want to avoid this because it is sloppy to fill in a lot of zeros by hand, and the serialized file would then take up a lot of disk space. A RandomAccessSparseVector also cannot be cast to VectorWritable.
In case it is useful, I have set up the configuration as follows
Configuration conf = new Configuration();
conf.set("io.serializations",
"org.apache.hadoop.io.serializer.WritableSerialization");
so that Hadoop knows how to serialize.
Solution
The solution is very simple. After a period of fruitless digging through the API documentation, I stumbled upon a useful forum post. VectorWritable is not a vector type but a wrapper around a vector, used for serialization. Previously, I tried to write a RandomAccessSparseVector created like this
RandomAccessSparseVector vect = new RandomAccessSparseVector(columns);
by calling
LongWritable key = new LongWritable(foo);
// does not work: vect is not a Writable
writer.append(key, vect);
Instead, I just have to wrap the vector in a VectorWritable when appending:
writer.append(key, new VectorWritable(vect));
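To illustrate why a wrapper is enough and no dense copy is needed, here is a toy, standard-library-only sketch of the same pattern. SparseVec and SparseVecWritable are hypothetical stand-ins I made up for this example; they are not Mahout classes, and the byte layout is not Mahout's on-disk format. The point is only that the wrapper holds the sparse vector and writes out its non-zero entries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: hypothetical classes, NOT Mahout's implementation or format.
class SparseWrapperDemo {

    /** Stand-in for a sparse vector: stores only the non-zero entries. */
    static class SparseVec {
        final int cardinality;
        final Map<Integer, Double> entries = new TreeMap<>();

        SparseVec(int cardinality) { this.cardinality = cardinality; }

        void set(int index, double value) {
            if (value == 0.0) entries.remove(index); else entries.put(index, value);
        }

        double get(int index) { return entries.getOrDefault(index, 0.0); }
    }

    /** Stand-in for the wrapper: holds a vector and knows how to (de)serialize it. */
    static class SparseVecWritable {
        private SparseVec vec;

        SparseVecWritable() { }
        SparseVecWritable(SparseVec vec) { this.vec = vec; }

        SparseVec get() { return vec; }

        // Writes the cardinality, the entry count, then each (index, value) pair.
        void write(DataOutputStream out) throws IOException {
            out.writeInt(vec.cardinality);
            out.writeInt(vec.entries.size());
            for (Map.Entry<Integer, Double> e : vec.entries.entrySet()) {
                out.writeInt(e.getKey());
                out.writeDouble(e.getValue());
            }
        }

        // Rebuilds the vector from the layout produced by write().
        void readFields(DataInputStream in) throws IOException {
            SparseVec v = new SparseVec(in.readInt());
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                int index = in.readInt();
                v.set(index, in.readDouble());
            }
            vec = v;
        }
    }

    /** Serializes the vector through the wrapper and returns the raw bytes. */
    static byte[] serialize(SparseVec vec) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            new SparseVecWritable(vec).write(out);
        }
        return bytes.toByteArray();
    }

    /** Reads a vector back from bytes produced by serialize(). */
    static SparseVec deserialize(byte[] data) throws IOException {
        SparseVecWritable w = new SparseVecWritable();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            w.readFields(in);
        }
        return w.get();
    }

    public static void main(String[] args) throws IOException {
        SparseVec vect = new SparseVec(1_000_000);
        vect.set(3, 1.5);
        vect.set(999_999, -2.0);
        byte[] data = serialize(vect);
        SparseVec copy = deserialize(data);
        System.out.println(data.length + " bytes, copy[3] = " + copy.get(3));
    }
}
```

In this toy format, a million-element vector with two non-zero entries serializes to 32 bytes (two 4-byte header ints plus two 12-byte entries), which is the disk-space saving I was after. The real VectorWritable plays the role of SparseVecWritable here, with its own format.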