HBase MapReduce job: all column values are null
I’m trying to create a MapReduce job in Java over a table in an HBase database. Using an example and other resources I found online, I managed to write a simple row counter successfully. However, a program that actually performs some action on the data in a column fails, because the bytes received are always empty.
Part of my Driver job goes something like this:
/* Set main, map and reduce classes */
job.setJarByClass(Driver.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);
/* Get data only from the last 24h */
Timestamp timestamp = new Timestamp(System.currentTimeMillis());
try {
    long now = timestamp.getTime();
    scan.setTimeRange(now - 24 * 60 * 60 * 1000, now);
} catch (IOException e) {
    e.printStackTrace();
}
/* Initialize the initTableMapperJob */
TableMapReduceUtil.initTableMapperJob(
        "dnsr",
        scan,
        Map.class,
        Text.class,
        Text.class,
        job);
/* Set output parameters */
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);
As you can see, the table is named dnsr. My mapper looks like this:
@Override
public void map(ImmutableBytesWritable row, Result value, Context context)
        throws InterruptedException, IOException {
    byte[] columnValue = value.getValue("d".getBytes(), "fqdn".getBytes());
    if (columnValue == null)
        return;
    byte[] firstSeen = value.getValue("d".getBytes(), "fs".getBytes());
    if (firstSeen == null)
        return;
    String fqdn = new String(columnValue).toLowerCase();
    String fs = (firstSeen == null) ? "empty" : new String(firstSeen);
    context.write(new Text(fqdn), new Text(fs));
}
Some notes:
- The column family in the dnsr table is just d. There are multiple columns, some of which are called fqdn and fs (firstSeen).
- Even though the fqdn value is displayed correctly, fs is always the “empty” fallback string (I added that fallback after getting errors indicating that you can’t convert null to a new string).
- If I change the fs column name to something else, e.g. ls (lastSeen), it works.
- The reducer does nothing but output everything it receives.
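For context on the “empty” fallback: `new String((byte[]) null)` throws a NullPointerException, which is why a missing cell needs a guard before decoding. A minimal null-safe decode helper (hypothetical, not part of my job) would look like this:

```java
import java.nio.charset.StandardCharsets;

public class SafeDecode {
    // Decode a cell value, falling back to a marker string when the cell is absent.
    static String decodeOr(byte[] value, String fallback) {
        return value == null ? fallback : new String(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decodeOr(null, "empty"));
        System.out.println(decodeOr("2020-01-01".getBytes(StandardCharsets.UTF_8), "empty"));
    }
}
```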
I created a simple table scanner in JavaScript that queries the exact same table and columns, and I can clearly see the values there. Using the command line and executing the query manually, I can also see that the fs values are not empty; they are bytes that can later be converted to a string (representing a date).
What could be the problem with me always getting null?
Thanks!
Update:
If I get all the columns in a particular column family, fs is not among them. However, a simple scanner implemented in JavaScript does return fs as a column of the dnsr table.
@Override
public void map(ImmutableBytesWritable row, Result value, Context context)
        throws InterruptedException, IOException {
    byte[] columnValue = value.getValue(columnFamily, fqdnColumnName);
    if (columnValue == null)
        return;
    String fqdn = new String(columnValue).toLowerCase();
    /* Getting all the columns */
    String[] cns = getColumnsInColumnFamily(value, "d");
    StringBuilder sb = new StringBuilder();
    for (String s : cns) {
        sb.append(s).append(";");
    }
    context.write(new Text(fqdn), new Text(sb.toString()));
}
I used a getColumnsInColumnFamily helper from another answer that gets all the column names.
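In case that answer is unavailable: the core of such a helper is reading the qualifier set of `Result.getFamilyMap(family)`, which HBase returns as a `NavigableMap<byte[], byte[]>` of qualifier to value. A standalone sketch of that logic, using a plain `TreeMap` in place of the map HBase would return:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.TreeMap;

public class ColumnNames {
    // Decode each qualifier (map key) to a String; this mirrors what a
    // getColumnsInColumnFamily helper does with Result.getFamilyMap(family).
    static String[] columnNames(NavigableMap<byte[], byte[]> familyMap) {
        return familyMap.keySet().stream()
                .map(q -> new String(q, StandardCharsets.UTF_8))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Stand-in for Result.getFamilyMap("d".getBytes()): qualifier -> value.
        NavigableMap<byte[], byte[]> familyMap = new TreeMap<>(Arrays::compare);
        familyMap.put("fqdn".getBytes(StandardCharsets.UTF_8),
                "example.com".getBytes(StandardCharsets.UTF_8));
        familyMap.put("ls".getBytes(StandardCharsets.UTF_8),
                "2020-01-02".getBytes(StandardCharsets.UTF_8));
        System.out.println(String.join(";", columnNames(familyMap))); // fqdn;ls
    }
}
```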
Solution
Finally, I managed to find the “problem”. HBase is a column-oriented data store: data is stored and retrieved by column, so if you only need some of the data, you can read only the relevant columns. Each column family has one or more column qualifiers (columns), each column has multiple cells, and, importantly, each cell has its own timestamp.
Why does this issue occur? When you do a time-range scan, only the cells whose timestamps fall in the range are returned, so you may end up with rows containing “missing” cells. In my case, each DNS record has, among other fields, firstSeen and lastSeen. lastSeen is updated every time the record is seen, while firstSeen remains unchanged after the first occurrence, so its cell timestamp is old. Once I changed the time-range MapReduce job to a simple one (scanning data from all time), everything was fine (but the job took longer to complete).
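The effect can be reproduced without HBase at all: apply the same per-cell [min, max) timestamp filter that Scan.setTimeRange applies, and an old firstSeen cell silently disappears while a recently updated lastSeen cell survives. A minimal sketch (timestamps are made up):

```java
public class TimeRangeDemo {
    // Scan.setTimeRange(min, max) keeps a cell iff min <= ts < max.
    static boolean inRange(long ts, long min, long max) {
        return ts >= min && ts < max;
    }

    public static void main(String[] args) {
        long now = 1_600_000_000_000L;       // hypothetical "now"
        long dayMs = 24L * 60 * 60 * 1000;
        long min = now - dayMs;              // last 24h, as in the driver

        long fsTimestamp = now - 30 * dayMs; // fs cell written a month ago
        long lsTimestamp = now - dayMs / 2;  // ls cell updated 12h ago

        // The fs cell is dropped, so getValue(...) returns null for it,
        // while the ls cell is still returned.
        System.out.println("fs survives: " + inRange(fsTimestamp, min, now)); // false
        System.out.println("ls survives: " + inRange(lsTimestamp, min, now)); // true
    }
}
```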
Cheers!