Java – HDFS file checksum

I’m trying to use the Hadoop API – DFSClient.getFileChecksum() – to check file consistency after copying a file to HDFS.

Running the code below, I get the following output:

Null
HDFS : null
Local : null

Can anyone point out what I’m doing wrong?
Here is the code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class fileCheckSum {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub

        Configuration conf = new Configuration();

        FileSystem hadoopFS = FileSystem.get(conf);
        Path hdfsPath = new Path("/derby.log");

        LocalFileSystem localFS = LocalFileSystem.getLocal(conf);
        Path localPath = new Path("file:///home/ubuntu/derby.log");

        System.out.println("HDFS PATH : " + hdfsPath.getName());
        System.out.println("Local PATH : " + localPath.getName());

        FileChecksum hdfsChecksum = hadoopFS.getFileChecksum(new Path("/derby.log"));
        FileChecksum localChecksum = localFS.getFileChecksum(new Path("file:///home/ubuntu/derby.log"));

        if (null != hdfsChecksum && null != localChecksum) {
            System.out.println("HDFS Checksum : " + hdfsChecksum.toString() + "\t" + hdfsChecksum.getLength());
            System.out.println("Local Checksum : " + localChecksum.toString() + "\t" + localChecksum.getLength());

            if (hdfsChecksum.toString().equals(localChecksum.toString())) {
                System.out.println("Equal");
            } else {
                System.out.println("UnEqual");
            }
        } else {
            System.out.println("Null");
            System.out.println("HDFS : " + hdfsChecksum);
            System.out.println("Local : " + localChecksum);
        }
    }
}

Solution

Since you don’t set a remote address on the conf and use essentially the same configuration for both, hadoopFS and localFS are both pointing to instances of LocalFileSystem.
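
For illustration, here is a minimal sketch of how the configuration decides which FileSystem implementation you get back (the NameNode address hdfs://namenode:8020 and the class name WhichFileSystem are placeholders, not part of the original code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class WhichFileSystem {
    public static void main(String[] args) throws Exception {
        // With a default Configuration (and no core-site.xml on the classpath
        // saying otherwise), fs.defaultFS falls back to file:///,
        // so FileSystem.get() returns a LocalFileSystem
        Configuration localConf = new Configuration();
        System.out.println(FileSystem.get(localConf).getClass().getName());

        // Point the configuration at the cluster (the address is a placeholder);
        // now FileSystem.get() returns a DistributedFileSystem,
        // and getFileChecksum() no longer returns null
        Configuration hdfsConf = new Configuration();
        hdfsConf.set("fs.defaultFS", "hdfs://namenode:8020");
        System.out.println(FileSystem.get(hdfsConf).getClass().getName());
    }
}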

getFileChecksum is not implemented for LocalFileSystem and simply returns null. It should work for DistributedFileSystem, though: if your conf points to a distributed cluster, FileSystem.get(conf) will return an instance of it. For DistributedFileSystem, getFileChecksum returns an MD5 of MD5s of CRC32 checksums, computed over chunks of bytes.per.checksum bytes. The result therefore depends on the block size and on the cluster-wide bytes.per.checksum setting, which is why these two parameters are also encoded in the algorithm name of the returned checksum: MD5-of-xxxMD5-of-yyyCRC32, where xxx is the number of CRC checksums per block and yyy is the bytes.per.checksum parameter.
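
As a quick illustration (a sketch, assuming the configuration points at a real cluster and /derby.log exists in HDFS; the NameNode address and class name are placeholders), you can print the algorithm name to see this encoding:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintHdfsChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        FileChecksum checksum = fs.getFileChecksum(new Path("/derby.log"));

        // On a DistributedFileSystem the algorithm name has the
        // MD5-of-xxxMD5-of-yyyCRC32 form described above
        System.out.println(checksum.getAlgorithmName());
        System.out.println(checksum.getLength() + " bytes: " + checksum);
    }
}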

This means getFileChecksum is not designed for comparisons across file systems. Although it is possible to simulate the distributed checksum locally, or to craft MapReduce jobs that compute the equivalent of local hashes, I recommend relying on Hadoop’s own integrity checking, which happens whenever a file is written to or read from Hadoop.
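
If you still want a quick end-to-end comparison between the local copy and the HDFS copy, one alternative (a sketch of my own, not part of the original answer; paths, class name, and the NameNode address are placeholders) is to hash the raw bytes on both sides with a plain MD5. Hadoop’s client-side CRC verification still runs while the HDFS stream is being read, so a matching digest is a reasonable sanity check after a copy:

import java.io.InputStream;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareMd5 {

    // Read a stream to the end and return its MD5 digest
    static byte[] md5(InputStream in) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        in.close();
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address

        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        // Digest the file as stored in HDFS and the local original
        byte[] hdfsMd5 = md5(hdfs.open(new Path("/derby.log")));
        byte[] localMd5 = md5(local.open(new Path("file:///home/ubuntu/derby.log")));

        System.out.println(MessageDigest.isEqual(hdfsMd5, localMd5) ? "Equal" : "UnEqual");
    }
}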
