HDFS file checksum
I’m trying to use the Hadoop API – DFSClient.getFileChecksum() – to check file consistency after copying to HDFS.
Running the code below produces this output:
Null
HDFS : null
Local : null
Can anyone point out the error?
Here is the code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
public class fileCheckSum {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hadoopFS = FileSystem.get(conf);
        Path hdfsPath = new Path("/derby.log");

        LocalFileSystem localFS = LocalFileSystem.getLocal(conf);
        Path localPath = new Path("file:///home/ubuntu/derby.log");

        System.out.println("HDFS PATH : " + hdfsPath.getName());
        System.out.println("Local PATH : " + localPath.getName());

        FileChecksum hdfsChecksum = hadoopFS.getFileChecksum(hdfsPath);
        FileChecksum localChecksum = localFS.getFileChecksum(localPath);

        // && (not ||): entering this branch with only one non-null checksum would throw an NPE.
        if (null != hdfsChecksum && null != localChecksum) {
            System.out.println("HDFS Checksum : " + hdfsChecksum.toString() + "\t" + hdfsChecksum.getLength());
            System.out.println("Local Checksum : " + localChecksum.toString() + "\t" + localChecksum.getLength());
            if (hdfsChecksum.toString().equals(localChecksum.toString())) {
                System.out.println("Equal");
            } else {
                System.out.println("UnEqual");
            }
        } else {
            System.out.println("Null");
            System.out.println("HDFS : " + hdfsChecksum);
            System.out.println("Local : " + localChecksum);
        }
    }
}
Solution
Since you don’t set a remote address on the conf and essentially use the same configuration for both, hadoopFS and localFS both point to instances of LocalFileSystem.
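(For completeness: pointing the conf at a real cluster is typically done through the fs.defaultFS property, e.g. in core-site.xml; the host and port below are placeholders, not values from the question.)

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
```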
getFileChecksum is not implemented for LocalFileSystem and returns null. It should work for DistributedFileSystem, though: if your conf points to a distributed cluster, FileSystem.get(conf) returns an instance of DistributedFileSystem, whose getFileChecksum returns an MD5 of per-block MD5s of CRC32 checksums, where each CRC32 covers a chunk of bytes.per.checksum bytes. The value therefore depends on the block size and on the cluster-wide bytes.per.checksum setting, which is why both parameters are also encoded in the algorithm name of the returned checksum: MD5-of-xxxMD5-of-yyyCRC32, where xxx is the number of CRC checksums per block and yyy is the bytes.per.checksum parameter.
getFileChecksum is not designed for comparisons across file systems. Although it is possible to simulate the distributed checksum locally, or to craft a map-reduce job that computes the equivalent of a local hash, I would recommend relying on Hadoop’s own integrity checks, which happen whenever a file is written to or read from HDFS.
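To make the nested structure concrete, here is a pure-JDK sketch of an MD5-over-per-chunk-CRC32s digest. It illustrates only the shape of the scheme, not HDFS’s exact wire format: the chunk-size constant stands in for bytes.per.checksum (HDFS’s default is 512), and the outer per-block MD5 level is omitted.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

public class Md5OfCrc32Sketch {
    // Stand-in for the bytes.per.checksum parameter (HDFS default: 512).
    static final int BYTES_PER_CHECKSUM = 512;

    // MD5 over the concatenated CRC32s of each chunk of the input.
    // Mirrors the shape of HDFS's MD5-of-CRC32 checksums, not the real format.
    static byte[] md5OfCrc32(byte[] data) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (int off = 0; off < data.length; off += BYTES_PER_CHECKSUM) {
                int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
                CRC32 crc = new CRC32();
                crc.update(data, off, len);
                // A CRC32 value fits in 4 bytes; feed it into the MD5 digest.
                md5.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
            }
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is a mandatory JDK algorithm", e);
        }
    }

    public static void main(String[] args) {
        byte[] a = "same contents".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same contents".getBytes(StandardCharsets.UTF_8);
        byte[] c = "different contents".getBytes(StandardCharsets.UTF_8);
        System.out.println(java.util.Arrays.equals(md5OfCrc32(a), md5OfCrc32(b))); // true
        System.out.println(java.util.Arrays.equals(md5OfCrc32(a), md5OfCrc32(c))); // false
    }
}
```

Note how two identical byte streams always agree, while the digest changes whenever any chunk’s CRC changes; the same property is what makes the distributed checksum comparable between two HDFS files on clusters with the same block size and bytes.per.checksum.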