Learn about Hadoop file system counters
I want to know about file system counters in Hadoop.
Here are the counters for the jobs I ran.
In every job I run, I see that the FILE bytes read by the map tasks are almost equal to the HDFS bytes read. I also see that the FILE bytes written by the map tasks are roughly the sum of the FILE bytes read and the HDFS bytes read by the mappers. Please help! Is the same data being read from both the local file system and HDFS, and is all of it being written to the local file system during the map phase?
The short answer is that what you noticed is job-specific. Depending on the job, the mappers and reducers will write more or fewer bytes to local files than the number of bytes that pass through HDFS.
In your case, the mappers read similar amounts of data from the local file system and from HDFS, which is fine. Your mapper code just happens to read about as much data locally as it does from HDFS. Mappers are usually used to process more data than fits in their RAM, so it is not surprising that data fetched from HDFS gets written to a local drive. The bytes read from HDFS plus the bytes read locally do not always add up to the local bytes written (not even in your case).
Here is an example using TeraSort, with 100 GB of data and 1 billion key/value pairs.
File System Counters
    FILE: Number of bytes read=219712810984
    FILE: Number of bytes written=312072614456
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=100000061008
    HDFS: Number of bytes written=100000000000
    HDFS: Number of read operations=2976
    HDFS: Number of large read operations=0
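If you want to compare counters across jobs, it can help to pull them out of the job output into a dictionary. Here is a minimal sketch in Python that parses the counter text shown above. The function name `parse_fs_counters` and the `(scheme, name)` key layout are my own choices for illustration, not part of Hadoop, and the regular expression only assumes the `SCHEME: Number of ...=VALUE` pattern visible in this output.

```python
import re

# Matches entries such as "HDFS: Number of bytes read=100000061008".
# Assumption: counter names consist of lowercase letters and spaces only,
# as in the TeraSort output above.
COUNTER_RE = re.compile(r"(\w+): Number of ([a-z ]+)=(\d+)")


def parse_fs_counters(text):
    """Return {(scheme, counter_name): value} for each counter found in text."""
    counters = {}
    for scheme, name, value in COUNTER_RE.findall(text):
        counters[(scheme, name)] = int(value)
    return counters


sample = """
File System Counters
    FILE: Number of bytes read=219712810984
    FILE: Number of bytes written=312072614456
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=100000061008
    HDFS: Number of bytes written=100000000000
    HDFS: Number of read operations=2976
    HDFS: Number of large read operations=0
"""

counters = parse_fs_counters(sample)
print(counters[("HDFS", "bytes read")])
print(counters[("FILE", "bytes written")])
```

The same pattern also matches when the counters arrive on a single line, since it does not depend on line breaks.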
Note that the number of bytes read from and written to HDFS is almost exactly 100 GB. This is because 100 GB needs to be sorted, and the final sorted output needs to be written back. Also note that saving and sorting the data takes a lot of local I/O: the local bytes read and written are roughly 2x and 3x the amount of data read from HDFS!
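To make the arithmetic concrete, here is a small Python sketch that computes those ratios from the TeraSort counters, and also checks the "local writes = local reads + HDFS reads" pattern from the question. The variable names are mine; the values are copied from the counter output above.

```python
# Values copied from the TeraSort counters above.
file_bytes_read = 219712810984
file_bytes_written = 312072614456
hdfs_bytes_read = 100000061008

# Local I/O relative to the data read from HDFS.
read_ratio = file_bytes_read / hdfs_bytes_read
write_ratio = file_bytes_written / hdfs_bytes_read
print(f"local read  / HDFS read: {read_ratio:.2f}")   # 2.20
print(f"local write / HDFS read: {write_ratio:.2f}")  # 3.12

# Does "local written = local read + HDFS read" hold here?
expected_if_sum = file_bytes_read + hdfs_bytes_read
gap = expected_if_sum - file_bytes_written
print(f"local read + HDFS read = {expected_if_sum}")
print(f"gap versus local written = {gap} bytes")
```

For this TeraSort run the sum lands in the same range as the local bytes written but is not equal to it, which fits the point that the relationship depends on the job.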
One last point: unless you just want to run a job and don't care about the outcome, the number of HDFS bytes written should never be 0, and yours is