HBase scan operation cache
What is the difference between setCaching and setBatch in the HBase scan mechanism?
Which one should I use to get the best performance when scanning large amounts of data?
Solution
Unless you have very wide tables with many columns (or very large columns), you should forget about setBatch() altogether and focus on setCaching():
setCaching(int caching)
Sets the number of rows that are fetched from the server and cached on the client for the scanner. If it is not set, the configuration setting HConstants.HBASE_CLIENT_SCANNER_CACHING applies. A higher caching value makes the scanner faster but uses more memory.
setBatch(int batch)
Sets the maximum number of values (cells) returned for each call to next().
setBatch() limits how many column values of a row are returned per call/iteration, so a very wide row can arrive split across several Result objects instead of one. Here’s a good post about it: http://blog.jdwyah.com/2013/08/hbase-scan-batch-vs-cache.html
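To make the interplay between the two settings concrete, here is a rough back-of-the-envelope model in plain Java. It does not call HBase at all: the class ScanCostModel and its methods are hypothetical names made up for illustration. It assumes that every row has the same number of columns, that batch splits a row into Results of at most batch cells, and that caching counts Results per fetch. It ignores size limits on the server side and the extra calls needed to open and close a scanner, so treat the numbers as an estimate only.

```java
// Hypothetical helper, not part of the HBase API: a simplified model of
// how setBatch() and setCaching() affect the number of Results and fetches.
public class ScanCostModel {

    // Number of Result objects the client iterates over.
    // batch <= 0 is treated here as "batching not set": one Result per row.
    static long resultCount(long rows, int colsPerRow, int batch) {
        int cellsPerResult = (batch <= 0) ? colsPerRow : Math.min(colsPerRow, batch);
        // Ceiling division: a row with 20 columns and batch 7 yields 3 Results.
        long resultsPerRow = (colsPerRow + cellsPerResult - 1) / cellsPerResult;
        return rows * resultsPerRow;
    }

    // Number of data-carrying fetches, assuming each fetch returns up to
    // `caching` Results. Requires caching >= 1 and colsPerRow >= 1.
    static long fetchCount(long rows, int colsPerRow, int batch, int caching) {
        long results = resultCount(rows, colsPerRow, batch);
        return (results + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1000;
        int cols = 20;
        int[][] settings = { {0, 1}, {0, 100}, {5, 100}, {7, 100} }; // {batch, caching}
        for (int[] s : settings) {
            System.out.printf("batch=%d caching=%d -> results=%d fetches=%d%n",
                    s[0], s[1],
                    resultCount(rows, cols, s[0]),
                    fetchCount(rows, cols, s[0], s[1]));
        }
    }
}
```

In this model, raising caching is what cuts the number of fetches, while a small batch on narrow rows only multiplies the number of Results, which is in line with the advice above to focus on setCaching() unless the rows are very wide.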