Java – How to scan and delete millions of rows in HBase


What happened
All data from the last month was corrupted by a system error, so we have to delete those records manually and re-enter them. In other words, I want to delete all rows inserted within a certain time period. However, I am finding it difficult to scan and delete millions of rows in HBase.

Possible solutions
I found two ways to bulk delete:
The first is to set a TTL so that the system automatically deletes expired records. But I want to keep the records inserted before last month, so this solution doesn’t work for me.

The second option is to write a client using the Java API:

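import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public static void deleteTimeRange(String tableName, Long minTime, Long maxTime) {
    Table table = null;
    Connection connection = null;

    try {
        // Restrict the scan to cells written in [minTime, maxTime).
        Scan scan = new Scan();
        scan.setTimeRange(minTime, maxTime);
        // HBaseOperator is my own helper class that creates the connection.
        connection = HBaseOperator.getHbaseConnection();
        table = connection.getTable(TableName.valueOf(tableName));
        ResultScanner rs = table.getScanner(scan);

        // Collect a Delete for every matching row, then delete them all at once.
        List<Delete> list = getDeleteList(rs);
        if (list.size() > 0) {
            table.delete(list);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (null != table) {
            try {
                table.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        if (connection != null) {
            try {
                connection.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private static List<Delete> getDeleteList(ResultScanner rs) {
    List<Delete> list = new ArrayList<>();
    try {
        // One Delete per row key returned by the scanner.
        for (Result r : rs) {
            Delete d = new Delete(r.getRow());
            list.add(d);
        }
    } finally {
        rs.close();
    }
    return list;
}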

But this approach collects a Delete for every row returned by the ResultScanner rs into a single list before anything is deleted, so heap usage will be large. And if the program crashes, it has to start from scratch.
So, is there a better way to achieve this?

Solution

I’m not sure how many “millions” of rows your table holds, but the easiest fix is not to put them all in one List at once. Instead, work in manageable steps: call the scanner’s .next(n) function repeatedly and build a separate list for each batch. Like this:

for (Result row : rs.next(numRows)) {
    Delete del = new Delete(row.getRow());
    ...
}

This lets you control, through the numRows parameter, how many rows you hold and delete per batch. How many rows the server returns in a single RPC is governed by the scanner caching setting (Scan.setCaching), so tune the two together. Make the batch large enough to avoid too many round trips to the server, but not so large that it kills your heap. You can also use BufferedMutator to send multiple deletes at once.
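The batching idea can be sketched independently of HBase. The following is a minimal, hypothetical helper (the class name BatchedProcessor and the method processInBatches are illustrative names, not part of any HBase API) that drains an iterator in fixed-size chunks and hands each chunk to a callback, so that only one chunk is held in memory at a time:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: a sketch of the batching pattern only, not an HBase class.
public class BatchedProcessor {

    /**
     * Drains source in chunks of at most batchSize items and passes each
     * chunk to sink. Returns the total number of items handed to sink.
     */
    public static <T> long processInBatches(Iterator<T> source, int batchSize,
                                            Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        long total = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize) {
                // Count before calling the sink, because the sink may modify the list.
                total += batch.size();
                sink.accept(batch);
                // Start a fresh list so the previous batch can be garbage collected.
                batch = new ArrayList<>(batchSize);
            }
        }
        // Flush the final, possibly smaller, batch.
        if (!batch.isEmpty()) {
            total += batch.size();
            sink.accept(batch);
        }
        return total;
    }
}
```

In the HBase case, the source would presumably be the ResultScanner's iterator with each Result mapped to a Delete, and the sink would call table.delete(batch). Since table.delete throws IOException, the callback would need to wrap or rethrow it. Because every batch is deleted before the next one is read, a crash loses at most the batch in flight, and a rerun should only see the rows that are still present.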

Hope this helps you.
