Java – HBase: how to choose a pre-split strategy and how it affects your row keys

I’m trying to pre-split an HBase table. One of the HBaseAdmin Java APIs creates an HBase table from a start key, an end key, and a number of regions. This is the method I used from HBaseAdmin: void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions)

Is there a recommended way to choose the start key and end key based on the dataset?

My approach: assume we have 100 records in the dataset. I want the data divided roughly evenly into 10 regions, so about 10 records per region. To find the start key, I would run scan '/mytable', {LIMIT => 10} and take the last row key as my start key, then run scan '/mytable', {LIMIT => 90} and take the last row key as my end key.

Does this way of finding the start key and end key look good, or is there a better approach?

Edit
I tried the following three methods to pre-split an empty table. None of them worked the way I wanted. I think I need to salt the key to get an even distribution.

PS: I am showing only part of the region information.

1)

byte[][] splits = new RegionSplitter.HexStringSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits);

This gives regions with boundaries like this:

{
    "startkey":"-INFINITY",
    "endkey":"11111111",
    "numberofrows":3628951,
},
{
    "startkey":"11111111",
    "endkey":"22222222",
},
{   
    "startkey":"22222222",
    "endkey":"33333333",
},
{
    "startkey":"33333333",
    "endkey":"44444444",
},
{
    "startkey":"88888888",
    "endkey":"99999999",
},
{
    "startkey":"99999999",
    "endkey":"aaaaaaaa",
},
{
    "startkey":"aaaaaaaa",
    "endkey":"bbbbbbbb",
},
{
    "startkey":"eeeeeeee",
    "endkey":"INFINITY",
}

This is useless because my row keys are composite, in the form 'deptId|month|roleId|regionId', and don’t fit the above boundaries.

2)

byte[][] splits = new RegionSplitter.UniformSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits);

This has the same problem:

{
    "startkey":"-INFINITY",
    "endkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\x99",
}
{
    "startkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\x99",
    "endkey":"33333332",
}
{
    "startkey":"33333332",
    "endkey":"L\\xCC\\xCC\\xCC\\xCC\\xCC\\xCC\\xCB",
}
{
    "startkey":"\\xE6ffffffa",
    "endkey":"INFINITY",
}

3) I tried to provide the start and end keys myself, but I got the following useless regions.

hBaseAdmin.createTable(tabledescriptor, Bytes.toBytes("04120|200808|805|1999"),
                               Bytes.toBytes("01253|201501|805|1999"), 10);
{
    "startkey":"-INFINITY",
    "endkey":"04120|200808|805|1999",
}
{
    "startkey":"04120|200808|805|1999",
    "endkey":"000PTP\\xDC200W\\xD07\\x9C805|1999",
}
{
    "startkey":"000PTP\\xDC200W\\xD07\\x9C805|1999",
    "endkey":"000ptq<200wp6\\xBC805|1999",
}
{
    "startkey":"001\\x11\\x15\\x13\\x1C201\\x15\\x902\\x5C805|1999",
    "endkey":"01253|201501|805|1999",
}
{
    "startkey":"01253|201501|805|1999",
    "endkey":"INFINITY",
}

Solution

First question: In my experience with HBase, I am not aware of any hard and fast rules for choosing the number of regions, the start keys, and the end keys.

But fundamentally,

With your row key design, data should be distributed evenly across the regions and should not hotspot
(see 36.1. Hotspotting in the HBase reference guide).
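
To see why the boundaries in the question lead to a hotspot, it can help to check which region a given row key would land in. The following is a minimal sketch using only the JDK, not HBase code: the class name RegionLookup and its helpers are made up for illustration, and the split points are patterned on the partial HexStringSplit listing in the question (the values not shown there are assumed to follow the same pattern). It relies on the fact that HBase orders row keys lexicographically as unsigned bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative helper (not part of HBase): maps a row key to a pre-split region index. */
final class RegionLookup {

    /**
     * Region 0 covers everything below splits[0]; region i covers
     * [splits[i-1], splits[i]); the last region covers everything from the last split up.
     */
    static int regionIndex(byte[][] splits, byte[] rowKey) {
        int region = 0;
        // Unsigned lexicographic comparison, the same ordering HBase uses for row keys.
        while (region < splits.length
                && Arrays.compareUnsigned(splits[region], rowKey) <= 0) {
            region++;
        }
        return region;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Split points "11111111" .. "eeeeeeee", assumed from the listing in the question. */
    static byte[][] hexStyleSplits() {
        String digits = "123456789abcde";
        byte[][] splits = new byte[digits.length()][];
        for (int i = 0; i < digits.length(); i++) {
            splits[i] = utf8(String.valueOf(digits.charAt(i)).repeat(8));
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[][] splits = hexStyleSplits();
        String[] keys = {"04120|200808|805|1999", "01253|201501|805|1999"};
        for (String key : keys) {
            // Both sample keys start with '0', so both sort below "11111111": region 0.
            System.out.println(key + " -> region " + regionIndex(splits, utf8(key)));
        }
    }
}
```

Because the composite keys start with decimal digits, they can only ever reach the regions whose boundaries start with 0 to 9, and keys with a leading 0 all pile into the first region.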

However, if you define a fixed number of regions, such as the 10 you mentioned, there may no longer be exactly 10 after a large data load. Once a region reaches its size limit, it is split again.

For the way you are creating the table, the HBaseAdmin documentation says: creates a new table with the specified number of regions. The start key specified will become the end key of the first region of the table, and the end key specified will become the start key of the last region of the table (the first region has an empty start key and the last region has an empty end key).
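
The boundaries between the start key and the end key are then spaced evenly over the raw byte values, which is why they stop looking like your composite keys. The sketch below illustrates that general idea with the JDK only. It is my own approximation, not HBase's actual code: the class name KeyInterpolation and its methods are hypothetical, and it assumes both keys have the same length and that the start key sorts before the end key.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch (not HBase code): evenly spaced keys between two byte keys. */
final class KeyInterpolation {

    /** Returns count keys spaced evenly strictly between start and end. */
    static byte[][] between(byte[] start, byte[] end, int count) {
        if (start.length != end.length) {
            throw new IllegalArgumentException("keys must have equal length in this sketch");
        }
        BigInteger lo = new BigInteger(1, start); // signum 1: read the bytes as unsigned
        BigInteger hi = new BigInteger(1, end);
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start key must sort before end key");
        }
        BigInteger step = hi.subtract(lo).divide(BigInteger.valueOf(count + 1L));
        byte[][] out = new byte[count][];
        for (int i = 0; i < count; i++) {
            BigInteger value = lo.add(step.multiply(BigInteger.valueOf(i + 1L)));
            out[i] = toFixedWidth(value, start.length);
        }
        return out;
    }

    /** Right-aligns the magnitude of value in a byte array of the given width. */
    static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may carry an extra leading sign byte
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        byte[][] points = between(
                "aaaa".getBytes(StandardCharsets.UTF_8),
                "eeee".getBytes(StandardCharsets.UTF_8), 3);
        for (byte[] p : points) {
            System.out.println(new String(p, StandardCharsets.UTF_8)); // bbbb, cccc, dddd
        }
    }
}
```

With keys such as "aaaa" and "eeee" the arithmetic happens to land on readable values. With composite keys like 'deptId|month|roleId|regionId', the intermediate values generally will not keep the '|' separators or the digit-only fields, which is consistent with the region boundaries in your third attempt.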

Also, I prefer to create the table via a script with pre-splits, say 0-10, and to design the row key so that it is salted and each row lands on one of the region servers, which avoids hotspots.
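
A salted key with matching pre-splits could look roughly like the sketch below. It uses only the JDK; the class name SaltedKeys, the bucket count of 10, the two-digit prefix format, and the use of Arrays.hashCode as the hash are all assumptions made for illustration, not something prescribed by HBase.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

/** Illustrative sketch: salt row keys into a fixed number of buckets. */
final class SaltedKeys {

    // Assumed bucket count, chosen to match the 10 regions mentioned in the question.
    static final int BUCKETS = 10;

    /** Prefixes the key with a stable two-digit bucket derived from a hash of the key. */
    static String salt(String rowKey) {
        int hash = Arrays.hashCode(rowKey.getBytes(StandardCharsets.UTF_8));
        int bucket = Math.floorMod(hash, BUCKETS); // floorMod keeps the result non-negative
        return String.format(Locale.ROOT, "%02d|%s", bucket, rowKey);
    }

    /** Split points "01" .. "09", giving one region per salt prefix "00" .. "09". */
    static byte[][] splitPoints() {
        byte[][] splits = new byte[BUCKETS - 1][];
        for (int i = 1; i < BUCKETS; i++) {
            splits[i - 1] = String.format(Locale.ROOT, "%02d", i)
                    .getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(salt("04120|200808|805|1999"));
        System.out.println(salt("01253|201501|805|1999"));
        System.out.println(splitPoints().length + " split points");
    }
}
```

The array returned by splitPoints() is the kind of value you would pass to createTable(tabledescriptor, splits). The trade-off is on the read side: a get has to recompute the salt from the original key, and a range scan over the original key order has to be issued once per bucket.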

Edit: If you want to implement your own region split,
you can provide your own implementation of org.apache.hadoop.hbase.util.RegionSplitter.SplitAlgorithm and override

public byte[][] split(int numberOfSplits)
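
As a rough idea of what the body of such a split method could do for keys like yours, here is a sketch that cuts a zero-padded five-digit deptId prefix space into equal ranges. It is plain JDK code and does not implement the real SplitAlgorithm interface, which has further methods to fill in. The class name DecimalPrefixSplit is hypothetical, the id range 00000 to 99999 is an assumption, and I am reading the int argument as the number of regions wanted, so the method returns one boundary fewer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical splitter for row keys that begin with a zero-padded 5-digit decimal id. */
final class DecimalPrefixSplit {

    // Assumed id space: 00000 .. 99999.
    static final int PREFIX_SPACE = 100_000;

    /** Returns numRegions - 1 boundaries that divide the id space into equal ranges. */
    public byte[][] split(int numRegions) {
        if (numRegions < 2 || numRegions > PREFIX_SPACE) {
            throw new IllegalArgumentException("numRegions must be between 2 and " + PREFIX_SPACE);
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            long boundary = (long) PREFIX_SPACE * i / numRegions;
            splits[i - 1] = String.format(Locale.ROOT, "%05d", boundary)
                    .getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }

    public static void main(String[] args) {
        for (byte[] boundary : new DecimalPrefixSplit().split(10)) {
            System.out.println(new String(boundary, StandardCharsets.UTF_8));
        }
    }
}
```

Equal ranges only balance the load if the ids are spread evenly over the whole space. The two sample keys in the question (04120 and 01253) would both fall into the first of ten such ranges, so in practice the boundaries are better taken from the observed key distribution, or the key salted as described above.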

Second question:
My understanding is that you want to find the start row key and end row key of the data inserted into a specific table. Here’s how.

  • To find the start and end row keys of each region, scan the ".META." table (named "hbase:meta" in HBase 0.96 and later).

  • You can visit the master UI at http://hbasemaster:60010 (port 16010 in HBase 1.0 and later) to see how row keys are distributed across regions. The start and end keys are listed there for each region.

  • To see how your keys are organized after pre-splitting the table and inserting data into HBase, use FirstKeyOnlyFilter.

For example: scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'
This displays all 100 row keys.

If you have a lot of data (not the 100 rows you mentioned) and want to dump all row keys, you can run the following from the operating system shell, outside the HBase shell:

echo "scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'" | hbase shell > rowkeys.txt
