Java – How to convert byte offsets in SQLite FTS queries to character offsets in Java

I’m searching an FTS table on Android, and the result reports each match as a byte offset:

col     termno      byteoffset      size
1       0           111             4

The problem is that cursor.getString(colNo) gives me a UTF-16 string, so the byte offset no longer tells me which character of the text matched.

Here’s a similar question: Detect character position in an UTF NSString from a byte offset (was SQLite offsets() and encoding problem)

But I can’t follow the solution given there. Given the byte offset, how do I find the exact character offset in the string (for highlighting)?

Solution

Encode your string back to the same encoding that SQLite uses, then extract the parts you want as bytes and decode them back to strings:

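import java.nio.charset.StandardCharsets;

// Re-encode the UTF-16 string to the UTF-8 bytes that the offsets refer to.
String chars = cursor.getString(colNo);
byte[] bytes = chars.getBytes(StandardCharsets.UTF_8);
// Decode the bytes before the match, and the match itself, back to strings.
String prefix = new String(bytes, 0, byteOffset, StandardCharsets.UTF_8);
String match = new String(bytes, byteOffset, size, StandardCharsets.UTF_8);
// String lengths are counted in UTF-16 chars, which is what highlighting needs.
int charOffset = prefix.length();
int charSize = match.length();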

(This assumes your database stores its text as UTF-8.)
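
To see why the two offsets differ, here is a minimal, self-contained sketch of the same approach that runs without a database. The class name FtsOffsets and the method toCharRange are hypothetical names chosen for illustration, and the sample text stands in for what cursor.getString(colNo) would return.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Android or SQLite.
final class FtsOffsets {

    /**
     * Converts a match given as a UTF-8 byte offset and byte size into
     * {charOffset, charLength}, both counted in UTF-16 chars.
     * Assumes the offsets fall on UTF-8 character boundaries.
     */
    static int[] toCharRange(String text, int byteOffset, int byteSize) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        String prefix = new String(bytes, 0, byteOffset, StandardCharsets.UTF_8);
        String match = new String(bytes, byteOffset, byteSize, StandardCharsets.UTF_8);
        return new int[] { prefix.length(), match.length() };
    }
}
```

For the text "h\u00e9llo w\u00f6rld" (héllo wörld), the word wörld starts at byte 7 and is 6 bytes long, because é and ö each take two bytes in UTF-8. Calling FtsOffsets.toCharRange(text, 7, 6) gives a character offset of 6 and a length of 5, which is the range to highlight in the Java string.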

Unfortunately, this costs an extra round of encoding and decoding. It may be worth adding a fast path for the common case of plain ASCII text, where byte offsets and character offsets are identical.
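
One possible way to avoid the extra encoding and decoding is to walk the string once and count how many UTF-8 bytes each code point would occupy. The sketch below is an alternative to the code above, not the method from the original answer; Utf8Offsets, byteToCharOffset and utf8Length are hypothetical names, and it assumes the text is well-formed (no unpaired surrogates) and stored as UTF-8.

```java
// Hypothetical helper for illustration; assumes well-formed text stored as UTF-8.
final class Utf8Offsets {

    /**
     * Returns the UTF-16 char index that corresponds to the given UTF-8
     * byte offset. If the byte offset falls inside a multi-byte character,
     * the index just after that character is returned.
     */
    static int byteToCharOffset(String text, int byteOffset) {
        int bytes = 0; // UTF-8 bytes consumed so far
        int i = 0;     // current UTF-16 char index
        while (i < text.length() && bytes < byteOffset) {
            int cp = text.codePointAt(i);
            bytes += utf8Length(cp);
            i += Character.charCount(cp); // 2 for supplementary characters
        }
        return i;
    }

    /** Number of bytes UTF-8 uses to encode the given code point. */
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }
}
```

The start of the match is byteToCharOffset(text, byteOffset) and its end is byteToCharOffset(text, byteOffset + size); the difference is the match length in chars. For plain ASCII text every code point counts as one byte, so the result equals the byte offset and no temporary arrays or strings are allocated.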
