How to convert byte offsets in SQLite FTS queries to character offsets in Java
I'm having trouble searching my FTS table on Android: the result reports each match as a byte offset:
col termno byteoffset size
1 0 111 4
The problem is that cursor.getString(colNo) gives me a UTF-16 Java string, so I can't work out which character of the text the byte offset points to.
Here’s a similar problem: Detect character position in an UTF NSString from a byte offset (was SQLite offsets() and encoding problem)
But I can’t understand the solution given there. Given the byte offset, how do I find the exact character offset in the string (for highlighting)?
Solution
Encode your string back into the same encoding that SQLite uses, slice out the parts you need as bytes, and decode those slices back into strings:
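import java.nio.charset.StandardCharsets;
String chars = cursor.getString(colNo);
// Re-encode to UTF-8 so array indices line up with SQLite's byte offsets.
byte[] bytes = chars.getBytes(StandardCharsets.UTF_8);
// Decode the bytes before the match, and the match itself.
String prefix = new String(bytes, 0, byteOffset, StandardCharsets.UTF_8);
String match = new String(bytes, byteOffset, size, StandardCharsets.UTF_8);
// length() is a method on String; both values are in UTF-16 chars.
int charOffset = prefix.length();
int charSize = match.length();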
(This works assuming the database stores its text as UTF-8, which is SQLite's default encoding.)
Unfortunately, this means a redundant round of encoding and decoding for every match. It may be worth adding a fast path for the common case of plain ASCII text, where byte offsets and character offsets are identical.
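One way to avoid the extra byte array is to walk the Java string and count how many UTF-8 bytes each code point would occupy, stopping once the byte offset is reached. The following is only a sketch under stated assumptions: the class name Utf8Offsets and both method names are made up for illustration, the database is assumed to use UTF-8, and the text is assumed to be well formed (no unpaired surrogates). For pure ASCII text the loop advances one char per byte, so the common case stays cheap.

```java
// Hypothetical helper, not part of Android or SQLite.
final class Utf8Offsets {
    private Utf8Offsets() {}

    /**
     * Maps a UTF-8 byte offset to an index into s (in UTF-16 chars).
     * Returns -1 if the offset falls inside a character or past the end.
     */
    static int byteToCharOffset(String s, int byteOffset) {
        int bytes = 0; // UTF-8 bytes consumed so far
        int i = 0;     // current index into s, in UTF-16 chars
        while (i < s.length() && bytes < byteOffset) {
            int cp = s.codePointAt(i);
            bytes += utf8Length(cp);
            i += Character.charCount(cp); // 2 for surrogate pairs, else 1
        }
        return bytes == byteOffset ? i : -1;
    }

    /** Number of bytes UTF-8 uses to encode the given code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }
}
```

For example, with the text "naïve café search", a match at byte offset 7 with size 5 maps to start = byteToCharOffset(text, 7), which is 6, and end = byteToCharOffset(text, 7 + 5), which is 10, so text.substring(start, end) is "café". Calling it twice rescans the prefix; if that matters, the loop could be extended to return both indices in one pass.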