Apache Pig – MATCHES with multiple match conditions
I’m trying to take a logical match criterion such as:
(("Foo" OR "Foo Bar" OR "FooBar") AND ("test" OR "testA" OR "TestB")) OR "TestZ"
and apply it as a filter on a file in Pig:
result = filter inputfields by text matches '(some regex expression here)';
The problem is that I don’t know how to convert the logical expression above to a regular expression for the matches method.
I’ve fiddled with all sorts of things, and the closest I’ve come is this:
((?=.*?\bFoo\b | \bFoo Bar\b))(?=.*?\bTestZ\b)
Any ideas? I also need to do this conversion programmatically if possible.
Some examples:
A – Fast brown Foo skips the inert test (this should pass because it contains Foo and test).
B – Something happened in TestZ (this should also pass because it contains TestZ).
C – Fast brown Foo skips the lazy dog (this should fail because it contains Foo but not test, testA, or TestB).
Thanks
Solution
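Because you’re using Pig, you don’t need to pack the whole condition into a single regular expression. You can combine a few simple regular expressions with Pig’s boolean operators (and, or). Note that matches has to match the entire string, which is why each pattern is wrapped in .* on both sides: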
T = load 'matches.txt' as (str:chararray);
F = filter T by ((str matches '.*(Foo|Foo Bar|FooBar).*' and str matches '.*(test|testA|TestB).*') or str matches '.*TestZ.*');
dump F;
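For the programmatic part of the question, one possible approach is to keep the logical expression as a small tree and render it into a Pig filter condition. The following is only a sketch in Python: the tuple-based tree format and the names EXPR, check_term, to_pig and evaluate are made up for illustration, and it assumes the terms are plain letters, digits and spaces (no regex metacharacters or quotes that would need escaping).

```python
import re

# Hypothetical tree format: a str is a search term,
# a tuple is ("AND" | "OR", child, child, ...).
EXPR = ("OR",
        ("AND",
         ("OR", "Foo", "Foo Bar", "FooBar"),
         ("OR", "test", "testA", "TestB")),
        "TestZ")


def check_term(term):
    """Reject terms that would need regex or Pig string escaping (assumption: plain text only)."""
    if not re.fullmatch(r"[A-Za-z0-9 ]+", term):
        raise ValueError(f"unsupported term: {term!r}")


def to_pig(node, field="str"):
    """Render the tree as a Pig filter condition made of simple matches clauses."""
    if isinstance(node, str):
        check_term(node)
        return f"{field} matches '.*{node}.*'"
    op, *children = node
    if op not in ("AND", "OR"):
        raise ValueError(f"unknown operator: {op!r}")
    # An OR over plain terms collapses into a single alternation.
    if op == "OR" and all(isinstance(c, str) for c in children):
        for c in children:
            check_term(c)
        alternation = "|".join(children)
        return f"{field} matches '.*({alternation}).*'"
    joiner = " and " if op == "AND" else " or "
    return "(" + joiner.join(to_pig(c, field) for c in children) + ")"


def evaluate(node, text):
    """Evaluate the same tree in Python (case-sensitive substring test) to sanity-check inputs."""
    if isinstance(node, str):
        return node in text
    op, *children = node
    results = [evaluate(c, text) for c in children]
    return all(results) if op == "AND" else any(results)


if __name__ == "__main__":
    print("F = filter T by " + to_pig(EXPR) + ";")
```

For the expression in the question this prints the same filter line as above. The evaluate helper is there only to check the logic against sample lines before running the Pig job; like matches, it is case-sensitive.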