Java – Apache Pig – MATCHES with multiple match conditions


I’m trying to take a logical match criterion such as:

(("Foo" OR "Foo Bar" OR FooBar) AND ("test" OR "testA" OR "TestB")) OR TestZ

and apply it to filter the records of a file in Pig:

result = filter inputfields by text matches '(some regex expression here)';

The problem is that I don’t know how to convert the logical expression above to a regular expression for the matches method.

I’ve fiddled with all sorts of things, and the closest I’ve come is this:

((?=.*?\bFoo\b | \bFoo Bar\b))(?=.*?\bTestZ\b)

Any ideas? I also need to do this conversion programmatically if possible.

Some examples:

a – Fast brown Foo skips the inert test (this should pass because it contains Foo and test).

b – Something happened in TestZ (this should also pass because it contains TestZ).

c – Fast brown Foo skips the lazy dog (this should fail because it contains Foo but not test, testA, or TestB).

Thanks

Solution

Because you’re using Pig, you don’t need to pack everything into a single regular expression. You can combine a few simple regular expressions with Pig’s boolean operators (and, or). Note that matches must match the whole string, hence the leading and trailing .* in each pattern:

T = load 'matches.txt' as (str:chararray);
F = filter T by ((str matches '.*(Foo|Foo Bar|FooBar).*' and str matches '.*(test|testA|TestB).*') or str matches '.*TestZ.*');
dump F;
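
For the "programmatically" part of the question, here is a minimal Java sketch of one possible approach. It is not part of the original answer. It relies on the fact that Pig's matches operator uses Java regular expressions with whole-string matching, so the same patterns can be checked with String.matches. The class name PigMatchSketch and the helpers anyOf, accepts, and pigCondition are hypothetical names chosen for illustration, and the sketch assumes every term consists only of letters, digits, and spaces, so no regex or Pig string escaping is needed.

```java
import java.util.Arrays;
import java.util.List;

class PigMatchSketch {

    // Hypothetical helper: turn one OR-group of terms into a whole-string
    // pattern of the form .*(a|b|c).*
    // Assumption: terms contain only letters, digits, and spaces; anything
    // else is rejected rather than escaped.
    static String anyOf(List<String> terms) {
        for (String term : terms) {
            if (!term.matches("[A-Za-z0-9 ]+")) {
                throw new IllegalArgumentException("unsupported term: " + term);
            }
        }
        return ".*(" + String.join("|", terms) + ").*";
    }

    static final String GROUP_A = anyOf(Arrays.asList("Foo", "Foo Bar", "FooBar"));
    static final String GROUP_B = anyOf(Arrays.asList("test", "testA", "TestB"));
    static final String GROUP_C = anyOf(Arrays.asList("TestZ"));

    // Evaluate (A AND B) OR C in Java, mirroring the Pig filter condition.
    static boolean accepts(String text) {
        return (text.matches(GROUP_A) && text.matches(GROUP_B)) || text.matches(GROUP_C);
    }

    // Render the equivalent Pig filter condition for a given field name.
    static String pigCondition(String field) {
        return "((" + field + " matches '" + GROUP_A + "' and "
                + field + " matches '" + GROUP_B + "') or "
                + field + " matches '" + GROUP_C + "')";
    }

    public static void main(String[] args) {
        System.out.println(accepts("Fast brown Foo skips the inert test")); // true
        System.out.println(accepts("Something happened in TestZ"));         // true
        System.out.println(accepts("Fast brown Foo skips the lazy dog"));   // false
        System.out.println("F = filter T by " + pigCondition("str") + ";");
    }
}
```

The three calls to accepts correspond to examples a, b, and c from the question. Matching is case-sensitive, as it is in Pig, so "foo" would not satisfy the first group.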
