Python – How do I make a regular expression that finds the first internal word?

I want to write a regular expression that captures the first inner pair of delimiter words. My code below works in one case but not in the other, where it captures the last pair of words instead.

Please check out my code below.

import re

def testReplaceBetweenWords():
    head_dlmt = 'Head'
    tail_dlmt = 'Tail'

    line0 = "abc_Head_def_Head_inner_inside_Tail_ghi_Tail_jkl"
    line1 = "abc_Head_first_Tail_ghi_Head_second_Tail_opq"

    # Group 1 runs up to the last head delimiter; group 2 starts at the next tail delimiter.
    between_pattern = "(^.*(?<={0}))(?!.*{0}).*?(?={1})(.*)$".format(head_dlmt, tail_dlmt)
    compiled_pattern = re.compile(between_pattern)

    # Case 0: good case: it captures the first inner pair of words.
    result0 = compiled_pattern.search(line0)

    print("original 0    : {0}".format(result0.group(0)))
    print("expected Head : abc_Head_def_Head")
    print("found Head    : {0}".format(result0.group(1)))
    print("expected Tail :                                Tail_ghi_Tail_jkl")
    print("found Tail    : {0}{1}".format(' ' * (result0.regs[2][0]), result0.group(2)))

    print()

    # Case 1: bad case: it captures the last pair of words.
    result1 = compiled_pattern.search(line1)

    print("original 1    : {0}".format(result1.group(0)))
    print("expected Head : abc_Head")
    print("found Head    : {0}".format(result1.group(1)))
    print("expected Tail :                Tail_ghi_Head_second_Tail_opq")
    print("found Tail    : {0}{1}".format(' ' * (result1.regs[2][0]), result1.group(2)))

testReplaceBetweenWords()

The output is as follows.

original 0    : abc_Head_def_Head_inner_inside_Tail_ghi_Tail_jkl
expected Head : abc_Head_def_Head
found Head    : abc_Head_def_Head
expected Tail :                                Tail_ghi_Tail_jkl
found Tail    :                                Tail_ghi_Tail_jkl

original 1    : abc_Head_first_Tail_ghi_Head_second_Tail_opq
expected Head : abc_Head
found Head    : abc_Head_first_Tail_ghi_Head
expected Tail :                Tail_ghi_Head_second_Tail_opq
found Tail    :                                     Tail_opq

The first case works: it captures the first inner pair of words.
The second case does not: it captures the last pair of words, but I expect the first pair.
How can I write a regular expression that handles both cases?

Thank you so much.

Solution

Use the following regular expression:

between_pattern = "^((?:(?!{1}).)*{0}).*?({1}.*)$".format(head_dlmt, tail_dlmt)
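
A minimal sketch of how this pattern behaves on the two sample lines from the question. The helper name split_at_first_pair is hypothetical, and wrapping the delimiters in re.escape is an added precaution that the original answer does not include; it only matters if a delimiter contains regex metacharacters.

```python
import re

head_dlmt = 'Head'
tail_dlmt = 'Tail'

# Same pattern as in the answer, with no whitespace inside it.
# re.escape is an assumption added here, not part of the original answer.
between_pattern = "^((?:(?!{1}).)*{0}).*?({1}.*)$".format(
    re.escape(head_dlmt), re.escape(tail_dlmt))
compiled_pattern = re.compile(between_pattern)


def split_at_first_pair(line):
    """Return (head_part, tail_part), or None if the line has no pair."""
    match = compiled_pattern.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


line0 = "abc_Head_def_Head_inner_inside_Tail_ghi_Tail_jkl"
line1 = "abc_Head_first_Tail_ghi_Head_second_Tail_opq"

print(split_at_first_pair(line0))  # ('abc_Head_def_Head', 'Tail_ghi_Tail_jkl')
print(split_at_first_pair(line1))  # ('abc_Head', 'Tail_ghi_Head_second_Tail_opq')
```

Both results agree with the "expected Head" and "expected Tail" values listed in the question.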


Details

  • The first .* pattern should be replaced with a tempered greedy token, (?:(?!{1}).)*, which matches any 0+ characters that do not start the tail delimiter sequence (so group 1 extends only to the last Head that has no Tail before it).
  • There is no point in using lookarounds here: the delimiters are meant to be part of the captured text, so they can be matched directly inside the capture groups.
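
To illustrate the first point, here is a small sketch that compares a plain greedy prefix with the tempered one on the failing line from the question. The delimiters are written literally instead of going through format, purely to keep the example short.

```python
import re

line1 = "abc_Head_first_Tail_ghi_Head_second_Tail_opq"

# Plain greedy prefix: .* runs to the end of the line and backtracks only
# as far as the LAST "Head".
greedy = re.match(r"(.*Head)", line1)

# Tempered prefix: a character is consumed only if "Tail" does not start
# at that position, so the prefix cannot cross the first "Tail".
tempered = re.match(r"((?:(?!Tail).)*Head)", line1)

print(greedy.group(1))    # abc_Head_first_Tail_ghi_Head
print(tempered.group(1))  # abc_Head
```

The greedy version reproduces the unwanted "found Head" value from the question, while the tempered version stops at the first pair.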

Note that you may want to compile the regular expression with the re.S flag so that it also works on strings that contain line breaks.
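
A short sketch of the difference the flag makes, assuming an input where a line break falls between the two delimiters. The sample string is made up for illustration.

```python
import re

pattern = "^((?:(?!Tail).)*Head).*?(Tail.*)$"
text = "abc_Head_first\n_Tail_ghi"  # hypothetical multi-line input

# Without re.S, "." does not match "\n", so the pattern cannot get from
# "Head" to "Tail" and the search fails.
print(re.search(pattern, text))  # None

# With re.S (an alias of re.DOTALL), "." also matches "\n".
match = re.search(pattern, text, re.S)
print(match.group(1))  # abc_Head
print(match.group(2))  # Tail_ghi
```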
