Python – How do I extract and print substrings that match one of these RegEx patterns while maintaining the order in which they appear in the original input string?

I'm having trouble printing the required elements from a string, because these elements match different patterns (simplified here to 3 regular expression patterns). As shown in the following example, the goal is to print whatever is captured in the substring_to_extract_1 and/or substring_to_extract_2 variables, and these extracts must be printed to the console in the order in which they appear in the input string (which is tricky because there are 3 patterns to check in the matching loop).

import re

input_text = "quizas seria mejor ir de 06 00 am a 11 59 ya que me gusta viajar a esas horas" #example 1

# finding loop (pseudocode):

continue_comparing = True

# 1) The least restrictive RegEx, but it must extract 2 substrings
#    examples: "entre las 19hs y las 20 30", "entre las 18 y las 20", "entre la 1 y las 15hs"
regex_1 = "(?:apartir de |de|entre |desde )\s*(?:las|la )\s*" substring_to_extract_1 "\s*(?:de la tarde|de la noche|de la mañana|) \s*(?:y las |y la |hasta las |hasta la |hasta )\s*" substring_to_extract_2 "\s*(?:de la tarde|de la noche|de la mañana|)"
if (regex_1 == True and continue_comparing == True):
    print(repr(substring_to_extract_1))
    print(repr(substring_to_extract_2))
    continue_comparing = False

# 2) The intermediate RegEx, but "am" or "pm" must be indicated
#    examples: "las 18:00am", "las 1800pm", "las 1800 p.m.", "las 08 00 am", "la 01 00 a.m.", "las 19 15 pm", "las 23 pm"
regex_2 = "(?:las|la )\s*" substring_to_extract_1 "\s*(?:a.m.| a.m|am.| am|a m|p.m.| p.m|pm.| pm|p m)\s*(?:de la tarde|de la noche|de la mañana|)"
if (regex_2 == True and continue_comparing == True):
    print(repr(substring_to_extract_1))
    continue_comparing = False

# 3) The last and most restrictive RegEx; it should only extract a substring like the first regex, but it requires that this substring be preceded by more words
#    examples: "a las 17", "a las 6 y 15", "desde las 15 hs", "a las 15 y 45 hs", "hasta las 17 00", "antes de las 16:06"
regex_3 = "(?:a eso de |a esas de |despues de |antes de |hasta |tipo |desde |apartir de |de |a )\s*(?:las|la )\s*" substring_to_extract_1 "\s*(?:de la tarde|de la noche|de la mañana|)"
if (regex_3 == True and continue_comparing == True):
    print(repr(substring_to_extract_1))
    continue_comparing = False

In the regular expression patterns, I mark the place where the data should be extracted as substring_to_extract_1. For the first regular expression, for example, I think something like r'(\d{1,2}\s*(?::| )\s*\d{1,2})\s*(?:am|pm)' would work. I could then retrieve the captured substrings with .groups().

For the other 2 regular expressions, I’m not sure because they depend a lot on the structure of the reading loop.


Some examples of input strings that might be parsed:

Example 1:

input_text = "quizas seria mejor ir de 06 00 am a 11 59 ya que me gusta viajar a esas horas"

In this case, the following substrings must be extracted one by one and printed in the order in which they appear:

Output I need:

"de 06 00 am"   <--- was extracted by the second regex pattern
"a 11 59"       <--- was extracted by the third regex pattern

Example 2:

input_text = "a eso de las 6 de la tarde o las 19 15 pm deberiamos estar alli, no crees? recuerda que desde las 15 hs ya empieza el evento, pero podemos estar alli antes de las 14 30 hs, aunque eso depende porque el show recien empezara entre las 15 hs y las 18pm"

Output I need:

"de las 6 de la tarde"    <--- was extracted by the third regex pattern
"las 19 15 pm"            <--- was extracted by the second regex pattern
"desde las 15"            <--- was extracted by the third regex pattern
"de las 14 30"            <--- was extracted by the third regex pattern
"entre las 15"            <--- was extracted by the first regex pattern
"y las 18pm"              <--- was extracted by the first regex pattern

Example 3:

input_text = "Hay que estar presentes entre las 19hs y las 20 30 hs, y seguro salimos a las 23 pm"

Output I need:

"entre las 19"            <--- was extracted by the first regex pattern
"y las 20 30"             <--- was extracted by the first regex pattern
"a las 23 pm"             <--- was extracted by the second regex pattern

Example 4:

input_text = "A las 19 salimos!! es importante llegar alla antes de las 20 30 hs"

Output I need:

"A las 19"                <--- was extracted by the third regex pattern
"de las 20 30"            <--- was extracted by the third regex pattern

Example 5:

input_text = "A las 19:30 salimos!! es importante llegar alla antes de las 20 30 hs, ya que a las 21: pm cierran algunos negocios, sin embargo el cine esta abierto hasta las 23:30 pm de la noche"

Output I need:

"A las 19:30"                    <--- was extracted by the third regex pattern
"de las 20 30"                   <--- was extracted by the third regex pattern
"a las 21: pm"                   <--- was extracted by the second regex pattern
"hasta las 23:30 pm de la noche" <--- was extracted by the second regex pattern

How should I scan the input string? In other words, how should I structure a block of code that evaluates these patterns and keeps the matches in their original order?

Solution

Try (regex101):

import re

test_cases = [
    "a eso de las 6 de la tarde o las 19 15 pm deberiamos estar alli, no crees? recuerda que desde las 15 hs ya empieza el evento, pero podemos estar alli antes de las 14 30 hs, aunque eso depende porque el show recien empezara entre las 15 hs y las 18pm",
    "quizas seria mejor ir de 06 00 am a 11 59 ya que me gusta viajar a esas horas",
    "Hay que estar presentes entre las 19hs y las 20 30 hs, y seguro salimos a las 23 pm",
    "A las 19 salimos!! es importante llegar alla antes de las 20 30 hs",
]

pat = re.compile(
    r"\b(?:de las|entre las|desde las|y las|a las|las|de|a)\s+\d+(?:\s+\d+)?\s*(?:pm|am|de la tarde)?",
    flags=re.I,
)

for t in test_cases:
    x = pat.findall(t)
    print(t)
    print("-" * 80)
    print(*map(str.strip, x), sep="\n")
    print()

This prints:

a eso de las 6 de la tarde o las 19 15 pm deberiamos estar alli, no crees? recuerda que desde las 15 hs ya empieza el evento, pero podemos estar alli antes de las 14 30 hs, aunque eso depende porque el show recien empezara entre las 15 hs y las 18pm
--------------------------------------------------------------------------------
de las 6 de la tarde
las 19 15 pm
desde las 15
de las 14 30
entre las 15
y las 18pm

quizas seria mejor ir de 06 00 am a 11 59 ya que me gusta viajar a esas horas
--------------------------------------------------------------------------------
de 06 00 am
a 11 59

Hay que estar presentes entre las 19hs y las 20 30 hs, y seguro salimos a las 23 pm
--------------------------------------------------------------------------------
entre las 19
y las 20 30
a las 23 pm

A las 19 salimos!! es importante llegar alla antes de las 20 30 hs
--------------------------------------------------------------------------------
A las 19
de las 20 30
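
The order is preserved because the regex engine scans the string from left to right and returns non-overlapping matches in the order it finds them. A minimal sketch to check this, using the same pattern with re.finditer so that the start offset of each match is visible (the variable names text and found are only illustrative):

```python
import re

# Same pattern as in the answer above.
pat = re.compile(
    r"\b(?:de las|entre las|desde las|y las|a las|las|de|a)\s+\d+(?:\s+\d+)?\s*(?:pm|am|de la tarde)?",
    flags=re.I,
)

text = "Hay que estar presentes entre las 19hs y las 20 30 hs, y seguro salimos a las 23 pm"

# finditer yields match objects left to right, so the start offsets only increase.
found = [(m.start(), m.group().strip()) for m in pat.finditer(text)]
for start, piece in found:
    print(start, repr(piece))
```

Because all alternatives live in one pattern, no extra sorting step is needed.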


Edit: To store the extracted substrings in a list:

out = []
for t in test_cases:
    x = pat.findall(t)
    out.append(list(map(str.strip, x)))

print(out)

This prints:

[
    [
        "de las 6 de la tarde",
        "las 19 15 pm",
        "desde las 15",
        "de las 14 30",
        "entre las 15",
        "y las 18pm",
    ],
    ["de 06 00 am", "a 11 59"],
    ["entre las 19", "y las 20 30", "a las 23 pm"],
    ["A las 19", "de las 20 30"],
]
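
If only the time itself is needed, without the leading "de las", "a", and so on, capture groups can split each match into parts, which is what .groups() or .group(name) is for. The sketch below is a variant of the pattern above with two named groups; the names prefix and time (and the variable pairs) are assumptions made for illustration, not part of the original answer.

```python
import re

# Variant of the pattern above with named groups (names are illustrative).
pat_groups = re.compile(
    r"\b(?P<prefix>de las|entre las|desde las|y las|a las|las|de|a)\s+"
    r"(?P<time>\d+(?:\s+\d+)?\s*(?:pm|am|de la tarde)?)",
    flags=re.I,
)

text = "quizas seria mejor ir de 06 00 am a 11 59 ya que me gusta viajar a esas horas"

# Use finditer rather than findall so each group can be read by name.
pairs = [
    (m.group("prefix"), m.group("time").strip())
    for m in pat_groups.finditer(text)
]
print(pairs)  # [('de', '06 00 am'), ('a', '11 59')]
```

Note that once a pattern contains capture groups, findall returns tuples of the groups instead of the whole match, which is why finditer is used here.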

Edit 2: To also accept a colon (:) between hours and minutes:

import re

test_cases = [
    "a eso de las 6 de la tarde o las 19 15 pm deberiamos estar alli, no crees? recuerda que desde las 15 hs ya empieza el evento, pero podemos estar alli antes de las 14 30 hs, aunque eso depende porque el show recien empezara entre las 15 hs y las 18pm",
    "quizas seria mejor ir de 06 00 am a 11 59 ya que me gusta viajar a esas horas",
    "Hay que estar presentes entre las 19hs y las 20 30 hs, y seguro salimos a las 23 pm",
    "A las 19 salimos!! es importante llegar alla antes de las 20 30 hs",
    "A las 19:30 salimos!! es importante llegar alla antes de las 20 30 hs, ya que a las 21: pm cierran algunos negocios, sin embargo el cine esta abierto hasta las 23:30 pm de la noche",
]

pat = re.compile(
    r"\b(?:de las|entre las|desde las|y las|a las|las|de|a)\s+\d+(?:[\s:]+)?(?:\d+)?\s*(?:pm|am|de la tarde)?",
    flags=re.I,
)

out = []
for t in test_cases:
    x = pat.findall(t)
    out.append(list(map(str.strip, x)))

print(out)

This prints:

[
    [
        "de las 6 de la tarde",
        "las 19 15 pm",
        "desde las 15",
        "de las 14 30",
        "entre las 15",
        "y las 18pm",
    ],
    ["de 06 00 am", "a 11 59"],
    ["entre las 19", "y las 20 30", "a las 23 pm"],
    ["A las 19", "de las 20 30"],
    ["A las 19:30", "de las 20 30", "a las 21: pm", "las 23:30 pm"],
]
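
If the three patterns from the question have to stay separate (for example, to know which one produced each extract), one possible approach is to collect the matches of every pattern together with their positions, sort them by start offset, and drop any match that overlaps one already kept. The following is only a sketch under that assumption: the three patterns are simplified, hypothetical stand-ins for the ones in the question, and extract_in_order is an illustrative helper name.

```python
import re

# Hypothetical, simplified stand-ins for the three patterns in the question.
PATTERNS = [
    ("range", re.compile(
        r"\bentre las?\s+\d+(?:[\s:]+\d+)?(?:\s*hs)?\s+y las?\s+\d+(?:[\s:]+\d+)?", re.I)),
    ("am_pm", re.compile(
        r"\b(?:a )?las?\s+\d+(?:[\s:]+\d+)?\s*(?:am|pm)\b", re.I)),
    ("prefixed", re.compile(
        r"\b(?:antes de|desde|hasta|a)\s+las?\s+\d+(?:[\s:]+\d+)?", re.I)),
]


def extract_in_order(text):
    """Return (label, substring) pairs sorted by position in text."""
    candidates = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            candidates.append((m.start(), m.end(), label, m.group()))

    # Earliest start first; at the same start, prefer the longest match.
    candidates.sort(key=lambda c: (c[0], -(c[1] - c[0])))

    result = []
    last_end = 0
    for start, end, label, piece in candidates:
        if start >= last_end:  # skip anything overlapping a kept match
            result.append((label, piece))
            last_end = end
    return result


text = "Hay que estar presentes entre las 19hs y las 20 30 hs, y seguro salimos a las 23 pm"
for label, piece in extract_in_order(text):
    print(f"{piece!r}  <--- {label}")
```

Compared with the single combined pattern, this costs one scan per pattern plus a sort, but it keeps each pattern readable on its own and records which one matched.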
