Python – Regular expressions on strings to match sequences of characters


Setup

I have a large number of product images, some of which have the SKU of the product in the file name.

I need to check if the file name contains the SKU of the product.

All SKUs consist of 5 digits, an underscore, and 2 digits; for example, '10008_01', '23521_18', etc.


My code

I’m using a regular expression I found here:

for image in product_image_list:
    if re.match(r"^[0-9]{5}$" + '_' + r"^[0-9]{2}$", image):
        print('Match: ' + image)
    else:
        print("NO match: " + image)

Where:

  • image is the name of the image file, such as 'FINAL 10008_01_angle.jpeg' or 'FINAL 10008_detail_B.jpeg'
  • product_image_list is a list of images.

Question

The above code never matches; it only prints 'NO match'.

How do I get it to work? That is, how do I get:

'Match: FINAL 10008_01_angle.jpeg'

'NO match: FINAL 10008_detail_B.jpeg'

Solution

The pattern ^[0-9]{5}$_^[0-9]{2}$ will never match any string: the first $ anchor requires the end of the string, but the pattern then has more to match (_, then the start of the string, 2 digits, and the end of the string).
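
As a quick check of this explanation, the minimal sketch below runs the pattern from the question against a file name from the question and against a bare SKU. The variable names (broken, samples, broken_results) are illustrative and not part of the original code.

```python
import re

# The pattern from the question, concatenated into a single string.
broken = r"^[0-9]{5}$" + "_" + r"^[0-9]{2}$"

# One file name from the question, plus a bare SKU (illustrative input).
samples = ["FINAL 10008_01_angle.jpeg", "10008_01"]

# The first `$` asserts the end of the string, yet the pattern then
# demands a literal `_`, so no input can satisfy it.
broken_results = [re.match(broken, s) for s in samples]
print(broken_results)  # [None, None]
```

Both results are None, even for the bare SKU, because nothing can follow the first $.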

You need to fix the regular expression pattern to match <5-digit>_<2-digit> substrings that are not enclosed in other digits, and use the pattern with re.search (because re.match only looks for a match at the beginning of the string):

if re.search(r'(?<!\d)[0-9]{5}_[0-9]{2}(?!\d)', image):

Here,

  • (?<!\d) – (negative lookbehind) matches a position in the string that is not immediately preceded by a digit
  • [0-9]{5} – 5 digits
  • _ – an underscore
  • [0-9]{2} – 2 digits
  • (?!\d) – (negative lookahead) there can be no digit immediately to the right of the current position.
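
To see what the two lookarounds contribute, here is a small sketch. The first two file names come from the question; the last two, and the names SKU_RE and found, are hypothetical and exist only for illustration.

```python
import re

# 5 digits, underscore, 2 digits, with no digit directly before or after.
SKU_RE = re.compile(r'(?<!\d)[0-9]{5}_[0-9]{2}(?!\d)')

names = [
    "FINAL 10008_01_angle.jpeg",   # from the question: contains a SKU
    "FINAL 10008_detail_B.jpeg",   # from the question: no 2-digit part
    "FINAL 210008_01_angle.jpeg",  # hypothetical: 6 digits before the underscore
    "FINAL 10008_012_angle.jpeg",  # hypothetical: 3 digits after the underscore
]

# re.search scans the whole name; only the first name should be accepted.
found = {name: bool(SKU_RE.search(name)) for name in names}
for name, ok in found.items():
    print(ok, name)

# re.match only tries position 0, so it misses a SKU that follows "FINAL ".
match_at_start = SKU_RE.match(names[0])
print(match_at_start)  # None
```

Without the lookarounds, the third and fourth names would be accepted as well, because 10008_01 appears inside the longer digit runs.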


To print the matched SKU, use:

import re

for image in product_image_list:
    # Search anywhere in the file name for a standalone SKU.
    m = re.search(r'(?<!\d)[0-9]{5}_[0-9]{2}(?!\d)', image)
    if m:
        print('Matched SKU: {}'.format(m.group()))
    else:
        print("NO match found in '{}'.".format(image))

To match multiple occurrences, use re.findall:

re.findall(r'(?<!\d)[0-9]{5}_[0-9]{2}(?!\d)', image)
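
A short sketch of re.findall on a file name that holds two SKUs. The file name and the variable skus are made up for illustration; the SKU values are the examples from the question.

```python
import re

# Hypothetical file name containing two SKUs.
image = "SET 10008_01 and 23521_18.jpeg"

# The pattern has no capturing groups, so findall returns the full matches.
skus = re.findall(r'(?<!\d)[0-9]{5}_[0-9]{2}(?!\d)', image)
print(skus)  # ['10008_01', '23521_18']
```

When nothing matches, re.findall returns an empty list, so the result can be tested directly with if skus:.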
