Python – Use Python regular expressions to capture occupants and prefixes

Use Python regular expressions to capture occupants and prefixes… here is a solution to the problem.

Use Python regular expressions to capture occupants and prefixes

I’m trying to write a regular expression for Python to capture the various forms of “archipelago” that appear in the corpus.

Here is a test string:

This is my sentence about islands, archipelagos and archipelago space. I want to make sure that the cats of the archipelago are not forgotten. We must not forget the meta-archipelago and proto-archipelago historians, who tend to spell the plural "archipelagoes".

I want to capture the following from a string:

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

Try 1

Use regular expressions (archipelag.*?) \b and use Pythex for testing, I captured a portion of all six forms. But there are also problems:

  1. Archipelago's is captured only as Archipelago. I want possessiveness.
  2. Meta-archipelagic

  3. is captured only as archipelagic. I want to be able to catch prefixes with hyphens.
  4. Protoarchipelagic is captured only as archipelagic. I want to be able to capture non-hyphenated prefixes.

Try 2

If I try to use a regular expression (archipelag.*?) \s(See Pythex), all archipelago's will now be captured, but the first instance of the comma followed by the comma will also be captured (e.g., archipelagos,)。 It failed to fully capture the final 'archipelagoes.'.

Solution

Regular expression ((?:\ b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?) Applies to this. If you have other requirements, you may want to modify it further.

Note that the expressions are grouped using non-capture groups (?:) so that we can use ? to match zero or one

import re

pat = re.compile(r"((?:\ b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)")

corpus = "This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'"

for match in pat.findall(corpus):
    print(match)

Print

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

Here it is on regex101

Related Problems and Solutions