Use Python regular expressions to capture possessives and prefixes
I’m trying to write a regular expression for Python to capture the various forms of “archipelago” that appear in the corpus.
Here is a test string:
This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'
I want to capture the following from the string:
archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes
Try 1
Using the regular expression (archipelag.*?)\b
and testing it with Pythex, I capture at least part of each of the six forms. But there are problems:
archipelago's is captured only as archipelago. I want the possessive.
meta-archipelagic is captured only as archipelagic. I want to be able to capture hyphenated prefixes.
protoarchipelagic is captured only as archipelagic. I want to be able to capture non-hyphenated prefixes.
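The behaviour of this first attempt can be reproduced outside Pythex. The following is a minimal sketch, assuming a test string that contains all six target forms; the names corpus and attempt1 are only illustrative.

```python
import re

# Test string containing all six target forms (assumed here for illustration).
corpus = (
    "This is my sentence about islands, archipelagos, and archipelagic spaces. "
    "I want to make sure that the archipelago's cat is not forgotten. "
    "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
    "who tend to spell the plural 'archipelagoes.'"
)

# Lazy match from "archipelag" up to the nearest word boundary.
attempt1 = re.findall(r"(archipelag.*?)\b", corpus)
print(attempt1)
# ['archipelagos', 'archipelagic', 'archipelago', 'archipelagic', 'archipelagic', 'archipelagoes']
```

The apostrophe is a non-word character, so \b is satisfied immediately after archipelago and the 's is dropped. The match can also only begin at archipelag, so nothing in front of it is ever included.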
Try 2
If I instead use the regular expression (archipelag.*?)\s
(see Pythex), the possessive archipelago's is now captured in full, but the comma after the first form is captured as well (e.g., archipelagos,). It also fails to capture the final archipelagoes at all, because no whitespace follows it.
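The second attempt can be checked the same way. Again this is only a sketch, assuming a test string that contains all six target forms and ends right after the quoted plural; corpus and attempt2 are illustrative names.

```python
import re

# Test string containing all six target forms (assumed here for illustration).
corpus = (
    "This is my sentence about islands, archipelagos, and archipelagic spaces. "
    "I want to make sure that the archipelago's cat is not forgotten. "
    "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
    "who tend to spell the plural 'archipelagoes.'"
)

# Lazy match from "archipelag" up to the next whitespace character.
attempt2 = re.findall(r"(archipelag.*?)\s", corpus)
print(attempt2)
# ['archipelagos,', 'archipelagic', "archipelago's", 'archipelagic', 'archipelagic']
```

Only five results come back: the trailing comma is swallowed in the first one, the prefixes are still missing, and the last form is skipped because the string ends without whitespace after it.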
Solution
The regular expression ((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)
works for this test string. If you have other requirements, you may need to modify it further.
Note that the optional parts are wrapped in non-capturing groups (?:...)
so that ? can apply to each group as a whole, matching it zero or one times: (?:\b\w+\b-)? allows an optional hyphenated prefix, and (?:'s)? allows an optional possessive. The \w* before archipelag picks up non-hyphenated prefixes such as proto.
import re
# Optional hyphenated prefix, the word itself, then an optional possessive 's.
pat = re.compile(r"((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)")
corpus = "This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'"
for match in pat.findall(corpus):
    print(match)
archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes
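To see why the optional pieces are wrapped in non-capturing groups, it may help to compare how re.findall treats the two kinds of group. This is a small sketch on a shortened, made-up sample string, with simplified patterns rather than the full expression above.

```python
import re

# Short, made-up sample for illustration only.
text = "meta-archipelagic and archipelagos"

# With a capturing group, findall returns only what the group matched
# (an empty string where the optional group did not take part).
capturing = re.findall(r"(\w+-)?archipelag\w*", text)
print(capturing)
# ['meta-', '']

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"(?:\w+-)?archipelag\w*", text)
print(non_capturing)
# ['meta-archipelagic', 'archipelagos']
```

In the full expression, the single outer pair of parentheses is the only capturing group, so findall yields one complete string per match.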