Python – How to get a better lemma from Spacy

While “PM” can mean “afternoon (time)”, it can also mean “Prime Minister”.

I want to capture the latter. I would like the lemma of “PM” to come back as “Prime Minister”. How can I do this using spaCy?

Here is an example that returns an unexpected lemma:

>>> import spacy
>>> #nlp = spacy.load('en')
>>> nlp = spacy.load('en_core_web_lg')
>>> doc = nlp(u'PM means prime minister')
>>> for word in doc:
...     print(word.text, word.lemma_)
... 
PM pm
means mean
prime prime
minister minister

According to the documentation at https://spacy.io/api/annotation, spaCy takes its English lemmatization data from WordNet:

A lemma is the uninflected form of a word. The English lemmatization data is taken from WordNet.

When I look up “pm” in WordNet, it shows “Prime Minister” as one of the lemmas.

What am I missing here?

Solution

I think clarifying some common NLP tasks helps answer your question.

Lemmatization is the process of finding the canonical form of a word given its different inflections. For example, run, runs, ran, and running are all forms of the same lexeme: run. If you lemmatize run, runs, and ran, the output is run in each case. In your example sentence, notice how it lemmatizes means to mean.

Given this, the task you want to perform doesn’t sound like lemmatization. A silly counterexample might help solidify the idea: what would the inflected variants of a hypothetical lemma “pm” be? pming, pmed, pms? These are not actual words.
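To make the distinction concrete, here is a toy lookup-based lemmatizer in plain Python. The table and the function name are made up for illustration, and this is not how spaCy stores or applies its lemmatization data; it only shows that a lemmatizer maps inflected forms to a base form and has nothing to say about abbreviations.

```python
# Hypothetical, hand-written table: inflected form -> lemma.
LEMMA_TABLE = {
    "runs": "run",
    "ran": "run",
    "running": "run",
    "means": "mean",
}


def toy_lemma(token):
    """Return the lemma from the table, or the lowercased token if unknown."""
    word = token.lower()
    return LEMMA_TABLE.get(word, word)


lemmas = [toy_lemma(t) for t in ["run", "runs", "ran", "running", "PM"]]
print(lemmas)
```

“PM” is not an inflection of anything, so it simply falls through lowercased, much like the pm in your output.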

It sounds like your task may be closer to named entity recognition (NER), which you can also do in spaCy. To iterate through the detected entities in a parsed document, you can use the .ents attribute as follows:

>>> for ent in doc.ents:
...     print(ent, ent.label_)

For the sentence you gave, spaCy (v2.0.5) does not detect any entities. If you replace “PM” with “P.M.”, it detects that as an entity, but labels it GPE.

The best course of action depends on your task. If you want your own classification for the “PM” entity, I’d look at setting entity annotations. If you want to extract every mention of “PM” from a large number of documents, use the matcher in a pipeline (https://spacy.io/usage/processing-pipelines#component-example2).
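As a rough sketch of how those two suggestions can fit together, the snippet below uses a Matcher to find every “PM” token and then sets the matches as entities on the doc. It assumes a recent spaCy (the list-of-patterns form of matcher.add is the v3 style) and uses a blank English pipeline so no model download is needed. The label TITLE, the EXPANSIONS dictionary, and the function name are hypothetical choices for illustration, not anything spaCy provides.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
# Match the exact, case-sensitive token text "PM".
matcher.add("PRIME_MINISTER", [[{"ORTH": "PM"}]])

# Hypothetical mapping from the matched text to the expansion we want.
EXPANSIONS = {"PM": "Prime Minister"}


def tag_pm_mentions(text):
    """Return a doc in which every "PM" token is set as a TITLE entity."""
    doc = nlp(text)
    spans = [
        Span(doc, start, end, label="TITLE")  # "TITLE" is a made-up label
        for _match_id, start, end in matcher(doc)
    ]
    # Assigning to doc.ents replaces the entities, so keep existing ones.
    doc.ents = list(doc.ents) + spans
    return doc


doc = tag_pm_mentions("The PM met the press after the PM of Canada left.")
for ent in doc.ents:
    print(ent.text, ent.label_, EXPANSIONS[ent.text])
```

With a trained pipeline such as en_core_web_lg, the statistical NER may already have labelled some of the same tokens, and assigning overlapping spans to doc.ents raises an error, so you would need to drop or filter the conflicting spans first.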
