Python – Break word strings according to custom dictionaries


Break word strings according to custom dictionaries

I want to tokenize a list of strings based on my custom dictionary.

The list of strings is as follows:

lst = ['vitamin c juice', 'organic supplement'] 

Custom dictionary:

dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin c'}

My expected result:

'vitamin c juice' --> [(3, 1), (1, 1)]
'organic supplement' --> [(0, 1), (2, 1)]

My current code:

import gensim
import gensim.corpora as corpora
from gensim.utils import tokenize
dct = corpora.Dictionary([list(x) for x in tup_list])
corpus = [dct.doc2bow(text) for text in [s for s in lst]]

The error message I get is: TypeError: doc2bow expects an array of unicode tokens on input, not a single string. However, I don't want to simply split "vitamin c" into "vitamin" and "c". Instead, I want to segment the strings based on the words already in my dct; that is, "vitamin c" should stay a single token.
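
For illustration, here is a minimal sketch (assuming gensim is installed, and building a Dictionary whose ids happen to match the dct above) of why the call fails and why a plain split() cannot recover the multi-word keyword:

from gensim.corpora import Dictionary

lst = ['vitamin c juice', 'organic supplement']
# One token per document, so ids follow first-appearance order:
# 0: organic, 1: juice, 2: supplement, 3: 'vitamin c'
dct = Dictionary([['organic'], ['juice'], ['supplement'], ['vitamin c']])

# dct.doc2bow(lst[0])  # raises the TypeError: a plain string is not a token list
print(dct.doc2bow(lst[0].split()))
# [(1, 1)] -- 'vitamin' and 'c' are unknown tokens and get dropped;
# only 'juice' (id 1) is counted, and 'vitamin c' is never matched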

Solution

You first have to invert the dictionary so that each keyword becomes a key and its token index the value. You can then use a regular expression to break the list entries down into keywords, and look each keyword up in the inverted dictionary to find its corresponding token index.

For example:

lst = ['vitamin c juice', 'organic supplement'] 
dct = {0: 'organic', 1: 'juice', 2: 'supplement', 3: 'vitamin c'}

import re
from collections import Counter
keywords    = { keyword: token for token, keyword in dct.items() }  # inverted dictionary: keyword -> token index
sortedKw    = sorted(keywords, key=lambda x: -len(x))                # longest keywords first
pattern     = re.compile("|".join(sortedKw))                         # regular expression: kw1|kw2|...
lstKeywords = [ pattern.findall(item) for item in lst ]              # list items --> keyword lists
tokenGroups = [ [keywords[word] for word in words] for words in lstKeywords ]  # keyword lists --> token index lists
result      = [ list(Counter(tokens).items()) for tokens in tokenGroups ]      # token index lists --> (token, count) pairs
print(result) # [[(3, 1), (1, 1)], [(0, 1), (2, 1)]]
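
As an optional sanity check, the token indexes in result can be mapped back to keywords through the original dct:

readable = [ [(dct[token], count) for token, count in doc] for doc in result ]
print(readable)  # [[('vitamin c', 1), ('juice', 1)], [('organic', 1), ('supplement', 1)]]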

The regular expression takes the form: keyword1|keyword2|keyword3

Because the "|" operator in regular expressions is not greedy (it tries the alternatives from left to right and takes the first one that matches), longer keywords must appear first in the pattern. That is why the keywords are sorted by decreasing length before the expression is built.
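
To see why the ordering matters, consider a hypothetical case (not in the dct above) where the dictionary also contained the shorter keyword 'vitamin':

import re
# 'vitamin' is a prefix of 'vitamin c', so branch order decides the match:
print(re.findall('vitamin|vitamin c', 'vitamin c juice'))  # ['vitamin']   (wrong)
print(re.findall('vitamin c|vitamin', 'vitamin c juice'))  # ['vitamin c'] (right)

Note that if your keywords could contain regex metacharacters, each one should be wrapped in re.escape() before joining.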

After that, simply convert each list item into a list of keywords (that is what pattern.findall() does), then use the inverted dictionary to convert each keyword into its token index.

[UPDATE] To count the number of occurrences of each token, each list of token indexes is fed to a Counter (a specialized dictionary from the collections module), which performs the counting.
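
In isolation, the counting step looks like this (the repeated id 1 is a made-up example of a keyword occurring twice):

from collections import Counter
print(list(Counter([3, 1, 1]).items()))  # [(3, 1), (1, 2)]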
