Python – How to find the longest match in python of a string containing the word in focus


Python/programming newbie, so not quite sure how to express this….

What I want to do is this: enter a sentence, find all matches between the entered sentence and a set of stored sentences/strings, and return the longest combination of matching strings.

I think the answer has something to do with regular expressions, but I haven't started learning them yet and don't want to if I don't need to.

My question: Are regular expressions the solution to this problem? Or is there a way to do this without importing anything?

If it helps you understand my problem/thoughts, here's pseudocode for what I'm trying to do:

input = 'i play soccer and eat pizza on the weekends'
focus_word = 'and'

ss = [
      'i play soccer and baseball',
      'i eat pizza and apples',
      'every day i walk to school and eat pizza for lunch',
      'i play soccer but eat pizza on the weekend',
     ]

match = MatchingFunction(input, focus_word, ss)
# input should match with all except ss[3]

# ss[0] match = 'i play soccer and'
# ss[1] match = 'and'
# ss[2] match = 'and eat pizza'

# the returned value match should be 'i play soccer and eat pizza'

Solution

It sounds like you're looking for the longest common substring between your input string and each string in the database. Suppose you have an LCS function that finds the longest common substring of two strings; then you can do this:

> [LCS(input, s) for s in ss]
['i play soccer and ',
 ' eat pizza ',
 ' and eat pizza ',
 ' eat pizza on the weekend']
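The LCS function is only assumed above. As a minimal sketch of what it could look like without importing anything, here is a standard dynamic-programming version; the name `LCS` follows the answer, while `sentence` stands in for `input` to avoid shadowing the built-in, and the tie-breaking rule (first longest match wins) is an assumption:

```python
def LCS(a, b):
    """Return the longest common substring of a and b.

    If several substrings share the maximum length, the one that
    ends earliest in `a` is returned (an arbitrary choice).
    """
    best_len = 0   # length of the best match found so far
    best_end = 0   # exclusive end index of that match in `a`
    # prev[j] = length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


sentence = 'i play soccer and eat pizza on the weekends'
ss = [
    'i play soccer and baseball',
    'i eat pizza and apples',
    'every day i walk to school and eat pizza for lunch',
    'i play soccer but eat pizza on the weekend',
]
print([LCS(sentence, s) for s in ss])
```

This runs in time proportional to `len(a) * len(b)`, which is fine for short sentences but may be slow for long texts.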

Then, it sounds like you’re looking for the most repeated substring in the string list. (Correct me if I’m wrong, but I’m not quite sure what you’re looking for in general!) Based on the array output above, which string combination will you use to create the output string?


Based on your comments, I think this should solve the problem:

> parts = [p for p in (LCS(input, s) for s in ss) if focus_word in p]
> parts
['i play soccer and ', ' and eat pizza ']

Then, get rid of the repetition in this example:

> "".join([parts[0]] + [p.replace(focus_word, "").strip() for p in parts[1:]])
'i play soccer and eat pizza'
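Putting the steps together, one possible shape for the `MatchingFunction` from the question is sketched below. This is not the answer's exact method: it uses the standard library's `difflib.SequenceMatcher` for the common-substring step, and it merges the parts word by word, keeping the longest run of words before the focus word and the longest run after it. The names `matching_function` and `sentence` are hypothetical, and the sketch assumes the focus word appears once in the input.

```python
from difflib import SequenceMatcher


def LCS(a, b):
    """Longest common substring of a and b, via difflib."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def matching_function(sentence, focus_word, stored):
    """Combine the common substrings that contain focus_word as a whole word.

    Returns an empty string when no stored sentence shares a
    substring containing the focus word.
    """
    best_left, best_right = [], []
    found = False
    for s in stored:
        words = LCS(sentence, s).split()
        if focus_word not in words:
            continue  # this common substring does not include the focus word
        found = True
        i = words.index(focus_word)
        left, right = words[:i], words[i + 1:]
        # Keep the longest context seen on each side of the focus word.
        if len(left) > len(best_left):
            best_left = left
        if len(right) > len(best_right):
            best_right = right
    if not found:
        return ""
    return " ".join(best_left + [focus_word] + best_right)


sentence = 'i play soccer and eat pizza on the weekends'
ss = [
    'i play soccer and baseball',
    'i eat pizza and apples',
    'every day i walk to school and eat pizza for lunch',
    'i play soccer but eat pizza on the weekend',
]
print(matching_function(sentence, 'and', ss))
```

Comparing whole words avoids removing `and` from inside a word such as `sandwich`, which `p.replace(focus_word, "")` would do. One limitation: a common substring can start or end in the middle of a word, so the edges of each part may contain word fragments.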
