Python regular expression: get everything before a trailing “ (year)”
I have a data frame column called “movie_title” that contains the movie name and year. Here are the two kinds of title that appear in that column:
title1='Toy Story (1995)'
title2='City of Lost Children, The (Cité des enfants perdus, La) (1995)'
I’d like to split it into two columns with the title and the year of release. I was able to successfully extract the year using the following regular expression:
re.findall('[1-2][0-9]{3}', string)[0]
I need help writing another regular expression that extracts the title (excluding the year and its parentheses).
For example, title1 and title2 should look like this:
title1='Toy Story'
title2='City of Lost Children, The (Cité des enfants perdus, La)'
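For reference, the year extraction from the question can be wrapped into a runnable snippet. This is only a sketch: the helper names `extract_year` and `extract_trailing_year` are made up for illustration, and the second helper is a variant that is not part of the question. The reason to show it is that the original pattern is unanchored, so it returns the first four-digit run starting with 1 or 2, even when that run is part of the title. The last title below is an illustrative input, not taken from the column described above.

```python
import re

def extract_year(movie_title):
    # Pattern from the question: first run of four digits starting with 1 or 2.
    return re.findall('[1-2][0-9]{3}', movie_title)[0]

def extract_trailing_year(movie_title):
    # Variant (an assumption, not from the question): only accept a year
    # wrapped in parentheses at the very end of the string.
    match = re.search(r'\((\d{4})\)\s*$', movie_title)
    return match.group(1) if match else None

print(extract_year('Toy Story (1995)'))                       # 1995
print(extract_year('2001: A Space Odyssey (1968)'))           # 2001
print(extract_trailing_year('2001: A Space Odyssey (1968)'))  # 1968
```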
Solution
This pretty much solves the problem:
.(?:[^\((0-9)])+
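You just need to strip the trailing space that the match leaves behind; it doesn’t capture the parenthesised year. Note that, as written, the pattern stops at the first parenthesis or digit, so it only returns the full title when the year is the only parenthesised part (as in title1).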
Will update this answer if I find something better.
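To see what the pattern above actually returns on the two sample titles, here is a small check using the standard `re` module. The helper name `leading_title` is hypothetical, and the call to `rstrip()` is my reading of the clean-up step mentioned above, so treat it as an assumption.

```python
import re

# Pattern exactly as posted above: one arbitrary character, then one or more
# characters that are neither a parenthesis nor a digit.
pattern = re.compile(r'.(?:[^\((0-9)])+')

def leading_title(movie_title):
    match = pattern.match(movie_title)
    # rstrip() drops the space left in front of the first '('.
    return match.group().rstrip() if match else None

title1 = 'Toy Story (1995)'
title2 = 'City of Lost Children, The (Cité des enfants perdus, La) (1995)'

print(leading_title(title1))  # Toy Story
print(leading_title(title2))  # City of Lost Children, The
```

So title1 comes out as wanted, while title2 loses its parenthesised alternate title.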
Another idea: if you’re sure the year appears at the end of every movie title, why not just remove that last part? That is, strip the trailing (xxxx) from every movie string you have.
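A minimal sketch of that idea, assuming every year is exactly four digits in parentheses at the end of the string. The names `TRAILING_YEAR` and `strip_trailing_year` are made up for this example.

```python
import re

# Optional whitespace, then a four-digit year in parentheses, anchored at the
# end of the string so earlier parentheses in the title are left alone.
TRAILING_YEAR = re.compile(r'\s*\(\d{4}\)\s*$')

def strip_trailing_year(movie_title):
    # Titles without a trailing "(dddd)" are returned unchanged.
    return TRAILING_YEAR.sub('', movie_title)

title1 = 'Toy Story (1995)'
title2 = 'City of Lost Children, The (Cité des enfants perdus, La) (1995)'

print(strip_trailing_year(title1))  # Toy Story
print(strip_trailing_year(title2))  # City of Lost Children, The (Cité des enfants perdus, La)
```

If the titles live in a pandas column, the same pattern could presumably be applied to the whole column with something like `df['movie_title'].str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)`, where `df` is a hypothetical name for your data frame.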