Python regular expression: get everything up to a pattern like “ (year)”

I have a data frame column called “movie_title” that contains the movie name and year. Here are the two types of movie titles in the column above.

title1='Toy Story (1995)'
title2='City of Lost Children, The (Cité des enfants perdus, La) (1995)'

I’d like to split it into two columns with the title and the year of release. I was able to successfully extract the year using the following regular expression:

re.findall('[1-2][0-9]{3}', string)[0]

I need help writing another regular expression that extracts the title (excluding the year and the parentheses around it).

For example, title1 and title2 should look like this:

title1='Toy Story'
title2='City of Lost Children, The (Cité des enfants perdus, La)'

Solution

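This gets most of the way there for simple titles such as title1:

.(?:[^\((0-9)])+

The match stops just before the first parenthesis or digit, so it does not capture the parentheses, but it does keep the space in front of them; you just need to strip that trailing space. Note that for title2 the match also stops at the first opening parenthesis, so the alternate title in parentheses is lost. Will update this answer if I find something better.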
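To see what the pattern above actually returns, here is a small check using only the standard re module. The helper name first_match is made up for illustration; the two sample titles come from the question.

```python
import re

title1 = 'Toy Story (1995)'
title2 = 'City of Lost Children, The (Cité des enfants perdus, La) (1995)'

# Pattern from the answer: one arbitrary character, followed by a run of
# characters that are neither parentheses nor digits.
pattern = re.compile(r'.(?:[^\((0-9)])+')

def first_match(title):
    """Return the first match with surrounding whitespace stripped (hypothetical helper)."""
    m = pattern.search(title)
    return m.group(0).strip() if m else None

print(first_match(title1))  # Toy Story
print(first_match(title2))  # City of Lost Children, The
```

The second result shows the limitation: the parenthesised alternate title is dropped because the match ends at the first "(".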

Another idea: if you’re sure the year always appears at the end of every movie title, why not simply strip it off? That is, remove the trailing “ (xxxx)” from every movie string you have.
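A minimal sketch of that idea, assuming every title ends in a four-digit year in parentheses. The names TITLE_YEAR and split_title_year are hypothetical, and titles that do not fit the assumed format are returned unchanged with None as the year.

```python
import re

# Assumed format: "<title> (<four-digit year>)", with the year at the very end.
# The lazy .*? group grows until the rest of the string is exactly " (yyyy)".
TITLE_YEAR = re.compile(r'^(?P<title>.*?)\s*\((?P<year>[12][0-9]{3})\)\s*$')

def split_title_year(movie_title):
    """Split a title into (title, year); year is None if no trailing year is found."""
    m = TITLE_YEAR.match(movie_title)
    if m is None:
        return movie_title.strip(), None
    return m.group('title'), int(m.group('year'))

print(split_title_year('Toy Story (1995)'))
print(split_title_year('City of Lost Children, The (Cité des enfants perdus, La) (1995)'))
```

Because the pattern is anchored to the end of the string, it keeps any earlier parentheses in the title and is not confused by digits inside the title itself: an illustrative title such as '2001: A Space Odyssey (1968)' yields the year 1968, whereas taking the first result of re.findall('[1-2][0-9]{3}', ...) would pick up 2001. For a data frame column, the same pattern could probably be passed to pandas' Series.str.extract, whose named groups become column names, though that is not tested here.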
