Python – Extract many URLs in a python data frame



I have a data frame with text that contains one or more URLs:

user_id          text
  1              blabla... http://amazon.com ... blabla
  1              blabla... http://nasa.com ... blabla
  2              blabla... https://google.com ... blabla ... https://yahoo.com ... blabla
  2              blabla... https://fnac.com ... blabla ...
  3              blabla....

I want to transform this data frame into a URL count per user ID:

 user_id          count_URL
    1               2 
    2               3
    3               0

Is there an easy way to perform this task in Python?

Here is the start of my code:

URL = pd.DataFrame(columns=['A','B','C','D','E','F','G'])

for i in range(data.shape[0]):
    for j in range(0, 8):
        URL.iloc[i, j] = re.findall(r"(?P<url>https?://[^\s]+)", str(data.iloc[i]))

Thanks

Lionel

Solution

In general, the definition of a URL is much more complex than in this example. Unless you are sure your URLs are very simple, you should look for a well-tested pattern.

import re
URLPATTERN = r'(https?://\S+)'  # Lousy, but...
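To see why this pattern is only a rough approximation: `\S+` greedily matches everything up to the next whitespace, so trailing punctuation gets captured along with the URL. A minimal illustration (the sample string here is made up):

```python
import re

URLPATTERN = r'(https?://\S+)'

text = "see http://amazon.com and also https://nasa.gov."
print(re.findall(URLPATTERN, text))
# ['http://amazon.com', 'https://nasa.gov.']  <- trailing dot is captured too
```

For counting URLs per row this sloppiness is usually harmless, but if you need the URLs themselves you would want a stricter pattern.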

First, extract the URLs from each string and count them:

df['urlcount'] = df.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()

Next, group the count by user ID:

df.groupby('user_id').sum()['urlcount']
#user_id
#1    2
#2    3
#3    0
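Putting both steps together on the sample data from the question (a minimal, self-contained sketch; the "blabla" filler stands in for the original text):

```python
import re
import pandas as pd

URLPATTERN = r'(https?://\S+)'

df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'text': [
        'blabla... http://amazon.com ... blabla',
        'blabla... http://nasa.com ... blabla',
        'blabla... https://google.com ... blabla ... https://yahoo.com ... blabla',
        'blabla... https://fnac.com ... blabla ...',
        'blabla....',
    ],
})

# findall returns a list of matches per row; .str.len() turns each list into its length.
df['urlcount'] = df.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()

# Sum the per-row counts for each user.
counts = df.groupby('user_id')['urlcount'].sum()
print(counts)  # user 1 -> 2, user 2 -> 3, user 3 -> 0
```

Note that user 3 correctly gets a count of 0: `findall` returns an empty list for rows without URLs, so the sum is just 0 rather than a missing value.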
