Extract many URLs in a python data frame
I have a data frame with text that contains one or more URLs:
user_id  text
1        blabla... http://amazon.com ... blabla
1        blabla... http://nasa.com ... blabla
2        blabla... https://google.com ... blabla ... https://yahoo.com ... blabla
2        blabla... https://fnac.com ... blabla ...
3        blabla....
I want to transform this data frame into a URL count per user ID:
user_id  count_URL
1        2
2        3
3        0
Is there an easy way to perform this task in Python?
Here is my attempt so far:

URL = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
for i in range(data.shape[0]):
    for j in range(0, 8):
        URL.iloc[i, j] = re.findall(r"(?P<url>https?://[^\s]+)", str(data.iloc[i]))
Thanks
Lionel
Solution
In general, a valid URL is much more complex than in this example. Unless you are sure your URLs are very simple, you should look for a well-tested pattern.
import re
URLPATTERN = r'(https?://\S+)' # Lousy, but...
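As a quick sanity check on that pattern (a minimal sketch; note that `\S+` greedily swallows trailing punctuation, which is why the pattern is "lousy"):

```python
import re

URLPATTERN = r'(https?://\S+)'

# A URL bounded by whitespace is extracted cleanly...
print(re.findall(URLPATTERN, "blabla http://amazon.com blabla"))
# ...but trailing punctuation is captured as part of the URL.
print(re.findall(URLPATTERN, "see https://example.com."))
```

If trailing punctuation matters for your data, trim it afterwards or use a stricter pattern.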
First, extract the URLs from each text and count them:
df['urlcount'] = df.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()
Then group by user ID and sum the counts:
df.groupby('user_id').sum()['urlcount']
#user_id
#1    2
#2    3
#3    0
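Putting the two steps together on the question's sample data (a self-contained sketch; the data frame construction below is reassembled from the table in the question):

```python
import re
import pandas as pd

URLPATTERN = r'(https?://\S+)'  # same lousy-but-simple pattern as above

# Rebuild the example data frame from the question
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'text': [
        'blabla... http://amazon.com ... blabla',
        'blabla... http://nasa.com ... blabla',
        'blabla... https://google.com ... blabla ... https://yahoo.com ... blabla',
        'blabla... https://fnac.com ... blabla ...',
        'blabla....',
    ],
})

# Per-row URL count: findall returns a list, .str.len() gives its length
df['urlcount'] = df.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()

# Per-user total; selecting the column before summing avoids
# pointlessly concatenating the text column
counts = df.groupby('user_id')['urlcount'].sum()
print(counts)
# user_id 1 -> 2, 2 -> 3, 3 -> 0
```

Selecting `['urlcount']` before `.sum()` is equivalent to the answer's `df.groupby('user_id').sum()['urlcount']` for this result, just cheaper on wide frames.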