Reddit PRAW API : Extracting entire JSON format
I'm using PRAW, the Python Reddit API Wrapper, for sentiment analysis. My code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import praw
from IPython import display
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from pprint import pprint
import pandas as pd
import nltk
import seaborn as sns
import datetime
sns.set(style='darkgrid', context='talk', palette='Dark2')
reddit = praw.Reddit(client_id='XXXXXXXXXXX',
                     client_secret='XXXXXXXXXXXXXXXXXXX',
                     user_agent='StackOverflow')
headlines = set()
results = []
sia = SIA()
for submission in reddit.subreddit('bitcoin').new(limit=None):
    pol_score = sia.polarity_scores(submission.title)
    pol_score['headline'] = submission.title
    # created_utc is a Unix timestamp in UTC, so convert it with an explicit UTC timezone
    readable = datetime.datetime.fromtimestamp(submission.created_utc, tz=datetime.timezone.utc).isoformat()
    results.append((submission.title, readable, pol_score["compound"]))
    display.clear_output()
Question A: With this code, I can only extract the title and a few other fields. I want to extract everything in JSON format, but I have gone through the documentation and haven't found whether that is possible.
If I just iterate over the submissions in reddit.subreddit('bitcoin'), each result shows only the ID. I want to extract all of the information and save it to a JSON file.
Question B: How do I extract comments/messages for a specific date?
Solution
Question A:
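You can simply add .json to the end of the full URL of the post to get the full JSON of that page, which includes the title, author, comments, votes, and everything else.
Note that submission.permalink returns only the path of the post (starting with /r/), so prefix it with https://www.reddit.com to build the full URL. You can then use requests to get the JSON for that page.
import requests
# submission.permalink is a relative path such as /r/<subreddit>/comments/<id>/<slug>/
url = 'https://www.reddit.com' + submission.permalink + '.json'
# Reddit tends to reject requests that use the default requests User-Agent
response = requests.get(url, headers={'User-Agent': 'StackOverflow'})
data = response.json()  # the parsed JSON for the post and its comments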
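If you would rather stay within PRAW, another option is to dump the attributes the submission object already holds. This is only a minimal sketch: it assumes vars(submission) exposes the fetched fields as a plain dict (which is the case in recent PRAW versions as far as I know), and submission_to_json is a made-up helper name, not part of PRAW. Private attributes are dropped, and values that json cannot handle, such as the author and subreddit objects, are converted to strings.

```python
import json


def submission_to_json(submission):
    """Serialize the public attributes of an object to a JSON string."""
    fields = {
        name: value
        for name, value in vars(submission).items()
        if not name.startswith('_')  # skip private attributes such as _reddit
    }
    # default=str converts anything json cannot encode (e.g. Redditor objects) to text
    return json.dumps(fields, default=str, indent=2)
```

Inside the loop from the question you could then write submission_to_json(submission) to a file, one file per post or one entry per line.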
Question B:
Unfortunately, Reddit removed timestamp search from its search API sometime last year. Here is the announcement post about it:
Besides some minor syntax differences, the most notable change is that searches by exact timestamp are no longer supported on the newer system. Limiting results to the past hour, day, week, month and year is still supported via the ?t= parameter (e.g. ?t=day)
Therefore, this cannot be done using PRAW at this time. But you can take a look at the Pushshift API, which provides this functionality.
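For recent dates there is also a client-side workaround: fetch the listing and keep only the items whose created_utc falls on the day you want. A minimal sketch, where created_on is a hypothetical helper and not part of PRAW. It only reaches as far back as the listing itself goes (Reddit listings are capped at roughly 1000 items), so it is no substitute for real timestamp search.

```python
from datetime import date, datetime, timezone


def created_on(created_utc, day):
    """Return True if a Unix timestamp (seconds, UTC) falls on the given UTC date."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).date() == day


# Hypothetical usage with the `reddit` instance from the question:
# target = date(2019, 1, 15)
# matches = [s for s in reddit.subreddit('bitcoin').new(limit=None)
#            if created_on(s.created_utc, target)]
```

The same check works for comments, since they carry a created_utc attribute as well.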