analitics

Thursday, June 20, 2019

Python 3.7.3 : Read and save RSS data from the Goodreads website.

Today I will show you how to parse data from www.goodreads.com using the feedparser module and save it all to a CSV file.
The Goodreads website comes with hundreds of great book recommendations from fellow readers and beloved authors, and lets you add your favorite books.
The main goal was to keep a structured link between the items in the RSS feed and the rows of the CSV file.
This issue was solved with a separate list for each type of data.
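The one-list-per-field idea can be sketched without any feed at all. This is only a minimal illustration: the helper name lists_to_csv and the sample values are made up for the example and are not part of the script in this post.

```python
import csv
import io

def lists_to_csv(dates, titles, authors, links, pages):
    """Return CSV text with one row per position across the lists."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    # zip pairs the n-th value of every list, so each row describes one book
    for row in zip(dates, titles, authors, links, pages):
        writer.writerow(row)
    return buffer.getvalue()

# made-up sample values, for illustration only
text = lists_to_csv(
    ["2019/06/20", "2019/06/18"],
    ["Book A", "Book B"],
    ["Author A", "Author B"],
    ["https://example.com/a", "https://example.com/b"],
    ["100", "250"],
)
print(text)
```

Note that zip stops at the shortest list, so this approach only keeps the rows aligned if every list receives exactly one value per item.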
First, let's install the feedparser Python module with the pip tool:
C:\Python373\Scripts>pip install feedparser
Collecting feedparser
...
Successfully built feedparser
Installing collected packages: feedparser
Successfully installed feedparser-5.2.1
You need to get the RSS link for your books from your Goodreads account.
The example is simple and has commented lines that make it easy to understand how I solved this issue.
This is the source code that reads all the RSS data and writes it to the CSV file:
import feedparser
import csv
bookread_rss = "your RSS with data account"
feeds = feedparser.parse(bookread_rss)
print ("aditional RSS data")
print (feeds['feed']['title'])
print (feeds['feed']['link'])
print (feeds.feed.subtitle)
print (len(feeds['entries']))
print (feeds.version)
print (feeds.headers)
print (feeds.headers.get('content-type'))
print ("read RSS items")
# empty arrays for values by type
dates = []
titles = []
authors = []
links = []
pages =[]

# create the name of the CSV file
file_csv = 'my_goodreads_books.csv'

# collect the values from every RSS item
#print(feeds)
for post in feeds.entries:
    date = "%d/%02d/%02d" % (post.published_parsed.tm_year,
        post.published_parsed.tm_mon,
        post.published_parsed.tm_mday)
    # uncomment to print each item on the console
    #print("___")
    #print("post date: " + date)
    #print("post title: " + post.title)
    #print("post author: " + post.author_name)
    #print("post link: " + post.link)
    #print("post pages: " + post.num_pages)

    dates.append(date)
    titles.append(post.title)
    authors.append(post.author_name)
    links.append(post.link)
    pages.append(post.num_pages)

# open the CSV file as UTF-8 to fix this error:
# UnicodeEncodeError: 'charmap' codec can't encode character '\u0435' in position
# 30: character maps to <undefined>
# the with block closes the file, so every row is flushed to disk
with open(file_csv, 'w', newline='', encoding="utf-8") as csv_file:
    csv_out = csv.writer(csv_file)
    # write one row per book, joining the lists by position
    for d, t, a, l, p in zip(dates, titles, authors, links, pages):
        csv_out.writerow((d, t, a, l, p))
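The loop above reads each field directly, so it would stop with an error if an item lacked one of them. As a hedged variation, assuming some items may come without a field such as num_pages (this post does not report that happening), a row can be built with .get and empty defaults. The helper name entry_to_row is hypothetical; it works with any dict-like entry, which includes the entries returned by feedparser.

```python
import time

def entry_to_row(entry):
    """Build one CSV row from a feed entry, tolerating missing fields."""
    parsed = entry.get("published_parsed")
    # published_parsed is a time.struct_time when the date could be parsed
    date = time.strftime("%Y/%m/%d", parsed) if parsed else ""
    return (
        date,
        entry.get("title", ""),
        entry.get("author_name", ""),
        entry.get("link", ""),
        entry.get("num_pages", ""),
    )
```

With this helper the five lists are not needed: each row could be written as soon as it is built, for example with a csv.writer's writerow(entry_to_row(post)) inside the loop.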
Running the script prints some information about the feed; here is my output:
C:\Python373>python bookreader_rss_001.py
aditional RSS data
Catalin's bookshelf: all
https://www.goodreads.com/review/list_rss/52019632?key=pyfTLqvJXpg-_ghi4a6ZTZfJVgLVXC8TcWyaBSyoiScgfXq3&shelf=%23ALL%23
Catalin's bookshelf: all
100
rss20
{'Server': 'Server', 'Date': 'Thu, 20 Jun 2019 12:22:55 GMT', 'Content-Type': 'application/xml; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'close', 'Status': '200 OK', 'X-Frame-Options': 'ALLOWALL', 'X-XSS-Protection': '1
...
All the data will be written to the my_goodreads_books.csv file.
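To check the result, the file can be read back with the csv module. This is a small sketch; the function name read_books is made up for the example, and it assumes the file was written as UTF-8 with the column order date, title, author, link, pages used by the script.

```python
import csv

def read_books(path):
    """Return the rows of the CSV file as a list of lists of strings."""
    # same encoding and newline settings that were used for writing
    with open(path, newline="", encoding="utf-8") as csv_file:
        return list(csv.reader(csv_file))
```

Calling read_books('my_goodreads_books.csv') after the script has run should return one list per book, and len() of the result can be compared with the number of entries printed by the script.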