Python 2 - Web Scraping with Regular Expressions


Uploaded by hermanmu on 04.10.2012

Transcript:
Today, I am going to show a quick and dirty way to parse an XML feed using the csv and re (regular expressions) modules.
The csv module creates a CSV file and writes the data from the XML feed to it,
while the regular expressions module parses the text from the feed.
I'll also be using the urllib library, which fetches the XML data from a URL.
Start by importing the required libraries and modules to make them available.
Next, I want to break this script up into six steps.
So start by reading in the XML feed.
I’ll name the variable xml and use the urlopen function, then paste in the feed URL.
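A minimal sketch of the read step, written for Python 3, where urlopen lives in urllib.request (in Python 2, which this video targets, the call is urllib.urlopen). The function name fetch_feed, the UTF-8 decoding, and the example URL are assumptions for illustration, not the feed used in the video.

```python
from urllib.request import urlopen


def fetch_feed(url):
    """Download the feed at `url` and return its body as text."""
    # urlopen returns a file-like response; read() yields raw bytes.
    with urlopen(url) as response:
        raw = response.read()
    # Assumption: the feed is UTF-8 encoded; undecodable bytes are replaced.
    return raw.decode("utf-8", errors="replace")


# Hypothetical usage (placeholder URL, not the feed from the video):
# xml = fetch_feed("https://example.com/feed.xml")
```

Wrapping the call in a function keeps the network request out of import time, so the rest of the script can be tried against saved text first.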
Now I need to parse the file and pull the article titles and their corresponding links.
Essentially, the (.*) characters match everything on a line between the opening and closing tags.
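As a rough illustration of how the (.*) group behaves, here is a small sketch. The tag names title and link are assumed (they are the usual RSS element names), and the sample strings are made up.

```python
import re

# Assumed tag names: RSS-style <title> and <link> elements.
title_pattern = re.compile(r"<title>(.*)</title>")
link_pattern = re.compile(r"<link>(.*)</link>")

line = "<title>First post</title>"
match = title_pattern.search(line)
captured = match.group(1)  # the text between the two tags

# Caveat: (.*) is greedy, so with two elements on one line it runs
# all the way to the LAST closing tag.
two = "<title>A</title><title>B</title>"
greedy = title_pattern.findall(two)               # one over-long match
lazy = re.findall(r"<title>(.*?)</title>", two)   # non-greedy: one match per element
```

By default the dot does not match a newline, so the greedy form is safe as long as each element sits on its own line. The non-greedy (.*?) form is the more defensive choice when that is not guaranteed.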
Now I need to actually find and store the information desired from the XML file.
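One way to find and store the matches is re.findall, which returns every captured group as a list. This is a sketch on a made-up sample standing in for the downloaded feed; the variable names are illustrative.

```python
import re

# Made-up sample standing in for the downloaded feed text.
xml = """<item>
<title>First post</title>
<link>http://example.com/1</link>
</item>
<item>
<title>Second post</title>
<link>http://example.com/2</link>
</item>"""

# findall returns the text captured by the group, one entry per match.
titles = re.findall(r"<title>(.*)</title>", xml)
links = re.findall(r"<link>(.*)</link>", xml)
```

Both lists come back in document order, so the title at a given position belongs with the link at the same position, provided every item has exactly one of each.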
Next, I need to iterate through the data, so I create a range.
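A sketch of the range idea, assuming the titles and links are held in two parallel lists (the sample values are made up). Sizing the range from the shorter list is a defensive choice added here, not something stated in the video.

```python
titles = ["First post", "Second post", "Third post"]
links = ["http://example.com/1", "http://example.com/2"]

# Use the shorter list's length so an index never runs past either list.
indexes = range(min(len(titles), len(links)))

# Each index picks the title and link at the same position.
pairs = [(titles[i], links[i]) for i in indexes]
```

If the indices themselves are not needed, zip(titles, links) produces the same pairs and also stops at the shorter list.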
I want to open the CSV file and write the headers.
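A minimal sketch of opening the file and writing the header row in Python 3. The file name articles.csv and the column names Title and URL are assumptions; a temporary directory is used so the sketch does not touch the working directory.

```python
import csv
import os
import tempfile

# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "articles.csv")

# newline="" lets the csv module control line endings itself (Python 3).
with open(out_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Title", "URL"])  # assumed column names
```

In Python 2 the file would instead be opened in binary mode ("wb") before being handed to csv.writer.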
Finally, I want a for loop to write the results to the CSV file, row by row.
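The final loop might look roughly like the sketch below. The sample data and column names are made up, and an in-memory io.StringIO buffer stands in for the open CSV file so the result can be inspected directly.

```python
import csv
import io

titles = ["First post", "Second, post"]
links = ["http://example.com/1", "http://example.com/2"]

# StringIO stands in for the open CSV file in this sketch.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Title", "URL"])  # assumed header row

# Write one row per title/link pair.
for i in range(min(len(titles), len(links))):
    writer.writerow([titles[i], links[i]])

csv_text = buffer.getvalue()
```

Note that the csv module quotes the second title automatically because it contains a comma, which is one reason to use csv.writer rather than joining the fields by hand.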