Google News Parser in Python
March 27, 2004 on 9:37 am | In Python |
NOTE:
Use at your own risk. Google News may not tolerate automated parsing of its web site.
In this article, I present a Python script that grabs the HTML from Google News and extracts links from specified sections.
The script makes use of the urllib and sgmllib libraries, both of which are included in the standard Python distribution.
One trick used is subclassing FancyURLopener so that the user-agent masquerades as a browser, since Google blocks attempts to parse its pages.
This is done in the following code:
class MyURLOpener(urllib.FancyURLopener):
    def __init__(self, *args):
        # version must be set before the base __init__, which copies it into the User-agent header
        self.version = "Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101"
        urllib.FancyURLopener.__init__(self, *args)
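On Python 3, where FancyURLopener is deprecated, the same idea can be sketched with urllib.request.Request. This is only a minimal illustration, not the script's code: the helper name build_request and the example URL are assumptions, and no request is actually sent.

```python
import urllib.request

# A browser-like user-agent string, modeled on the one used above.
USER_AGENT = "Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request carrying a custom User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# The URL is an assumption; nothing is fetched until urlopen() is called.
request = build_request("http://news.google.com/")
print(request.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen() would then send it with the masqueraded header. Note that Request stores header names in capitalized form, which is why the lookup uses "User-agent".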
Another neat addition is caching. The expiry time of the cache file is set by the constant CACHE_DELAY.
CACHE_DELAY = 600 # in seconds
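The caching code itself is not shown in this article, so what follows is only a rough sketch of how a file cache with an expiry time might work, written for Python 3. The names cache_is_fresh and load_page, and the fetch callable, are assumptions made for illustration.

```python
import os
import time

CACHE_DELAY = 600  # in seconds


def cache_is_fresh(path, max_age=CACHE_DELAY, now=None):
    """Return True if the cache file exists and is younger than max_age seconds."""
    if not os.path.exists(path):
        return False
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) < max_age


def load_page(url, cache_path, fetch):
    """Return the page text, calling fetch(url) only when the cache has expired.

    fetch is any callable that takes a URL and returns text (hypothetical hook).
    """
    if cache_is_fresh(cache_path):
        with open(cache_path, encoding="utf-8") as cache_file:
            return cache_file.read()
    text = fetch(url)
    with open(cache_path, "w", encoding="utf-8") as cache_file:
        cache_file.write(text)
    return text
```

With this arrangement, repeated runs within CACHE_DELAY seconds read the saved copy instead of hitting the site again, which also keeps the number of requests low.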
The parsing code is abstracted in the class GoogleNewsParser, a subclass of SGMLParser that, as its name implies, allows parsing the HTML code. The concept is simple. There are various tag handlers that are executed when the parser comes across the corresponding tags. For example:
def start_a(self, attrs):
    for k, v in attrs:
        # ignore all entries except those for the category we are looking for
        if k == "name" and v == self.currentCategory:
            self.categoryOn = 1
            break
        elif k == "name" and v != self.currentCategory and self.categoryOn == 1:
            # a different named anchor means our category has ended: stop parsing
            self.categoryOn = 0
            self.setnomoretags()
    if self.categoryOn == 1:
        for k, v in attrs:
            # look for external links. when we find one, we start reading its title
            if k == "href" and re.search(r"^#", v) is None and re.search(r"^/news?", v) is None and re.search(r"^\.\./", v) is None:
                self.currentAnchorHREF = v
                self.dataOn = 1
            else:
                self.currentAnchorHREF = ""
                self.dataOn = 0
And, later:
def end_a(self):
    if self.categoryOn == 1 and self.dataOn == 1:
        self.addUrl()
        self.dataOn = 0
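sgmllib is not available in Python 3, so here is a rough sketch of the same start-tag and end-tag handler idea using html.parser.HTMLParser instead. It is not the script's code: the class name SectionLinkParser, the sample HTML, and the simplified test for "external" links (absolute http or https URLs only) are all assumptions for illustration. Unlike the handlers above, it does not stop parsing when the section ends; it just switches collection off.

```python
from html.parser import HTMLParser


class SectionLinkParser(HTMLParser):
    """Collect (href, title) pairs for external links inside one named section."""

    def __init__(self, category):
        super().__init__()
        self.category = category  # value of the <a name="..."> that opens the section
        self.category_on = False  # True while inside the wanted section
        self.data_on = False      # True while inside an external link
        self.href = ""
        self.title = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        if name is not None:
            # A named anchor marks a section boundary.
            self.category_on = (name == self.category)
        href = attrs.get("href")
        # Simplification: only absolute http(s) URLs count as external.
        if self.category_on and href and href.startswith(("http://", "https://")):
            self.href = href
            self.title = []
            self.data_on = True

    def handle_data(self, data):
        # Accumulate the link text while an external link is open.
        if self.data_on:
            self.title.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.data_on:
            self.links.append((self.href, "".join(self.title).strip()))
            self.data_on = False


# Made-up HTML with three named sections, for demonstration only.
SAMPLE = """
<a name="WORLD"></a>
<a href="http://example.com/world">World story</a>
<a name="REGION"></a>
<a href="#top">Top</a>
<a href="http://example.com/local">Local <b>story</b></a>
<a name="SPORTS"></a>
<a href="http://example.com/game">Game</a>
"""

parser = SectionLinkParser("REGION")
parser.feed(SAMPLE)
parser.close()
for href, title in parser.links:
    print(href + ">>" + title)
```

Only the link between the REGION anchor and the next named anchor is collected, and the in-page "#top" link is skipped.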
Because of this abstraction, any program can use the parser with just a few lines of code:
# MAIN PROGRAM
parser = GoogleNewsParser()
parser.process(GOOGLE_URL, "REGION")
# Print the links
links = parser.getHrefs()
for k in links:
    print k[0] + ">>" + k[1]
The complete source code is here.
1 Comment »


Up until recently I was using a similar script written in PHP that generated an RSS feed. This feed was showing up on a public web page.
A week ago I received a message from Google asking me to remove the feed from the public page.
Does anyone know what regulations apply in this regard?
Thanks,
Marius
Comment by Marius Scurtescu — 28 March 2004 #