Google News Parser in Python

Use at your own risk. Google News may not tolerate automated parsing of its website.

In this article, I present a Python script that grabs the HTML from Google News and extracts links from specified sections.

The script makes use of the urllib and sgmllib libraries, both of which are part of the Python 2 standard library (sgmllib was removed in Python 3).

One quirk is that the script subclasses FancyURLopener so that it can masquerade its user-agent, since Google blocks scripted attempts to fetch its pages.

This is done in the following code:
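
As a rough Python 3 sketch of the same idea (not the script's original code): FancyURLopener is deprecated there, so this version sets the header on an opener built with urllib.request.build_opener instead of subclassing. The USER_AGENT value and the names make_opener and fetch are assumptions made for illustration.

```python
import urllib.request

# Hypothetical browser-like user-agent string; any similar value would do.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"


def make_opener(user_agent=USER_AGENT):
    """Return an opener that sends `user_agent` instead of urllib's default."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def fetch(url, opener=None):
    """Download `url` and return the body as text."""
    opener = opener or make_opener()
    with opener.open(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Nothing is downloaded until fetch is called with a URL.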

Another neat addition is caching. The expiry time of the cache file is set by the constant CACHE_DELAY.
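
A minimal sketch of how such a file cache could work, assuming freshness is judged by the cache file's modification time. Only the name CACHE_DELAY comes from the script; its value here, the CACHE_FILE path and both function names are hypothetical.

```python
import os
import time

CACHE_DELAY = 15 * 60                   # seconds; assumed value
CACHE_FILE = "google_news_cache.html"   # hypothetical file name


def cache_is_fresh(path=CACHE_FILE, max_age=CACHE_DELAY, now=None):
    """True if `path` exists and was modified less than `max_age` seconds ago."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) < max_age


def get_page(fetch, path=CACHE_FILE, max_age=CACHE_DELAY):
    """Return the cached HTML if it is fresh, else call `fetch()` and rewrite the cache."""
    if cache_is_fresh(path, max_age):
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = fetch()
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```

Passing the download step in as a callable keeps the cache logic independent of how the page is fetched.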

The parsing code is abstracted in the class GoogleNewsParser, a subclass of SGMLParser that, as its name implies, parses the HTML. The concept is simple: the class defines tag handlers, and the parser runs each one when it comes across the corresponding tag. For example:
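
To illustrate the handler idea in Python 3, where sgmllib is gone, here is a small sketch built on html.parser.HTMLParser. It is not the GoogleNewsParser code; the class name LinkCollector and the sample markup are made up. Where SGMLParser dispatches to per-tag methods, HTMLParser calls one handle_starttag method for every opening tag.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag; attrs is a list of (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


parser = LinkCollector()
parser.feed('<p><a href="http://example.com/a">A</a> <a name="x">no link</a></p>')
print(parser.links)
```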

And, later:
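
A later handler typically deals with closing tags. The sketch below (Python 3, html.parser) pairs a start handler with an end handler so that links are collected only inside one section. It assumes, purely for illustration, that a section is a div carrying a known id; the real Google News markup and the real handler names are not shown in this article.

```python
from html.parser import HTMLParser


class SectionLinkParser(HTMLParser):
    """Collects links only while inside <div id="section_id">."""

    def __init__(self, section_id):
        super().__init__()
        self.section_id = section_id
        self.depth = 0   # nesting depth of <div> tags inside the wanted section
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth:
                self.depth += 1          # nested div inside the section
            elif attrs.get("id") == self.section_id:
                self.depth = 1           # entering the section
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Leaving a div; once depth returns to 0 we are outside the section.
        if tag == "div" and self.depth:
            self.depth -= 1
```

Counting nested div tags lets the end handler tell when the section itself closes, as opposed to a div inside it.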

Because of this abstraction, a calling program only needs this simple code:
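
The calling side might look roughly like the sketch below. The interface of GoogleNewsParser is not shown in this article, so a tiny stand-in parser and a hypothetical extract_links function are used here to show the shape of the call in Python 3.

```python
from html.parser import HTMLParser


class _Links(HTMLParser):
    # Stand-in for the article's GoogleNewsParser, whose real interface is not shown.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)


def extract_links(html):
    """Hypothetical one-call entry point that hides the parser lifecycle."""
    parser = _Links()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<a href="http://example.com/story">Story</a>'
    for link in extract_links(sample):
        print(link)
```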

The complete source code is here.

One Reply to “Google News Parser in Python”

  1. Up until recently I was using a similar script written in PHP that generated an RSS feed. This feed was showing up on a public web page.

    A week ago I received a message from Google asking me to remove the feed from the public page.

    Does anyone know what regulations apply in this regard?


