Google News Parser in Python

Use at your own risk: Google News may not tolerate automated parsing of its web site.

In this article, I present a Python script that grabs the HTML from Google News and extracts links from specified sections.

The script makes use of the urllib and sgmllib modules, both of which are included in the Python standard library.

One of the tricks used is subclassing FancyURLopener so that the user agent can be masqueraded, since Google blocks scripted attempts to fetch its pages.

This is done in the following code:
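In current Python 3, FancyURLopener is deprecated and sgmllib has been removed, so here is a minimal sketch of the same user-agent trick using urllib.request instead. The USER_AGENT string and the build_request() and fetch() helpers are illustrative assumptions, not the names used in the original script.

```python
import urllib.request

# Assumed browser-like identity; any plausible user-agent string would do.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"


def build_request(url):
    """Return a Request that announces itself with USER_AGENT."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=10):
    """Download url and return its body decoded as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Without the custom header, urllib identifies itself as a Python client, which is exactly what the masquerade is meant to avoid.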

Another neat addition is caching. The expiry time of the cache file is set by the constant CACHE_DELAY.
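A rough sketch of how such a file cache might work is shown here. Only the constant name CACHE_DELAY comes from the script; its value, the helper names and the download callback are assumptions made for illustration.

```python
import os
import time

CACHE_DELAY = 15 * 60  # assumed expiry in seconds; the original value is not given


def cache_is_fresh(path, max_age=CACHE_DELAY, now=None):
    """True if path exists and was modified less than max_age seconds ago."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) < max_age


def cached_fetch(path, download, max_age=CACHE_DELAY):
    """Return the cached text if it is fresh; otherwise call download() and store it."""
    if cache_is_fresh(path, max_age):
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    text = download()
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return text
```

Passing the download step in as a function keeps the caching logic independent of how the page is actually fetched.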

The parsing code is abstracted in the class GoogleNewsParser, a subclass of SGMLParser that, as its name implies, allows parsing the HTML code. The concept is simple. There are various tag handlers that are executed when the parser comes across the corresponding tags. For example:

And, later:
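For readers on Python 3, where sgmllib is no longer available, the tag-handler idea can be sketched with html.parser.HTMLParser. The class name LinkParser and its attributes are hypothetical, and this sketch collects every link rather than only those in specified sections.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect (href, text) pairs for every <a> tag in the HTML that is fed in."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Called when the parser comes across an opening tag.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Text between tags; kept only while inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Called on the closing tag; the link is complete at this point.
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


parser = LinkParser()
parser.feed('<div><a href="http://example.com/a">First story</a></div>')
print(parser.links)
```

The handlers play the same role as the tag handlers of SGMLParser: the parser drives the process and the subclass only reacts to the tags it cares about.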

Because of the abstraction, only this simple code needs to be called from any program:

The complete source code is here.

Coming soon: nano production plants

In Michael Crichton’s _Prey_, Jack Forman faces sentient nano-machines that are built by engineered bacteria. Nano-machines are made by assembling billions of individual atoms, one at a time. Even if techniques improve to achieve a speed of one atom per second, a complete assembly will still require more than five years. This build time issue can be solved by exploiting the manufacturing chains of life itself, according to scientists.

This is exactly what researchers at Ludwig Maximilians University in Germany have managed. Using DNA, they have built a simple _hand_ that can grab and release a specific type of protein. Eventually, the same technique will be used to combine atoms in specific arrangements that will yield nano-machines.

In fact, researchers predict that simple constructions will be possible in two to five years, and advanced ones in five to ten years. The story of _Prey_ is, after all, not so far-fetched if we judge by this achievement.

Link: “DNA has nano building in hand”

Artificial Neural Network in PHP


This is a repost of the article that I wrote for FreeBSD.MU because I think it really belongs here.


Operation Flashpoint, published by CodeMasters, is one of my favourite games. It is very immersive thanks to an assortment of authentic weapons, vehicles and realistic sound effects. But, perhaps, its most important ingredient is the artificial intelligence engine. No wonder, then, that countless hours of gameplay have rekindled my dormant interest in artificial intelligence and artificial neural networks (ANNs).

In this article, I will document how I implemented an ANN using the PHP scripting language. Theories, formulas and respective proofs will not be covered; for details, please visit the links in the next section.

Download the source code here.

Neural networks

What is an artificial neural network?

An artificial neural network is a model of the organic brain. It attempts to reproduce the interactions between the neurons in the brain during the learning and thinking process. It works by applying mathematical formulae obtained from medical studies of how the actual brain works. For a more detailed definition, please see here.

Types of ANNs

There are several different types of ANNs. This implementation will model the feed-forward, multi-layer neural network. See here.


An ANN learns in the same way as the natural brain, that is, by reinforcing the connections between neurons. Several learning (or training) algorithms have been devised. The one my implementation uses is backpropagation, also known as BACKPROP. See here.

Getting more information

The reference is, without doubt, the FAQ.

PHP implementation

Choice of language

Several tutorials for developing ANNs are already available on the Internet. However, most of these cover the usual languages for such a task, that is, C, C++ and Java. Also, a procedural approach is very often adopted instead of an object-oriented one, even for the tutorials using Java.

I chose PHP to take advantage of its wide range of array manipulation functions and, thanks to the interpreted nature of PHP scripts, a shorter code-and-debug cycle while I was learning the algorithms.

This implementation makes extensive use of OOP techniques. I recommend that the reader familiarises himself or herself with these concepts before proceeding.


If you have skipped the theory explained at the web sites listed above, hopefully, the following will get you up to speed.

A multi-layer ANN consists of at least three layers: one input layer, at least one hidden layer and one output layer. There can be any number of hidden layers, each with any number of neurons (hidden neurons).

In a feed-forward ANN, each input is fed into each neuron of the first hidden layer, whose outputs are in turn fed into the neurons of the next layer, and so on, until the output neurons receive their inputs and produce the final outputs.

Additionally, a bias input may be fed into each layer for better results. Please see here for an explanation of the importance of the bias input.

Each input has a weight that is initially set to a random value — usually between -1.0 and 1.0; during training, the weights are adjusted using an error-correction algorithm until the ANN gives the desired output. In essence, the final weights make up the “knowledge” acquired by the ANN. It should be noted, however, that each set of weights will only work with an ANN having the same architecture (same number of inputs, layers, hidden neurons and outputs) as the one from which it was obtained.

The simplest definition of the output of an artificial neuron is the result obtained when the sum of its weighted inputs is passed through an activation function. In our case, the sigmoid function will be used. Its formula is

f(x) = 1 / (1 + exp(-x) )

where exp() is the exponential function, that is, exp(x) is e raised to the power x.

Therefore, given three inputs x1, x2 and x3, with weights w1, w2, w3, respectively, to a neuron, the output can be obtained as follows:

Step 1 – Calculating the sum of weighted inputs

sum = (x1 * w1) + (x2 * w2) + (x3 * w3)

Step 2 – Calculating the output

output = 1 / (1 + exp(-1 * sum) )
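The two steps above translate almost directly into code. Here is a small sketch (in Python rather than PHP, purely for illustration); the function names and the sample inputs and weights are made up.

```python
import math


def sigmoid(x):
    """The sigmoid function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights):
    """Step 1: sum of weighted inputs; Step 2: pass the sum through the sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(total)


# Three inputs x1, x2, x3 with weights w1, w2, w3 (arbitrary sample values).
out = neuron_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.25])
print(out)  # the weighted sum is 0.75, so this is sigmoid(0.75)
```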

Given an ANN with n output neurons, n outputs are expected. Each output is calculated using the formulae above to give the ANN’s final output — a vector of outputs, that is.

It should be noted that the above sigmoid function only outputs results between 0 and 1. Therefore, some kind of scaling should be applied to the results to give values outside this range.
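One simple possibility is a linear scale. This is only a sketch, and the helper names scale() and unscale() are hypothetical rather than part of the implementation.

```python
def scale(y, low, high):
    """Map a sigmoid output y in (0, 1) linearly onto the range (low, high)."""
    return low + y * (high - low)


def unscale(value, low, high):
    """Inverse mapping: turn a real-world target into a value in (0, 1)."""
    return (value - low) / (high - low)


print(scale(0.5, -10.0, 10.0))    # the midpoint of the range
print(unscale(25.0, 0.0, 100.0))  # a quarter of the way through the range
```

The inverse mapping matters during training, because the desired outputs must be expressed in the same 0 to 1 range that the sigmoid can produce.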

An ANN using the BACKPROP algorithm is trained by repeatedly feeding a set of inputs into it and adjusting its weights according to the discrepancy between the actual outputs and the desired outputs. The iteration continues until an acceptable discrepancy is reached.

The adjustment (or weight change) for each input is proportional to its value. So,

weight change = learning rate * input * delta

The learning rate is an arbitrary value that dictates how fast the network should learn. The delta is the rate of change of the discrepancy with respect to the output for the neuron; it is determined by using the delta rule. For a general definition, see:

The delta for an output neuron is easily obtained using the following formula.

delta = actual output * (1 - actual output) * (desired output - actual output)
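As a worked illustration of the output delta and the weight change formula, here is a short sketch with made-up numbers; the function names are hypothetical.

```python
def output_delta(actual, desired):
    """delta = actual * (1 - actual) * (desired - actual)."""
    return actual * (1.0 - actual) * (desired - actual)


def adjust_weights(weights, inputs, delta, learning_rate):
    """Add learning_rate * input * delta to each weight."""
    return [w + learning_rate * x * delta for w, x in zip(weights, inputs)]


delta = output_delta(actual=0.5, desired=1.0)  # 0.5 * 0.5 * 0.5
new_weights = adjust_weights([0.2, 0.4], [1.0, 0.0], delta, learning_rate=0.5)
print(delta, new_weights)
```

Note that the weight attached to an input of 0 does not change at all, which is what "proportional to its value" means in practice.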

Calculation of the delta for a hidden neuron is more complex because it depends on the delta values of the neurons in the next layer (the layer it feeds), which have already been computed because the adjustment proceeds from the output layer back to the input layer.

To calculate the delta for a neuron which feeds its output to n neurons in the next layer, the following steps are required.

Step 1 – Calculate the sum of the products of the weight [for the output] and the delta of each of the n neurons

sum += weight_k * delta_k

for each of the n neurons (k = 1 to n), where delta_k is the delta for the k-th neuron and weight_k is the weight that neuron applies to this neuron's output.

Step 2 – Calculate the delta for the hidden neuron

delta = actual output * (1 - actual output) * sum
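The two steps for a hidden neuron can be sketched as follows. The function name and the (weight, delta) pair representation are assumptions chosen to keep the example short.

```python
def hidden_delta(actual, downstream):
    """Delta for a hidden neuron.

    downstream holds one (weight, delta) pair per neuron in the next layer:
    the weight that neuron applies to this neuron's output, and its delta.
    """
    total = sum(weight * delta for weight, delta in downstream)  # Step 1
    return actual * (1.0 - actual) * total                       # Step 2


# A hidden neuron with output 0.5 feeding two neurons in the next layer.
d = hidden_delta(0.5, [(0.4, 0.125), (-0.2, 0.05)])
print(d)  # 0.25 * (0.4 * 0.125 - 0.2 * 0.05)
```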

The concepts and formulae above are sufficient for a successful implementation.


The entire implementation consists of four classes.


This class provides two static methods, random() and sigmoid().

random() generates random numbers within the limits specified
sigmoid() implements the sigmoid function described earlier


This class abstracts a neuron. It holds an array of inputs and weights; the output; and the calculated delta.

The output is calculated by calling the activate() method and read by calling the getOutput() accessor method. The method setDelta() sets the delta for the neuron, and adjustWeights() adjusts the weights according to the delta and the learning rate.
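To make the shape of such a class concrete, here is a minimal Python sketch. It mirrors the method names described above in snake_case; the constructor arguments and the set_inputs() method are assumptions, not a transcription of the PHP class.

```python
import math
import random


class Neuron:
    """Holds inputs, weights, the last output and the calculated delta."""

    def __init__(self, n_inputs, rng=random):
        # Weights start as random values between -1.0 and 1.0.
        self.weights = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
        self.inputs = [0.0] * n_inputs
        self.output = 0.0
        self.delta = 0.0

    def set_inputs(self, inputs):
        self.inputs = list(inputs)

    def activate(self):
        total = sum(x * w for x, w in zip(self.inputs, self.weights))
        self.output = 1.0 / (1.0 + math.exp(-total))

    def get_output(self):
        return self.output

    def set_delta(self, delta):
        self.delta = delta

    def adjust_weights(self, learning_rate):
        # weight change = learning rate * input * delta
        self.weights = [w + learning_rate * x * self.delta
                        for w, x in zip(self.weights, self.inputs)]
```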


This class abstracts a network layer. It contains a vector of neurons and outputs. It also provides functions to calculate the deltas of each neuron according to the type of the layer. In the case of an output layer, the function calculateOutputDeltas() is used; in the case of a hidden layer, the function calculateHiddenDeltas() is used. These two functions set the delta of each neuron of the layer. The method activate() activates each neuron in turn. The accessor method getOutputs() returns the outputs of all the neurons as a vector; these are then either fed into the next layer’s neurons, or returned as the network output.
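The forward half of such a layer (activate() and getOutputs()) might look like the following sketch. For brevity it stores each neuron as a plain weight vector instead of a neuron object, which is an assumption of this example, and it leaves out the delta calculations.

```python
import math


class Layer:
    """A vector of neurons, each represented here by its weight vector."""

    def __init__(self, weights):
        self.weights = weights  # weights[k] belongs to neuron k
        self.outputs = [0.0] * len(weights)

    def activate(self, inputs):
        # Activate each neuron in turn with the same input vector.
        self.outputs = [
            1.0 / (1.0 + math.exp(-sum(x * w for x, w in zip(inputs, ws))))
            for ws in self.weights
        ]

    def get_outputs(self):
        return list(self.outputs)


hidden = Layer([[0.0, 0.0], [1.0, -1.0]])
output = Layer([[0.0, 0.0]])
hidden.activate([1.0, 1.0])
output.activate(hidden.get_outputs())  # one layer's outputs feed the next
print(output.get_outputs())
```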


This class abstracts the artificial neural network. The constructor takes arguments for the number of hidden layers, the number of neurons per hidden layer and the number of outputs.

The most important methods of this class are setInputs(), train(), activate() and getOutputs(). The network takes a vector of values as input and outputs a vector of values as output.

The methods save() and load() save the network architecture and weights and load a stored network, respectively.
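The file format used by save() and load() is not described here, so the following is only a sketch of the idea using JSON. The function names and the nested-list layout (layers, then neurons, then weights) are assumptions; with that layout the architecture is implied by the shape of the lists.

```python
import json


def save_network(path, layers):
    """Write the weights, and with them the architecture, to a file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"layers": layers}, fh)


def load_network(path):
    """Read back a network stored by save_network()."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["layers"]
```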

Installation and usage

Contents of archive

The archive contains the following files.

nn.php – the ANN implementation classes
XOR_Training.php – the sample script for training an XOR network
XOR_Run.php – the sample script to evaluate XOR operations
xor.dat – the saved XOR network architecture and weights

Creating a neural network

To create an ANN, you will need to include the file nn.php. Using the classes, you can structure your ANN as you wish. Once you have trained your network, you can save it to a file.

To use your network for evaluations, you need to restore the network from the saved file, feed it with inputs and get the output.


Training is achieved by feeding a well-prepared set of inputs and desired outputs to the network.

For example,

It is recommended that the network be saved at regular intervals to avoid restarting the training from scratch each time.
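To show what an XOR training run involves, here is a compact Python sketch of backpropagation on a network with two inputs, four hidden neurons and one output. It is not the nn.php code: the list-based representation, the function names, the learning rate, the number of passes and the random seed are all assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_neurons, n_inputs, rng):
    # One weight per input plus a trailing bias weight, all in [-1, 1].
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs + 1)]
            for _ in range(n_neurons)]


def forward(layers, inputs):
    """Return the outputs of every layer; element 0 is the input vector."""
    acts = [list(inputs)]
    for layer in layers:
        prev = acts[-1] + [1.0]  # append the bias input
        acts.append([sigmoid(sum(w * x for w, x in zip(ws, prev)))
                     for ws in layer])
    return acts


def train_step(layers, inputs, desired, rate):
    acts = forward(layers, inputs)
    deltas = [o * (1 - o) * (d - o) for o, d in zip(acts[-1], desired)]
    for i in range(len(layers) - 1, -1, -1):
        prev = acts[i] + [1.0]
        if i > 0:
            # Deltas for the layer below, using the weights before they change.
            below = [o * (1 - o) * sum(ws[j] * dk
                                       for ws, dk in zip(layers[i], deltas))
                     for j, o in enumerate(acts[i])]
        for ws, delta in zip(layers[i], deltas):
            for j, x in enumerate(prev):
                ws[j] += rate * x * delta
        if i > 0:
            deltas = below


def mean_squared_error(layers, samples):
    total = 0.0
    for x, desired in samples:
        out = forward(layers, x)[-1]
        total += sum((d - o) ** 2 for d, o in zip(desired, out))
    return total / len(samples)


XOR = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]

rng = random.Random(1)
layers = [make_layer(4, 2, rng), make_layer(1, 4, rng)]
before = mean_squared_error(layers, XOR)
for _ in range(5000):
    for x, d in XOR:
        train_step(layers, x, d, rate=0.5)
after = mean_squared_error(layers, XOR)
print(before, after)
```

Printing the error before and after training is a quick way to check that the network is learning; a real script would stop once the error falls below a chosen threshold.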


To evaluate a set of inputs, the network needs to be loaded from the file and fed with the inputs, and then the activate() method must be called. The output is obtained by calling the getOutputs() method.

For example,
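Since the contents of xor.dat are not shown, the sketch below uses hand-picked weights to stand in for a loaded network, which is enough to illustrate the evaluation step. The NETWORK structure and the evaluate() function are hypothetical.

```python
import math

# Hand-picked weights: two inputs plus a trailing bias weight per neuron.
NETWORK = [
    [[20.0, 20.0, -10.0],    # hidden neuron 1: roughly "x1 OR x2"
     [20.0, 20.0, -30.0]],   # hidden neuron 2: roughly "x1 AND x2"
    [[20.0, -20.0, -10.0]],  # output neuron: "OR but not AND", that is, XOR
]


def evaluate(layers, inputs):
    """Feed the inputs forward through every layer and return the outputs."""
    values = list(inputs)
    for layer in layers:
        values = values + [1.0]  # bias input
        values = [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(ws, values))))
                  for ws in layer]
    return values


for a in (0, 1):
    for b in (0, 1):
        out = evaluate(NETWORK, [a, b])[0]
        print(a, b, round(out))  # rounded to the nearest of 0 and 1
```

The raw outputs are never exactly 0 or 1, so rounding (or some other threshold) is needed to read them as logical values.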


Fixed bug caused by weights not being initialised in the hidden layers.


Exception handling or result code

In response to comments on my article entitled “EJB Exception Handling” about whether _exception handling_ or _result codes_ are the better way to indicate errors, here are my opinions on the matter.

Put simply, I prefer the solution that is less intrusive and more natural. In this regard, exception handling has the upper hand in my book.

With exception handling, a programmer can concentrate on implementing the business logic without having to worry about checking for errors at every step of a program. This task is left to be handled by the JVM. This works very well whether the programmer has taken care of handling the errors or not. Simply speaking, when an error occurs, an exception is raised and the JVM checks if there is code to handle it. If there is, that code is executed; if not, the JVM handles the exception in a default way. The program does not proceed and, therefore, does not become inconsistent.

On the other hand, with result codes, a programmer is forced to check the outcome of every function call to make sure that the expected result is obtained and the application can go to the next step. If all the necessary checks are there, everything is fine. However, if some are missing and an unexpected result code is returned, it will not be handled and the next step in the program will be executed regardless, potentially resulting in a disaster.
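The difference can be illustrated with a small sketch, written in Python rather than Java to keep it short. The withdrawal scenario and all the names in it are hypothetical.

```python
class InsufficientFunds(Exception):
    """Raised when a withdrawal exceeds the available balance."""


def withdraw_rc(balance, amount):
    """Result-code style: returns (ok, new_balance)."""
    if amount > balance:
        return False, balance
    return True, balance - amount


def withdraw_exc(balance, amount):
    """Exception style: returns the new balance or raises."""
    if amount > balance:
        raise InsufficientFunds(f"cannot withdraw {amount} from {balance}")
    return balance - amount


# Result code: the programmer forgets to check `ok`, so the next step runs.
ok, balance = withdraw_rc(100, 250)
shipped_rc = True  # executed although the withdrawal failed

# Exception: the next step is skipped automatically when the error occurs.
shipped_exc = False
try:
    balance = withdraw_exc(100, 250)
    shipped_exc = True  # never reached
except InsufficientFunds:
    pass

print(ok, shipped_rc, shipped_exc)
```

In the first case, nothing stops the program from carrying on in an inconsistent state; in the second, the failure cannot be ignored by accident.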

So, if I am writing a program in a language that provides exception handling, that is what I will use. If the language does not, then using result codes will work just fine, but with a serious impact on development speed as more checks will have to be implemented.

As I wrote in my reply to one comment, both methods are used in a lot of applications and they work well nonetheless. Exception handling seems to be the standard in languages that support it. It all boils down to how maintainable and readable a programmer wants his or her code to be. By adhering to the standard, he/she has a better chance of being understood by others.