How can I read RSS?
Now let's get down to actually reading these sample RSS feeds from Python. The first thing we'll need to do is download some RSS feeds. This is simple in Python; most distributions come with both a URL retrieval library and an XML parser. (Note to Mac OS X 10.2 users: your copy of Python does not come with an XML parser; you will need to install
PyXML first.)
from xml.dom import minidom
import urllib
def load(rssURL):
return minidom.parse(urllib.urlopen(rssURL))
This takes the URL of an RSS feed and returns a parsed representation of the DOM, as native Python objects.
The next bit is the tricky part. To compensate for the differences in RSS formats, we'll need a function that searches for specific elements in any number of namespaces. Python's XML library includes a
getElementsByTagNameNS which takes a namespace and a tag name, so we'll use that to make our code general enough to handle RSS 0.9x/2.0 (which has no default namespace), RSS 1.0 and even RSS 0.90. This function will find
all elements with a given name, anywhere within a node. That's a good thing; it means that we can search for
item elements within the root node and always find them, whether they are inside or outside the
channel element.
DEFAULT_NAMESPACES = \
(None, # RSS 0.91, 0.92, 0.93, 0.94, 2.0
'http://purl.org/rss/1.0/', # RSS 1.0
'http://my.netscape.com/rdf/simple/0.9/' # RSS 0.90
)
def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
for namespace in possibleNamespaces:
children = node.getElementsByTagNameNS(namespace, tagName)
if len(children): return children
return []
Finally, we need two utility functions to make our lives easier. First, our
getElementsByTagName function will return a list of elements, but most of the time we know there's only going to be one. An
item only has one
title, one
link, one
description, and so on. We'll define a
first function that returns the first element of a given name (again, searching across several different namespaces). Second, Python's XML libraries are great at parsing an XML document into nodes, but not that helpful at putting the data back together again. We'll define a
textOf function that returns the entire text of a particular XML element.
def first(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
children = getElementsByTagName(node, tagName, possibleNamespaces)
return len(children) and children[0] or None
def textOf(node):
return node and "".join([child.data for child in node.childNodes]) or ""
That's it. The actual parsing is easy. We'll take a URL on the command line, download it, parse it, get the list of items, and then get some useful information from each item:
DUBLIN_CORE = ('http://purl.org/dc/elements/1.1/',)
if __name__ == '__main__':
import sys
rssDocument = load(sys.argv[1])
for item in getElementsByTagName(rssDocument, 'item'):
print 'title:', textOf(first(item, 'title'))
print 'link:', textOf(first(item, 'link'))
print 'description:', textOf(first(item, 'description'))
print 'date:', textOf(first(item, 'date', DUBLIN_CORE))
print 'author:', textOf(first(item, 'creator', DUBLIN_CORE))
print
Running it with
our sample RSS 0.91 feed prints only title, link, and description (since the feed didn't include any other information on dates or authors):
$ python rss1.py
http://www.xml.com/2002/12/18/examples/rss091.xml.txt title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date:
author:
title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
date:
author:
title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
date:
author:
For both the
sample RSS 1.0 feed and
sample RSS 2.0 feed, we also get dates and authors for each
item. We reuse our custom
getElementsByTagName function, but pass in the Dublin Core namespace and appropriate tag name. We could reuse this same function to extract information from any of the basic RSS modules. (There are a few advanced modules specific to RSS 1.0 that would require a full RDF parser, but they are not widely deployed in public RSS feeds.)
Here's the output against
our sample RSS 1.0 feed:
$ python rss1.py
http://www.xml.com/2002/12/18/examples/rss10.xml.txt title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date: 2002-12-04
author: Will Provost
title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
date: 2002-12-04
author: Priya Lakshminarayanan
title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
date: 2002-12-04
author: Antoine Quint
Running against
our sample RSS 2.0 feed produces the same results.
This technique will handle about 90% of the RSS feeds out there; the rest are ill-formed in a variety of interesting ways, mostly caused by non-XML-aware publishing tools building feeds out of templates and not respecting basic XML well-formedness rules. Next month we'll tackle the thorny problem of how to handle RSS feeds that are almost, but not quite, well-formed XML.