pullparser

A simple "pull API" for HTML parsing, modelled on Perl's HTML::TokeParser. Many straightforward HTML parsing tasks are easier to write this way than with the callback-based HTMLParser module. pullparser.PullParser is a subclass of HTMLParser.HTMLParser.

Examples:

This program extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the <a>...</a> tags:

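import pullparser, sys
f = file(sys.argv[1])
p = pullparser.PullParser(f)
# tags("a") yields both start and end tags for <a>
for token in p.tags("a"):
    if token.type == "endtag": continue
    # Links without an href attribute are printed as "-"
    url = dict(token.attrs).get("href", "-")
    # Read the link text up to the closing </a>
    text = p.get_compressed_text(endat=("endtag", "a"))
    print "%s\t%s" % (url, text)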
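
For comparison, here is a rough sketch of the same link extraction written in the callback style. It is not part of pullparser: it uses only html.parser from the Python 3 standard library (the renamed HTMLParser module), and the names LinkCollector and extract_links are made up for illustration. The whitespace handling only approximates what the name get_compressed_text suggests.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Callback-style link extractor (hypothetical name, stdlib only)."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (url, text) pairs
        self._in_link = False  # True while inside an open <a> element
        self._href = None      # href of the <a> currently open
        self._chunks = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href", "-")
            self._chunks = []

    def handle_data(self, data):
        if self._in_link:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._in_link = False


def extract_links(html):
    """Return a list of (url, text) pairs for every <a> element in html."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

The parser state (whether a link is open, its href, the text gathered so far) has to be carried between callbacks by hand. In the pull version above, the same information simply lives in local variables inside the loop.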

This program extracts the <title> from the document:

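import pullparser, sys
f = file(sys.argv[1])
p = pullparser.PullParser(f)
# get_tag("title") skips ahead to the next <title> tag, if there is one
if p.get_tag("title"):
    # Text following the <title> start tag
    title = p.get_compressed_text()
    print "Title: %s" % title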
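
To make the idea of a pull API more concrete, the following is a minimal sketch of how one could be layered on top of a callback parser. It is an illustration under stated assumptions, not pullparser's actual implementation: it targets html.parser from Python 3, parses the whole input up front rather than lazily, and the class TinyPullParser, its tuple-shaped tokens, and its get_text method are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser


class TinyPullParser(HTMLParser):
    """Toy pull-style wrapper (hypothetical; not the pullparser API)."""

    def __init__(self, html):
        super().__init__()
        self._tokens = deque()  # queue of (type, value, attrs) tuples
        # Assumption: the whole document is parsed eagerly into the queue.
        self.feed(html)
        self.close()

    # The callbacks only record tokens; the caller pulls them later.
    def handle_starttag(self, tag, attrs):
        self._tokens.append(("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self._tokens.append(("endtag", tag, None))

    def handle_data(self, data):
        self._tokens.append(("data", data, None))

    def get_tag(self, *names):
        """Discard tokens until a tag named in names (or any tag) is found."""
        while self._tokens:
            token = self._tokens.popleft()
            is_tag = token[0] in ("starttag", "endtag")
            if is_tag and (not names or token[1] in names):
                return token
        return None

    def get_text(self):
        """Return the text up to the next tag, with whitespace collapsed."""
        parts = []
        while self._tokens and self._tokens[0][0] == "data":
            parts.append(self._tokens.popleft()[1])
        return " ".join("".join(parts).split())
```

With this sketch, the title example becomes: create TinyPullParser(html), call get_tag("title"), and if it returns a token, call get_text(). The caller decides when to ask for the next token, which is what distinguishes the pull style from callbacks.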

Thanks to Gisle Aas, who wrote HTML::TokeParser.

Download

All documentation (including this web page) is included in the distribution.

Stable release.

The release only appears to be a beta because I forgot to remove the 'b' from the version string in the last release.

For installation instructions, see the INSTALL file included in the distribution.

See also

Beautiful Soup is widely recommended and is more robust than this module.

FAQs

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, July 2005.