analitics

Pages

Wednesday, June 2, 2010

The HTMLParser module - just simple example

Basically this module is for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
On HTMLParse docs.
You will see the same example but with no explanation. The example is :

import HTMLParser
from HTMLParser import *
import urllib2 
from urllib2 import urlopen

class webancors(HTMLParser):
 def __init__(self, url):
  HTMLParser.__init__(self)
  r = urlopen(url)
  self.feed(r.read())

 def handle_starttag(self, tag, attrs):
  if tag == 'a' and attrs:
   print "Link: %s" % attrs[0][1]
I named the python file : spiderweb.py
I use python to import this file:

>>> import spiderweb
>>> spiderweb.webancors('http://www.yahoo.com')
Link: y-mast-sprite y-mast-txt web
Link: y-mast-link images
Link: y-mast-link video
Link: y-mast-link local
Link: y-mast-link shopping
Link: y-mast-link more
Link: p_13838465-sa-drawer
Link: y-hdr-link

>>> 
The method handle_starttag takes two arguments from HTMLParser.
This arguments, tag and attrs is used to return values.
Note :
The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.
Use "http://" not just "www". If don't use "http://" you see errors.
Seam urllib2 have some troubles with:

  File "/usr/lib/python2.5/urllib2.py", line 241, in get_type
    raise ValueError, "unknown url type: %s" % self.__original

You can use all functions HTTParser class.