The HTMLParser module parses text files formatted in HTML (HyperText Markup Language) and XHTML.
In the HTMLParser docs you will see the same example, but with no explanation. The example is:
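from HTMLParser import HTMLParser
from urllib2 import urlopen

class webancors(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        # Download the page and feed its HTML to the parser.
        r = urlopen(url)
        self.feed(r.read())

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. attrs[0][1] is the value
        # of the first attribute, which is not always the href.
        if tag == 'a' and attrs:
            print "Link: %s" % attrs[0][1]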
I saved the code in a file named spiderweb.py and imported it from the Python interpreter:
>>> import spiderweb
>>> spiderweb.webancors('http://www.yahoo.com')
Link: y-mast-sprite y-mast-txt web
Link: y-mast-link images
Link: y-mast-link video
Link: y-mast-link local
Link: y-mast-link shopping
Link: y-mast-link more
Link: p_13838465-sa-drawer
Link: y-hdr-link
>>>
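HTMLParser calls the handle_starttag method for every opening tag it finds and passes it two arguments: tag, the lowercased tag name, and attrs, a list of (name, value) pairs holding the tag's attributes. In the output above each line shows the value of the first attribute of an <a> tag, which on this page is the class and not the link target.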
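To print the actual link targets, the href attribute can be looked up by name instead of by position. The following is a minimal sketch, not part of the original example: it is written for Python 3 (where the module is html.parser), the class name LinkCollector is made up for illustration, and it parses a small inline HTML string so it runs without a network connection.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs, so convert it to a dict
        # and look up href by name rather than relying on attribute order.
        href = dict(attrs).get("href")
        if href is not None:
            self.links.append(href)


# Sample markup assumed for illustration only.
sample = (
    '<p><a class="nav" href="http://example.com/">Example</a> '
    '<a name="top">anchor without href</a></p>'
)
parser = LinkCollector()
parser.feed(sample)
print(parser.links)
```

The second <a> tag has no href, so it is skipped and only the first link is collected.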
Note:
The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.
Always pass the full URL, including "http://", not just "www". Without the scheme, urllib2 raises a ValueError:
File "/usr/lib/python2.5/urllib2.py", line 241, in get_type
raise ValueError, "unknown url type: %s" % self.__original
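One way to avoid this error is to add the scheme before opening the URL. This is only a sketch: ensure_scheme is a hypothetical helper, not something urllib2 provides, and it simply assumes that any string without "://" is a bare host name that should be fetched over HTTP.

```python
def ensure_scheme(url, default="http"):
    """Return url unchanged if it already has a scheme, else prefix one.

    Hypothetical helper: it only checks for "://" and does no other
    validation of the URL.
    """
    if "://" in url:
        return url
    return default + "://" + url


print(ensure_scheme("www.yahoo.com"))
print(ensure_scheme("http://www.yahoo.com"))
```

The returned string can then be passed to urlopen in place of the raw user input.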
You can also override the other handler methods of the HTMLParser class, such as handle_endtag and handle_data.
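As an illustration of combining several handlers, the sketch below collects the text inside <h1> tags. It is written for Python 3 (html.parser), the class name HeadingParser and the sample markup are assumptions made for this example, and it keeps a simple flag rather than handling nested or malformed tags.

```python
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collect the text of every <h1> element (hypothetical example class)."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            # Start a new heading and remember that we are inside it.
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current heading.
        if self.in_heading:
            self.headings[-1] += data


parser = HeadingParser()
parser.feed("<html><body><h1>Intro</h1><p>Text</p><h1>Usage</h1></body></html>")
print(parser.headings)
```

handle_starttag and handle_endtag switch the flag on and off, and handle_data only keeps text while the flag is set.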