[Record] Trying lxml to parse HTML in Python
[Background]
In Python, I had always used BeautifulSoup to parse HTML.
Later I heard that BeautifulSoup is slow while lxml parses HTML quickly, so I decided to give lxml a try.
[Process]
1. Went to the lxml homepage and read its blurb: lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
Introduction
The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.4 to 3.3. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ.
It supports Python 3.3 as well, which is nice.
2. The download page lists quite a few versions; I picked the latest one:
This machine is Win7 x64, so I downloaded the matching installer:
Then went ahead and installed lxml:
3. Next up: figure out how to use lxml to parse HTML.
So, first find a normal HTML page to work with, such as the one involved in my earlier tutorial:
4. Then, after consulting a pile of tutorials, the following code came together and runs correctly:

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
-------------------------------------------------------------------------------
[Function]
[Record] Trying lxml to parse HTML in Python
https://www.crifan.com/python_try_lxml_parse_html
[Date]
2013-05-27
[Author]
Crifan Li
[Contact]
https://www.crifan.com/contact
-------------------------------------------------------------------------------
"""
#---------------------------------import---------------------------------------
import urllib2;
from lxml import etree;
#------------------------------------------------------------------------------
def main():
    """
    Demo Python use lxml to extract/parse html
    """
    userMainUrl = "http://www.songtaste.com/user/351979/";
    req = urllib2.Request(userMainUrl);
    resp = urllib2.urlopen(req);
    respHtml = resp.read();
    #print "respHtml=",respHtml; # you should see the output html
    #for more methods, refer to:
    #[Tutorial] Crawl a website and extract the needed info from its pages, Python version
    #https://www.crifan.com/crawl_website_html_and_extract_info_using_python/
    print "Method 3: Use lxml to extract info from html";
    # the target to extract: <h1 class="h1user">crifan</h1>
    #dom = etree.fromstring(respHtml);
    htmlElement = etree.HTML(respHtml);
    print "htmlElement=",htmlElement; #<Element html at 0x...>
    h1userElement = htmlElement.find(".//h1[@class='h1user']");
    print "h1userElement=",h1userElement; #<Element h1 at 0x...>
    print "type(h1userElement)=",type(h1userElement); #<type 'lxml.etree._Element'>
    print "dir(h1userElement)=",dir(h1userElement);
    # dir(h1userElement)= ['__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__',
    # '__init__', '__iter__', '__len__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__su
    # bclasshook__', '_init', 'addnext', 'addprevious', 'append', 'attrib', 'base', 'clear', 'extend', 'find', 'findall', 'findtext', 'get', 'getchildren', 'getiterator', 'getnext', 'get
    # parent', 'getprevious', 'getroottree', 'index', 'insert', 'items', 'iter', 'iterancestors', 'iterchildren', 'iterdescendants', 'iterfind', 'itersiblings', 'itertext', 'keys', 'make
    # element', 'nsmap', 'prefix', 'remove', 'replace', 'set', 'sourceline', 'tag', 'tail', 'text', 'values', 'xpath']
    print "h1userElement.text=",h1userElement.text; #crifan
    attributes = h1userElement.attrib;
    print "attributes=",attributes; #{'class': 'h1user'}
    print "type(attributes)=",type(attributes); #<type 'lxml.etree._Attrib'>
    classKeyValue = attributes["class"];
    print "classKeyValue=",classKeyValue; #h1user
    print "type(classKeyValue)=",type(classKeyValue); #<type 'str'>
    tag = h1userElement.tag;
    print "tag=",tag; #h1
    innerHtml = etree.tostring(h1userElement);
    print "innerHtml=",innerHtml; #<h1 class="h1user">crifan</h1>

###############################################################################
if __name__=="__main__":
    main();
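The core extraction can also be reproduced offline. The sketch below feeds a hard-coded snippet, shaped like the target `<h1 class="h1user">` element on that page, to the same `find()` call; it uses the stdlib ElementTree as a stand-in when lxml is absent. Note that only lxml's `etree.HTML()` tolerates real-world, non-well-formed HTML; the stdlib parser requires well-formed markup.

```python
try:
    from lxml import etree
    parse_html = etree.HTML  # lenient HTML parser (lxml only)
except ImportError:
    import xml.etree.ElementTree as etree
    parse_html = etree.fromstring  # stdlib: requires well-formed markup

# Hard-coded stand-in for the fetched page (assumption: same shape as the live page)
html = "<html><body><h1 class='h1user'>crifan</h1></body></html>"
root = parse_html(html)

h1 = root.find(".//h1[@class='h1user']")  # first match by path + attribute predicate
print(h1.tag)           # h1
print(h1.text)          # crifan
print(dict(h1.attrib))  # {'class': 'h1user'}
```

The `[@class='h1user']` attribute predicate is part of the limited XPath subset that both implementations support in `find()`/`findall()`; full XPath needs lxml's `xpath()` method.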
Some notes and takeaways:
(1) About lxml.etree._Element
Explanations of all of its APIs can be found in:
Among them is exactly what we need: find(self, path, namespaces=None)
Finds the first matching subelement, by tag name or path.
Properties
attrib
Element attribute dictionary. Where possible, use get(), set(), keys(), values() and items() to access element attributes.
text
Text before the first subelement. This is either a string or the value None, if there was no text.
and so on.
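These accessors are easy to exercise on a tiny tree; a quick sketch (stdlib ElementTree used as a compatible stand-in if lxml is unavailable):

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

el = etree.fromstring("<h1 class='h1user' id='t'>crifan<span>!</span></h1>")

# attrib behaves like a dict, but get()/keys()/items() are the recommended accessors
print(el.get("class"))    # h1user
print(sorted(el.keys()))  # ['class', 'id']

# .text is only the text BEFORE the first subelement, not all contained text
print(el.text)                 # crifan
print("".join(el.itertext()))  # crifan!
```

The `.text` / `itertext()` distinction is the one that most often trips people up when moving from BeautifulSoup's `get_text()`.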
(2) For an explanation of find(), see:
(3) Also tried the approach from another reference:
it turned out to additionally require the BeautifulSoup library in order to run.
[Summary]
Parsing HTML with lxml in Python still requires learning quite a bit of the lxml library's usage, including the corresponding ElementTree API and so on.
Of these:
1. For tutorials, worth consulting:
2. For the API, worth consulting: