[Web Scraping from Scratch] A Detailed Guide to the BeautifulSoup Library

小灰灰 2022-04-24 07:20

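# Recap #

The previous post, on regular expressions, ended with a hands-on scraping exercise: collecting every book, link, author, and publication date from the Douban home page. There we parsed the page source with regular expressions, and those expressions turned out to be fairly complex. That is the motivation for today's subject, BeautifulSoup: a flexible and convenient HTML parsing library that is efficient and supports several parsers. With BeautifulSoup you can extract information from a web page without writing any regular expressions.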

# I. Installing BeautifulSoup #

**pip install beautifulsoup4**

# II. Usage #

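### 1. Parsers ###

<table> 
 <thead> 
  <tr> 
   <th>Parser</th> 
   <th>Usage</th> 
   <th>Advantages</th> 
   <th>Disadvantages</th> 
  </tr> 
 </thead> 
 <tbody> 
  <tr> 
   <td>Python standard library</td> 
   <td>BeautifulSoup(markup, "html.parser")</td> 
   <td>Built into Python; moderate speed; tolerant of malformed documents</td> 
   <td>Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2</td> 
  </tr> 
  <tr> 
   <td>lxml HTML parser</td> 
   <td>BeautifulSoup(markup, "lxml")</td> 
   <td>Fast; tolerant of malformed documents; the usual choice</td> 
   <td>Requires the lxml package (a C extension)</td> 
  </tr> 
  <tr> 
   <td>lxml XML parser</td> 
   <td>BeautifulSoup(markup, "xml")</td> 
   <td>Fast; the only parser here that supports XML</td> 
   <td>Requires the lxml package (a C extension)</td> 
  </tr> 
  <tr> 
   <td>html5lib</td> 
   <td>BeautifulSoup(markup, "html5lib")</td> 
   <td>Best fault tolerance; parses the document the way a browser does; produces HTML5-style output</td> 
   <td>Slow; requires installing the html5lib package (pure Python, no C extension)</td> 
  </tr> 
 </tbody> 
</table>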

### 2. Basic Usage ###

Below is an incomplete HTML document: neither the body tag nor the html tag is closed.

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

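Next, parse the HTML above with the lxml parser.

    from bs4 import BeautifulSoup  # import the library
    soup = BeautifulSoup(html, 'lxml')  # create the soup object and choose the parser
    print(soup.prettify())  # pretty-print the tree; the parser has filled in the missing tags
    print(soup.title.string)  # print the text inside the title tag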

Below is the output: the markup after the parser has filled in the missing tags, followed by the title text. Both the html and body tags have been closed:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    The Dormouse's story
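
How broken markup gets repaired depends on the parser. The following is a minimal sketch that feeds one invalid fragment to each HTML parser from the table above. The fragment `<a></p>` and the helper name `parse_with` are illustrative assumptions, not part of the original example; parsers that are not installed are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup (illustrative): an unclosed <a> and a stray </p>.
markup = "<a></p>"


def parse_with(parser_name):
    """Return the repaired markup as a string, or None if the parser is missing."""
    try:
        return str(BeautifulSoup(markup, parser_name))
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        return None


results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, repaired in results.items():
    print(name, "->", repaired if repaired is not None else "(not installed)")
```

html.parser ships with Python, so its entry is always present: it drops the stray closing tag and closes the link. lxml and html5lib, when installed, additionally wrap the result in an html/body skeleton, which is one reason the same scraper can behave differently after a parser change.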

### 3. Tag Selectors ###

#### (1) Selecting Elements ####

We continue with the HTML above.

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title)
    print(type(soup.title))
    print(soup.head)
    print(soup.p)

The result is:

    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    <head><title>The Dormouse's story</title></head>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

The output contains only one p tag, even though the HTML has three.  
**Characteristic of tag selectors: when several tags match, only the first one is returned.**
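
The "first match only" behaviour can be checked directly. This is a small sketch on made-up markup (the `doc` string is an assumption for illustration), using html.parser so that it runs without lxml; `find_all` is covered in a later section.

```python
from bs4 import BeautifulSoup

# Three paragraphs; the tag selector should only ever see the first one.
doc = (
    '<p class="title">first</p>'
    '<p class="story">second</p>'
    '<p class="story">third</p>'
)
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.p              # tag selector: first <p> only
all_p = soup.find_all("p")    # every <p> in the document

print(first_p.string)
print(len(all_p))
```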

#### (2) Getting Attributes ####

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.attrs['name'])
    print(soup.p['name'])

Output:

> dromouse  
> dromouse
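
Two details of attribute access are easy to trip over: `class` comes back as a list because it can hold several values, and indexing a missing attribute raises `KeyError`. A minimal sketch, assuming a made-up paragraph with two classes:

```python
from bs4 import BeautifulSoup

# Illustrative markup: one paragraph with two classes and a name attribute.
doc = '<p class="title big" name="dromouse">text</p>'
p = BeautifulSoup(doc, "html.parser").p

name_value = p["name"]       # plain string
class_value = p["class"]     # multi-valued attribute, returned as a list
missing = p.get("id")        # .get() returns None instead of raising KeyError
has_id = p.has_attr("id")    # explicit existence check

print(name_value, class_value, missing, has_id)
```

Using `.get()` is the safer choice when scraping pages where an attribute may be absent on some elements.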

#### (3) Getting Content ####

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.string)

Output:

> The Dormouse's story
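
`.string` works above because the p tag has a single child whose own `.string` is the text. When a tag has several children, `.string` is `None`, and `get_text()` is the usual alternative. A short sketch on assumed sample markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup: the first <p> has one child, the second has several.
doc = (
    "<p><b>The Dormouse's story</b></p>"
    "<p>Once upon a time <a>Lacie</a> and <a>Tillie</a></p>"
)
soup = BeautifulSoup(doc, "html.parser")
first, second = soup.find_all("p")

single = first.string        # text of the only descendant chain
mixed = second.string        # None: more than one child, so .string is undefined
joined = second.get_text()   # concatenates all text inside the tag

print(repr(single), repr(mixed), repr(joined))
```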

#### (4) Nested Selection ####

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.head.title.string)

Output:

> The Dormouse's story

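#### (5) Getting Child and Descendant Nodes ####

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.contents)

The output is a list. Note that this output, like the remaining outputs in this section, comes from a variant of the HTML in which the story paragraph is the first p tag and the first link wraps `<span>Elsie</span>`; with the HTML defined earlier, `soup.p` is the title paragraph and `contents` is `[<b>The Dormouse's story</b>]`.

    ['\n            Once upon a time there were three little sisters; and their names were\n            ', 
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>,
     '\n'
    , <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    , ' \n            and\n            '
    , <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    , '\n            and they lived at the bottom of a well.\n        ']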

Another way to get the children is the children attribute, which returns an iterator rather than a list:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.children)
    for i, child in enumerate(soup.p.children):
        print(i, child)

Output:

    <list_iterator object at 0x1064f7dd8>
    0 
                Once upon a time there were three little sisters; and their names were
    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2 
    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4  
                and
    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6 
                and they lived at the bottom of a well.
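
The heading also mentions descendant nodes. The sketch below contrasts `children` (direct children only) with `descendants` (the whole subtree). The `variant` markup is a compact reconstruction inferred from the outputs above, not the author's exact HTML, and html.parser is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed variant: the first link wraps a <span>, as the outputs above suggest.
variant = (
    '<p class="story">Once upon a time; '
    '<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(variant, "html.parser")

children = list(soup.p.children)        # direct children: text nodes and <a> tags
descendants = list(soup.p.descendants)  # every node below <p>, depth first

# Keep only tags to make the difference easy to see.
child_tags = [node.name for node in children if isinstance(node, Tag)]
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]

print(child_tags)
print(descendant_tags)
```

The span shows up only among the descendants, because it is a grandchild of the p tag.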

#### (6) Getting the Parent Node ####

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)

The program prints the p tag, which is the parent of the first a tag:

    <p class="story">
                Once upon a time there were three little sisters; and their names were
                <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
                and
                <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
                and they lived at the bottom of a well.
            </p>

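Similar attributes include:

*  parents: iterates over all ancestors of the current tag
*  next\_siblings: iterates over the sibling nodes that follow the current tag
*  previous\_siblings: iterates over the sibling nodes that precede the current tag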
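
The three attributes in the list above can be exercised together. This is a minimal sketch on made-up markup (the `doc` string and the choice of the middle link are assumptions); `find`, used here to pick the starting tag, is introduced in the next section.

```python
from bs4 import BeautifulSoup

# Illustrative markup: three sibling links with no whitespace between them.
doc = (
    '<div><p>'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
    '</p></div>'
)
soup = BeautifulSoup(doc, "html.parser")
middle = soup.find(id="link2")

ancestors = [tag.name for tag in middle.parents]            # innermost first
after = [tag["id"] for tag in middle.next_siblings]         # siblings that follow
before = [tag["id"] for tag in middle.previous_siblings]    # siblings that precede

print(ancestors, after, before)
```

The last ancestor is the BeautifulSoup object itself, whose name is `[document]`. On real pages the sibling iterators also yield the whitespace text nodes between tags, so filtering for tags is usually needed.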

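**That covers tag selectors.** They are fast, but on their own they are not flexible enough for most HTML parsing tasks, so BeautifulSoup also provides several other methods.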

### 4. Standard Selectors ###

**find\_all( name , attrs , recursive , text , \*\*kwargs )**  
Searches the document by tag name, attributes, or text content.  
All the examples below use the following test HTML:

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''

**(1) Searching by tag name (name)**

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all('ul'))
    print(type(soup.find_all('ul')[0]))

All ul tags are printed:

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    <class 'bs4.element.Tag'>

Each result is a Tag, so the search can be nested:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.find_all('ul'):
        print(ul.find_all('li'))
        # Going one step further, read an attribute of an li: ul.find_all('li')[0]['class']

**(2) Searching by attribute**

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    # Attributes can be passed as a dict through attrs ...
    print(soup.find_all(attrs={'id': 'list-1'}))
    # ... or as keyword arguments; class is a Python keyword, so use class_
    print(soup.find_all(id='list-1'))
    print(soup.find_all(class_='element'))

Output:

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
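
Two further points about attribute searches are worth a quick check: `class_` matches any single class of a multi-class element, and attributes whose names are not valid Python identifiers (such as `data-*`) have to go through `attrs`. A sketch under assumed markup; the `data-kind` attribute is hypothetical and does not appear in the test HTML above.

```python
from bs4 import BeautifulSoup

# Illustrative markup; data-kind is a made-up attribute for this sketch.
doc = (
    '<ul class="list" id="list-1" data-kind="main"><li>Foo</li></ul>'
    '<ul class="list list-small" id="list-2"><li>Bar</li></ul>'
)
soup = BeautifulSoup(doc, "html.parser")

# class_ matches when any one of the element's classes equals the value.
by_class = [ul["id"] for ul in soup.find_all(class_="list")]
by_small = [ul["id"] for ul in soup.find_all(class_="list-small")]

# data-kind cannot be written as a keyword argument, so pass it via attrs.
by_data = [ul["id"] for ul in soup.find_all(attrs={"data-kind": "main"})]

print(by_class, by_small, by_data)
```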

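**(3) Searching by text content (text)**

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text='Foo'))

Output:

> \['Foo', 'Foo'\]

The result consists of strings rather than tags, so it is of limited use for locating elements and serves mostly for matching content. Since Beautiful Soup 4.4.0 the same argument is also available under the name string.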
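
A text search becomes more useful once it is combined with a tag name or a regular expression. The following sketch assumes Beautiful Soup 4.4.0 or later (for the `string` argument) and uses made-up list items:

```python
import re

from bs4 import BeautifulSoup

# Illustrative markup: "Food" is included to show prefix matching.
doc = (
    '<ul><li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="element">Food</li></ul>'
)
soup = BeautifulSoup(doc, "html.parser")

exact = [str(s) for s in soup.find_all(string="Foo")]                 # exact text match
prefix = [str(s) for s in soup.find_all(string=re.compile("^Foo"))]   # regex match
owners = [s.parent.name for s in soup.find_all(string="Foo")]         # step back to the tag
tags = soup.find_all("li", string="Foo")                              # returns tags directly

print(exact, prefix, owners, tags)
```

Passing a tag name together with `string` returns the tags whose text matches, which is usually what a scraper wants.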

**find( name , attrs , recursive , text , \*\*kwargs )**  
Works like find\_all, except that find returns only the first matching element, or None when nothing matches.
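
The different "nothing found" results of the two methods matter in practice, because chaining a call onto `None` raises `AttributeError`. A minimal sketch on assumed markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup with no <table> in it.
doc = '<ul id="list-1"><li>Foo</li><li>Bar</li></ul>'
soup = BeautifulSoup(doc, "html.parser")

first_li = soup.find("li")                 # first match
limited = soup.find_all("li", limit=1)     # same element, but wrapped in a list
no_table = soup.find("table")              # None when nothing matches
no_tables = soup.find_all("table")         # empty list when nothing matches

# Guard before chaining, since no_table may be None.
table_text = no_table.get_text() if no_table is not None else ""

print(first_li, limited, no_table, no_tables, repr(table_text))
```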

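**find\_parents() find\_parent()**  
find\_parents() returns all matching ancestors; find\_parent() returns the nearest one (with no filter, the direct parent).

**find\_next\_siblings() find\_next\_sibling()**  
find\_next\_siblings() returns all matching siblings that follow the node; find\_next\_sibling() returns the first of them.

**find\_previous\_siblings() find\_previous\_sibling()**  
find\_previous\_siblings() returns all matching siblings that precede the node; find\_previous\_sibling() returns the first of them.

**find\_all\_next() find\_next()**  
find\_all\_next() returns all matching nodes after the current node; find\_next() returns the first of them.

**find\_all\_previous() find\_previous()**  
find\_all\_previous() returns all matching nodes before the current node; find\_previous() returns the first of them.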
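
The navigation variants above all take the same filters as find\_all. The sketch below starts from the middle item of a made-up list (the markup and ids are assumptions) and walks in each direction:

```python
from bs4 import BeautifulSoup

# Illustrative markup: a list of three items followed by a paragraph.
doc = (
    '<div id="box"><ul>'
    '<li id="a">A</li><li id="b">B</li><li id="c">C</li>'
    '</ul><p id="after">tail</p></div>'
)
soup = BeautifulSoup(doc, "html.parser")
b = soup.find(id="b")

parent_div = b.find_parent("div")["id"]                               # nearest <div> ancestor
next_li = b.find_next_sibling("li")["id"]                             # first following sibling
prev_lis = [t["id"] for t in b.find_previous_siblings("li")]          # all preceding siblings
later_ids = [t["id"] for t in b.find_all_next(id=True)]               # tags with an id, after b
earlier_ids = [t["id"] for t in b.find_all_previous(id=True)]         # tags with an id, before b

print(parent_div, next_li, prev_lis, later_ids, earlier_ids)
```

Note that find\_all\_previous() follows document order backwards, so it also reaches enclosing tags such as the outer div, not only earlier siblings.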

### CSS Selectors ###

Pass a CSS selector string directly to select() to make a selection.

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    # Elements with class panel-heading inside elements with class panel; class names are prefixed with '.'
    print(soup.select('.panel .panel-heading'))
    print(soup.select('ul li'))  # tag selection: li tags inside ul tags
    print(soup.select('#list-2 .element'))  # '#' selects by id: elements with class element inside the element with id list-2
    print(type(soup.select('ul')[0]))

Output:

    [<div class="panel-heading">
    <h4>Hello</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    <class 'bs4.element.Tag'>
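
Beyond descendant selectors, select() accepts the usual CSS combinators and attribute selectors, and select\_one() returns just the first match. A short sketch on assumed markup, using html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Illustrative markup, loosely modelled on the test HTML above.
doc = (
    '<div class="panel-body">'
    '<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>'
    '</div>'
)
soup = BeautifulSoup(doc, "html.parser")

first = soup.select_one(".element")                                    # first match, or None
small = [li.get_text() for li in soup.select("ul.list-small > li")]    # child combinator
by_attr = [ul["id"] for ul in soup.select('ul[id^="list-"]')]          # attribute prefix selector
missing = soup.select_one("table")                                     # no match

print(first, small, by_attr, missing)
```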

Selections can also be nested, although this is rarely necessary, since separating selectors with a space already expresses nesting:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul.select('li'))

Output:

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]

#### Reading attributes and content from the selected elements ####

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    # Get an attribute
    for ul in soup.select('ul'):
        print(ul['id'])  # or: print(ul.attrs['id'])
    # Get the text content
    for li in soup.select('li'):
        print(li.get_text())
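
In a real scraper the attribute and text lookups are usually combined into one record per element. The following is a minimal sketch of that pattern; the markup, the `records` list, and its `text`/`href` keys are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the last item has no link, to exercise the missing case.
doc = (
    '<ul>'
    '<li class="element"><a href="/foo"> Foo </a></li>'
    '<li class="element"><a href="/bar">Bar</a></li>'
    '<li class="element">No link</li>'
    '</ul>'
)
soup = BeautifulSoup(doc, "html.parser")

records = []
for li in soup.select("li.element"):
    link = li.find("a")  # None when the item has no link
    records.append({
        "text": li.get_text(strip=True),  # strip=True trims surrounding whitespace
        "href": link.get("href") if link is not None else None,
    })

for record in records:
    print(record)
```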

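### Summary ###

*  Prefer the lxml parser; fall back to html.parser when necessary
*  Tag selectors offer weak filtering but are fast
*  Use find() and find\_all() to match a single result or multiple results
*  If you are comfortable with CSS selectors, select() is **convenient**
*  Remember the common ways to read attributes and text values

For more on BeautifulSoup, see its official documentation.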
