Web Scraping: The pyquery Library

刺骨的言语ヽ痛彻心扉 · 2022-10-03 00:48

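Official documentation: https://pyquery.readthedocs.io/en/latest/

PyQuery is a powerful and flexible library for parsing web pages. If you find regular expressions too tedious to write and BeautifulSoup's syntax too hard to remember, and you already know jQuery's syntax, PyQuery is an excellent choice.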

1. Getting started

Initializing from a string:

    from pyquery import PyQuery as pq

    d = pq("<html>Hahaha</html>")  # d now plays the role of jQuery's $
    print(d("html"))

Initializing from a URL:

    from pyquery import PyQuery as pq

    d = pq(url="https://www.baidu.com")
    print(d("head"))

Initializing from a file:

    from pyquery import PyQuery as pq

    d = pq(filename='demo.html')  # filename is the path to the file
    print(d("head"))

2. Basic CSS selectors

    from pyquery import PyQuery as pq

    html = """
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    """
    d = pq(html)
    print(d("#container .list li"))

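3. Finding elements

Child elements

    d("CSS selector").find("li")

find() searches all descendants of the matched elements, not only their direct children.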

    from pyquery import PyQuery as pq

    html = """
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    """
    d = pq(html)
    items = d(".list")
    print(type(items))  # <class 'pyquery.pyquery.PyQuery'>
    li = items.find("li")
    print(type(li))  # <class 'pyquery.pyquery.PyQuery'>
    print(li)
    """
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    """

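Parent elements

    d("CSS selector").parent(<optional CSS selector>)

parent() returns only the direct parent, while parents() returns every ancestor. The examples here use parents().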

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    """
    d = pq(html)
    items = d(".list")
    parents = items.parents()
    print(parents)
    """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    """

d(".list").parents() returns every ancestor of the matched element.

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    """
    d = pq(html)
    items = d(".list")
    parents = items.parents(".wrap")
    print(parents)
    """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    """

d(".list").parents(".wrap") returns only the ancestors that match the selector.

Sibling elements

    d("CSS selector").siblings(<optional CSS selector>)

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    """
    d = pq(html)
    li = d(".list .item-0.active")
    print(li.siblings())
    """
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0">first item</li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    """

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    """
    d = pq(html)
    li = d(".list .item-0.active")
    print(li.siblings(".active"))
    """
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    """

4. Traversal

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    """
    d = pq(html)
    li = d("li").items()
    print(type(li))  # <class 'generator'>
    for i in li:
        print(i)
    """
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    """

5. Extracting information

Getting attributes

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    """
    d = pq(html)
    a = d(".item-0.active a")
    print(a.attr("href"))  # link3.html
    print(a.attr.href)  # link3.html

Getting text

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    """
    d = pq(html)
    a = d(".item-0.active a")
    print(a.text())
    """
    third item
    """

Getting HTML

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    """
    d = pq(html)
    li = d(".item-0.active")
    print(li)
    print(li.html())
    """
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <a href="link3.html"><span class="bold">third item</span></a>
    """

6. DOM manipulation

addClass(), removeClass()

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    """
    d = pq(html)
    li = d(".item-0.active")
    print(li)
    li.removeClass("active")
    print(li)
    li.addClass("active")
    print(li)
    """
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    """

attr(), css()

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    """
    d = pq(html)
    li = d(".item-0.active")
    print(li)
    li.attr("name", "link")
    print(li)
    li.css("font-size", "14px")
    print(li)
    """
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>
    """

remove()

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
    Hello, World.
    <p>This is a paragraph.</p>
    </div>
    """
    d = pq(html)
    wrap = d(".wrap")
    print(wrap.text())
    """
    Hello, World.
    This is a paragraph.
    """
    wrap.find("p").remove()
    print(wrap.text())  # Hello, World.

Other DOM methods

https://pyquery.readthedocs.io/en/latest/api.html

7. Pseudo-class selectors

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
    <div id="container">
    <ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
    </div>
    </div>
    """
    d = pq(html)
    li = d("li:first-child")
    print(li)  # <li class="item-0">first item</li>
    li = d("li:last-child")
    print(li)  # <li class="item-0"><a href="link5.html">fifth item</a></li>
    li = d("li:nth-child(2)")
    print(li)  # <li class="item-1"><a href="link2.html">second item</a></li>
    li = d("li:gt(2)")  # indexes start at 0; matches elements with index greater than 2
    print(li)
    """
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    """
    li = d("li:nth-child(2n)")  # even-numbered elements (counting from 1)
    print(li)
    """
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    """
    li = d("li:contains(second)")  # match by text: elements whose text contains "second"
    print(li)  # <li class="item-1"><a href="link2.html">second item</a></li>

More selectors: http://www.w3school.com.cn/cssref/css_selectors.asp

Reposted from: https://www.cnblogs.com/believepd/p/10657877.html
