A Crawler Practice Based on crawler4j, jsoup, and javacsv

  1. crawler4j Basics
    crawler4j is an open-source Java crawler project; its official address is:
    http://code.google.com/p/crawler4j/
    Using crawler4j comes down to two steps:
    implement a crawler class that extends WebCrawler;
    drive that class through CrawlController.
    WebCrawler is an abstract class; a subclass must implement two of its methods, shouldVisit and visit:
    shouldVisit decides whether a given URL should be crawled (visited);
    visit processes the data of the page that the URL points to; its parameter, Page, wraps all of that page's data.
    WebCrawler also exposes other methods that can be overridden, named much like Android callbacks. For example, getMyLocalData returns data held by the WebCrawler, and onBeforeExit is called just before the WebCrawler finishes running, which is a good place to release resources. A minimal subclass might look like the sketch below.
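    The following is only a minimal sketch of such a subclass, based on the crawler4j 3.5 API used later in this article; the class name BasicCrawler, the example.com domain filter, and the println output are placeholders, not part of the original example:

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class BasicCrawler extends WebCrawler {

        // Decide whether a discovered URL should be fetched; here, stay on one (placeholder) domain.
        @Override
        public boolean shouldVisit(WebURL url) {
            return url.getURL().toLowerCase().startsWith("http://www.example.com/");
        }

        // Called after a page has been fetched and parsed; Page wraps all of the page's data.
        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData data = (HtmlParseData) page.getParseData();
                System.out.println(page.getWebURL().getURL() + " -> " + data.getTitle());
            }
        }
    }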
    By comparison, invoking CrawlController is fairly formulaic. Its typical invocation code looks like this:

    String crawlStorageFolder = "data/crawl/root";
    int numberOfCrawlers = 7;

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder(crawlStorageFolder);

    /*
     * Instantiate the controller for this crawl.
     */
    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

    /*
     * For each crawl, you need to add some seed urls. These are the first
     * URLs that are fetched and then the crawler starts following links
     * which are found in these pages
     */
    controller.addSeed("http://www.ics.uci.edu/~welling/");
    controller.addSeed("http://www.ics.uci.edu/~lopes/");
    controller.addSeed("http://www.ics.uci.edu/");

    /*
     * Start the crawl. This is a blocking operation, meaning that your code
     * will reach the line after this only when crawling is finished.
     */
    controller.start(MyCrawler.class, numberOfCrawlers);

CrawlController is multi-threaded out of the box: the second argument of start, numberOfCrawlers, is the number of crawler threads run concurrently.
Also, because CrawlController instantiates the WebCrawler by reflection (the last line of the code above), the WebCrawler implementation class must have a no-argument constructor; a constructor with parameters will never be used. Assigning values to private members of the implementation class therefore has to go through static methods; see the Image Crawler example that ships with crawler4j, outlined in the sketch below.
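The sketch below is only a rough outline of that pattern under my own assumptions (the class name ConfigurableCrawler, the configure method, and the example.com domain are illustrative, not crawler4j API): per-crawl settings live in static fields and are filled in by a static call made before controller.start.

    import java.io.File;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class ConfigurableCrawler extends WebCrawler {

        // Filled in by configure() below: CrawlController creates crawler instances
        // reflectively through the no-arg constructor, so settings cannot be passed
        // to a constructor and have to live in static fields instead.
        private static String crawlDomain;
        private static File storageFolder;

        public static void configure(String domain, String storageFolderName) {
            crawlDomain = domain;
            storageFolder = new File(storageFolderName);
            if (!storageFolder.exists()) {
                storageFolder.mkdirs();
            }
        }

        @Override
        public boolean shouldVisit(WebURL url) {
            return url.getURL().toLowerCase().startsWith(crawlDomain);
        }

        @Override
        public void visit(Page page) {
            // Process the page and store results under storageFolder.
        }
    }

In the controller, the configuration call would go right before the crawl is started, for example: ConfigurableCrawler.configure("http://www.example.com/", "data/images"); controller.start(ConfigurableCrawler.class, numberOfCrawlers);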
For more information, see the crawler4j source code and examples.

  2. jsoup Basics
    jsoup is an open-source Java HTML parser; its official site is:
    http://jsoup.org/
    jsoup's biggest strength, and the reason it is nicer to use than DOM4J for HTML parsing, is that it accepts jQuery-style selector syntax.
    For example:

    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
    Elements newsHeadlines = doc.select("#mp-itn b a");

The code above fetches, from the http://en.wikipedia.org/ page, the <a> elements inside <b> elements under the element whose id is mp-itn, exactly what the same jQuery selector would match.
For more on using jsoup, see its examples (under Cookbook Content on the right side of the homepage).
One thing worth spelling out is that jsoup mainly works with three kinds of objects: Document, Elements, and Element. Among them:
Document extends Element; it holds all of the page's data and is obtained through static methods of the Jsoup class;
Elements is a collection of Element objects;
Element is the entity class for a single page element. Besides the jQuery-selector-like select method, it offers a large number of DOM-manipulation methods in the style of JavaScript and jQuery, such as getElementById, text, addClass, and so on, as in the sketch below.
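A small self-contained sketch of these three types follows; the HTML string and the selectors in it are made up purely for illustration:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupDemo {
        public static void main(String[] args) {
            // Parse an in-memory HTML string into a Document (Document extends Element).
            String html = "<div id='news'><ul>"
                    + "<li class='item'><a href='/a'>First</a></li>"
                    + "<li class='item'><a href='/b'>Second</a></li>"
                    + "</ul></div>";
            Document doc = Jsoup.parse(html);

            // jQuery-style selection; the result is an Elements collection.
            Elements links = doc.select("#news li.item a");
            for (Element link : links) {
                System.out.println(link.text() + " -> " + link.attr("href"));
            }

            // DOM-style access and manipulation on a single Element.
            Element news = doc.getElementById("news");
            news.addClass("highlight");
            System.out.println(news.text());
        }
    }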

  3. javacsv Basics
    javacsv is an open-source Java tool for reading and writing CSV files; its official address is:
    http://www.csvreader.com/java_csv.php
    Reading and writing CSV files is simple enough to implement yourself, and there are plenty of examples online; the reason to use javacsv is that its code is concise and easy to use.
    For usage, see the official samples:
    http://www.csvreader.com/java_csv_samples.php
    One note: when a CSV file contains Chinese text, prefer FileReader (for reading) and FileWriter (for writing) over FileInputStream and FileOutputStream, to avoid garbled characters; a sketch follows.
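    As a minimal sketch of that advice (the file path and the columns are illustrative, not from the original article), writing a CSV with Chinese content and reading it back might look like this:

    import java.io.FileReader;
    import java.io.FileWriter;

    import com.csvreader.CsvReader;
    import com.csvreader.CsvWriter;

    public class CsvDemo {
        public static void main(String[] args) throws Exception {
            String path = "data/demo.csv"; // illustrative path

            // Write through FileWriter (a character stream) so Chinese text is not garbled.
            CsvWriter writer = new CsvWriter(new FileWriter(path, false), ',');
            writer.write("title");
            writer.write("price");
            writer.endRecord();
            writer.write("二手车示例");
            writer.write("12.8");
            writer.endRecord();
            writer.close();

            // Read back through FileReader rather than FileInputStream, for the same reason.
            CsvReader reader = new CsvReader(new FileReader(path), ',');
            reader.readHeaders();
            while (reader.readRecord()) {
                System.out.println(reader.get("title") + " / " + reader.get("price"));
            }
            reader.close();
        }
    }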
  4. Crawler Practice
    The goal of the practice below is to crawl every used-car listing on Souche (souche.com) and write the results out as a CSV file. The code is as follows:
    Maven pom.xml


    <dependencies>
        <dependency>
            <groupId>edu.uci.ics</groupId>
            <artifactId>crawler4j</artifactId>
            <version>3.5</version>
            <type>jar</type>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.7.3</version>
        </dependency>
        <dependency>
            <groupId>net.sourceforge.javacsv</groupId>
            <artifactId>javacsv</artifactId>
            <version>2.0</version>
        </dependency>
    </dependencies>
MyCrawler.java

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.regex.Pattern;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import com.csvreader.CsvWriter;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        // Skip binary and media resources.
        private final static Pattern FILTERS = Pattern
                .compile(".*(\\.(css|js|bmp|gif|jpe?g|ico"
                        + "|png|tiff?|mid|mp2|mp3|mp4"
                        + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                        + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        // Only listing pages of the form sale_car_list.html?carbrand=brand-N[&index=M] are visited.
        private final static String URL_PREFIX = "http://www.souche.com/pages/onsale/sale_car_list.html?";
        private final static Pattern URL_PARAMS_PATTERN = Pattern
                .compile("carbrand=brand-\\d+(&index=\\d+)?");

        private final static String CSV_PATH = "data/crawl/data.csv";
        private CsvWriter cw;
        private File csv;

        public MyCrawler() throws IOException {
            // Recreate the CSV file and write the header row.
            csv = new File(CSV_PATH);
            if (csv.isFile()) {
                csv.delete();
            }
            cw = new CsvWriter(new FileWriter(csv, true), ',');
            cw.write("title");
            cw.write("brand");
            cw.write("newPrice");
            cw.write("oldPrice");
            cw.write("mileage");
            cw.write("age");
            cw.write("stage");
            cw.endRecord();
            cw.close();
        }

        /**
         * You should implement this function to specify whether the given url
         * should be crawled or not (based on your crawling logic).
         */
        @Override
        public boolean shouldVisit(WebURL url) {
            String href = url.getURL().toLowerCase();
            if (FILTERS.matcher(href).matches() || !href.startsWith(URL_PREFIX)) {
                return false;
            }
            String[] strs = href.split("\\?");
            if (strs.length < 2) {
                return false;
            }
            if (!URL_PARAMS_PATTERN.matcher(strs[1]).matches()) {
                return false;
            }
            return true;
        }

        /**
         * This function is called when a page is fetched and ready to be processed
         * by your program.
         */
        @Override
        public void visit(Page page) {
            String url = page.getWebURL().getURL();
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                String html = htmlParseData.getHtml();
                Document doc = Jsoup.parse(html);
                String brand = doc.select("div.choose_item").first().text();
                Elements contents = doc.select("div.list_content");
                // Skip a full (20-item) listing page that has no index parameter,
                // presumably because the same items are reached again via index= URLs.
                if (contents.size() == 20 && !url.contains("index=")) {
                    return;
                } else {
                    System.out.println("URL: " + url);
                }
                for (Element c : contents) {
                    Element info = c.select(".list_content_carInfo").first();
                    String title = info.select("h1").first().text();
                    Elements prices = info.select(".list_content_price div");
                    String newPrice = prices.get(0).text();
                    String oldPrice = prices.get(1).text();
                    Elements others = info.select(".list_content_other div");
                    String mileage = others.get(0).select("ins").first().text();
                    String age = others.get(1).select("ins").first().text();
                    String stage = "unknown";
                    if (c.select("i.car_tag_zhijian").size() != 0) {
                        stage = c.select("i.car_tag_zhijian").text();
                    } else if (c.select("i.car_tag_yushou").size() != 0) {
                        stage = "presell";
                    }
                    try {
                        // Append one record per car; strip the currency/unit characters from prices.
                        cw = new CsvWriter(new FileWriter(csv, true), ',');
                        cw.write(title);
                        cw.write(brand);
                        cw.write(newPrice.replaceAll("[¥万]", ""));
                        cw.write(oldPrice.replaceAll("[¥万]", ""));
                        cw.write(mileage);
                        cw.write(age);
                        cw.write(stage);
                        cw.endRecord();
                        cw.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }

Controller.java

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class Controller {
        public static void main(String[] args) throws Exception {
            String crawlStorageFolder = "data/crawl/root";
            int numberOfCrawlers = 7;

            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder(crawlStorageFolder);

            /*
             * Instantiate the controller for this crawl.
             */
            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

            /*
             * For each crawl, you need to add some seed urls. These are the first
             * URLs that are fetched and then the crawler starts following links
             * which are found in these pages.
             *
             * Note: these seeds are carried over from the crawler4j sample; for the
             * Souche crawl they would need to be replaced with sale_car_list.html
             * URLs that match the URL_PREFIX filter in MyCrawler.
             */
            controller.addSeed("http://www.ics.uci.edu/~welling/");
            controller.addSeed("http://www.ics.uci.edu/~lopes/");
            controller.addSeed("http://www.ics.uci.edu/");

            /*
             * Start the crawl. This is a blocking operation, meaning that your code
             * will reach the line after this only when crawling is finished.
             */
            controller.start(MyCrawler.class, numberOfCrawlers);
        }
    }

Source: http://blog.csdn.net/sadfishsc/article/details/20614105
