Processing XML and HTML with Regular Expressions

分手后的思念是犯贱 2021-06-26 16:06
    <tr>
    <td>5345454354</td><td>2010-3-29 13:48:33</td><td>周杰伦</td>
    </tr>
    <tr>
    <td>6565465466</td><td>2010-3-29 15:34:38</td><td>张学友</td>
    </tr>
    <tr>
    <td>6546546546</td><td>2010-3-30 19:30:50</td><td>刘德华</td>
    </tr>
    <tr>
    <td>9875646545</td><td>2010-3-31 2:20:58</td><td>郭富城</td>
    </tr>
    <tr>
    <td>7868768768</td><td>2010-3-31 8:03:11</td><td>梁朝伟</td>
    </tr>

To extract the content between the <td> and </td> tags, you can approach it like this:

    using System.Text.RegularExpressions;

    // Approach 1: capture the cell content in group 1 of <td>(.*?)</td>.
    string str = "..........";
    string pstr = "<td>(.*?)</td>";
    MatchCollection mc = Regex.Matches(str, pstr);
    for (int i = 0; i < mc.Count; i++)
    {
        Response.Write(mc[i].Result("$1"));
    }

    // Approach 2: use lookarounds, so that the match itself is the cell content.
    MatchCollection mc2 = Regex.Matches(str, @"(?is)(?<=<td>).+?(?=</td>)");
    foreach (Match m in mc2)
    {
        // Response.Write(m.Value); // web
        MessageBox.Show(m.Value);
    }
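Since the second half of this post is in Java, here is a rough Java counterpart of the two approaches above. It is only a sketch: the class name `TdExtractor` and its method names are made up for illustration, and the sample row is shortened.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TdExtractor {
    // Approach 1: the cell content is captured in group 1.
    private static final Pattern CAPTURE =
            Pattern.compile("<td>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Approach 2: lookarounds, so the whole match is the cell content.
    private static final Pattern LOOKAROUND =
            Pattern.compile("(?is)(?<=<td>).+?(?=</td>)");

    static List<String> byCaptureGroup(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = CAPTURE.matcher(html);
        while (m.find()) {
            cells.add(m.group(1)); // text between <td> and </td>
        }
        return cells;
    }

    static List<String> byLookaround(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = LOOKAROUND.matcher(html);
        while (m.find()) {
            cells.add(m.group()); // the tags are asserted but not consumed
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<tr>\n<td>5345454354</td><td>2010-3-29 13:48:33</td>\n</tr>";
        System.out.println(byCaptureGroup(html)); // [5345454354, 2010-3-29 13:48:33]
        System.out.println(byLookaround(html));   // [5345454354, 2010-3-29 13:48:33]
    }
}
```

One difference worth knowing: `.*?` in the first pattern accepts an empty cell, while `.+?` in the second needs at least one character, so on `<td></td><td>x</td>` the lookaround version runs past the empty cell and returns `</td><td>x`.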

Explanation of the expression

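  • (?<=Expression) positive lookbehind: the text immediately to the left of the current position must match Expression
  • (?<!Expression) negative lookbehind: the text immediately to the left of the current position must not match Expression
  • (?=Expression) positive lookahead: the text immediately to the right of the current position must match Expression
  • (?!Expression) negative lookahead: the text immediately to the right of the current position must not match Expression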
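To make the four lookarounds concrete, here is a small Java sketch that runs each of them on a toy string. The class `LookaroundDemo`, the helper `findAll`, and the sample inputs are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String left = "x1 y2 x3 y4";
        String right = "1x 2y 3x 4y";
        // Positive lookbehind: digits preceded by x.
        System.out.println(findAll("(?<=x)\\d", left));  // [1, 3]
        // Negative lookbehind: digits not preceded by x.
        System.out.println(findAll("(?<!x)\\d", left));  // [2, 4]
        // Positive lookahead: digits followed by x.
        System.out.println(findAll("\\d(?=x)", right));  // [1, 3]
        // Negative lookahead: digits not followed by x.
        System.out.println(findAll("\\d(?!x)", right));  // [2, 4]
    }
}
```

In every case only the digit is returned; the neighbouring letter is checked but never becomes part of the match.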

    (?is)(?<=<td>).+?(?=</td>)

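  • (?is) inline mode modifiers: i makes matching case-insensitive; s turns on single-line mode, in which . also matches carriage returns and line feeds

  • (?<=<td>) positive lookbehind: the match must be immediately preceded by <td>, but <td> itself is not consumed, so the result does not include <td>
  • .+? lazy match of one or more arbitrary characters: after each character it takes, the engine tries the rest of the expression, and it takes another character only if that attempt fails, so the match ends at the first place where the rest succeeds
  • (?=</td>) positive lookahead: the match must be immediately followed by </td>, but </td> itself is not consumed, so the result does not include </td>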
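The effect of each piece is easiest to see by removing it. The following Java sketch does that on tiny inputs; `LazyVsGreedy` and `first` are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {
    // Returns the first match of regex in input, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String row = "<td>a</td><td>b</td>";
        // Lazy: stops at the first </td>.
        System.out.println(first("(?<=<td>).+?(?=</td>)", row)); // a
        // Greedy: runs on to the last </td>.
        System.out.println(first("(?<=<td>).+(?=</td>)", row));  // a</td><td>b

        String multiline = "<td>line1\nline2</td>";
        // Without s, the dot cannot cross the line break, so nothing matches.
        System.out.println(first("(?<=<td>).+?(?=</td>)", multiline) == null);  // true
        // With s, the cell content spanning two lines is matched.
        System.out.println(first("(?s)(?<=<td>).+?(?=</td>)", multiline)
                .equals("line1\nline2"));                                       // true

        // With i, upper-case tags are accepted as well.
        System.out.println(first("(?i)(?<=<td>).+?(?=</td>)", "<TD>x</TD>"));   // x
    }
}
```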

Is extracting XML content with a regex 50 times faster than dom4j?

    long t1 = System.nanoTime();
    String str = "<xml><ToUserName><![CDATA[gh_520f99dff7cc]]></ToUserName><FromUserName><![CDATA[oBAMOs3aZB0dkbILsBR1wksbmli4]]></FromUserName><CreateTime>1416900555</CreateTime><MsgType><![CDATA[event]]></MsgType><Event><![CDATA[MASSSENDJOBFINISH]]></Event><MsgID>2348714844</MsgID><Status><![CDATA[send success]]></Status><TotalCount>1</TotalCount><FilterCount>1</FilterCount><SentCount>1</SentCount><ErrorCount>0</ErrorCount></xml>";
    // dom4j version (commented out while the regex version is being timed):
    // Document doc = null;
    // try {
    //     doc = DocumentHelper.parseText(str);
    // } catch (DocumentException e) {
    //     log.error("Error parsing mass-send XML: " + e.getMessage(), e);
    // }
    //
    // Element root = doc.getRootElement();
    // String msgid = root.elementTextTrim("MsgID");
    // String Status = root.elementTextTrim("Status");
    // String TotalCount = root.elementTextTrim("TotalCount");
    // String FilterCount = root.elementTextTrim("FilterCount");
    // String SentCount = root.elementTextTrim("SentCount");
    // String ErrorCount = root.elementTextTrim("ErrorCount");

    // Regex version: each field is the text between its opening and closing tag.
    String msgid = RegExp.getString(str,
            "(?<=<MsgID>)[\\s\\S]*?(?=</MsgID>)").trim();
    String Status = RegExp.getString(str,
            "(?<=<Status><!\\[CDATA\\[)[\\s\\S]*?(?=\\]\\]></Status>)")
            .trim();
    String TotalCount = RegExp.getString(str,
            "(?<=<TotalCount>)[\\s\\S]*?(?=</TotalCount>)")
            .trim();
    String FilterCount = RegExp.getString(str,
            "(?<=<FilterCount>)[\\s\\S]*?(?=</FilterCount>)")
            .trim();
    String SentCount = RegExp.getString(str,
            "(?<=<SentCount>)[\\s\\S]*?(?=</SentCount>)")
            .trim();
    String ErrorCount = RegExp.getString(str,
            "(?<=<ErrorCount>)[\\s\\S]*?(?=</ErrorCount>)")
            .trim();
    long t2 = System.nanoTime();
    log.info(t2 - t1);              // elapsed time in nanoseconds
    log.info((t2 - t1) * 0.000001); // elapsed time in milliseconds
    log.info(msgid + ", " + Status + ", " + TotalCount + ", " + FilterCount + ", " + SentCount + ", " + ErrorCount);
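The snippet above times one single pass, so the figure it logs also includes one-off costs such as class loading and pattern compilation. If you want a steadier number, one possible sketch is to compile the pattern once, warm up, and average over many calls. The class `RegexTiming`, its method names, the shortened message and the iteration counts below are all assumptions made for this example; a dedicated harness such as JMH would be more rigorous.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTiming {
    // Compiled once and reused, so compilation is not counted on every call.
    private static final Pattern MSG_ID =
            Pattern.compile("(?<=<MsgID>)[\\s\\S]*?(?=</MsgID>)");

    static String extractMsgId(String xml) {
        Matcher m = MSG_ID.matcher(xml);
        return m.find() ? m.group().trim() : "";
    }

    // Average nanoseconds per call over `iterations` calls, after `warmup` untimed calls.
    static double averageNanos(String xml, int warmup, int iterations) {
        int sink = 0; // use the results so the loop body cannot be optimised away
        for (int i = 0; i < warmup; i++) {
            sink += extractMsgId(xml).length();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += extractMsgId(xml).length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) {
            System.out.println(sink); // never true here; only keeps sink live
        }
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        String xml = "<xml><MsgID>2348714844</MsgID>"
                + "<Status><![CDATA[send success]]></Status></xml>";
        System.out.println(extractMsgId(xml)); // 2348714844
        System.out.printf("average ns per call: %.1f%n", averageNanos(xml, 10_000, 100_000));
    }
}
```

The same loop could wrap the dom4j calls to compare the two approaches under like conditions.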

dom4j result:

    2014-11-26 15:25:29,716 INFO [Test] 70 - <220279310>
    2014-11-26 15:25:29,719 INFO [Test] 71 - <220.27930999999998> <== look here
    2014-11-26 15:25:29,719 INFO [Test] 72 - <2348714844, send success, 1, 1, 1, 0>

Regex result:

    2014-11-26 15:28:08,575 INFO [Test] 70 - <4633684>
    2014-11-26 15:28:08,578 INFO [Test] 71 - <4.633684> <== look here
    2014-11-26 15:28:08,578 INFO [Test] 72 - <2348714844</MsgID>, <![CDATA[send success]]></Status>, 1</TotalCount>, 1</FilterCount>, 1</SentCount>, 0</ErrorCount>>

Note that the values logged in this run still carry their closing tags (and, for Status, the CDATA wrapper), unlike the dom4j output. The lookaround patterns as listed above return only the text between the tags, so this log appears to come from a different version of the patterns.

Regex helper code:

    import java.util.ArrayList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegExp {
        // Returns every substring of source that regex matches.
        public static ArrayList<String> getStrs(String source, String regex) {
            Pattern p = Pattern.compile(regex);
            Matcher m = p.matcher(source);
            ArrayList<String> list = new ArrayList<>();
            while (m.find()) {
                list.add(m.group());
            }
            return list;
        }

        // Returns the first match, or "" if there is none.
        public static String getString(String source, String regex) {
            ArrayList<String> list = getStrs(source, regex);
            if (list.size() > 0) {
                return list.get(0);
            }
            return "";
        }

        // Returns every piece of text between beginStr and endStr.
        // isLong selects greedy matching (longest span); otherwise matching is lazy.
        public static ArrayList<String> getStrs(String source, String beginStr,
                String endStr, boolean isLong) {
            if (isLong) {
                return getStrs(source,
                        "(?<=" + replay(beginStr) + ")[\\s\\S]*(?=" + replay(endStr) +
                        ")");
            }
            return getStrs(source,
                    "(?<=" + replay(beginStr) + ")[\\s\\S]*?(?=" + replay(endStr) +
                    ")");
        }

        // Same as above, but returns only the first piece of text, or "" if there is none.
        public static String getString(String source, String beginStr,
                String endStr, boolean isLong) {
            if (isLong) {
                return getString(source,
                        "(?<=" + replay(beginStr) + ")[\\s\\S]*(?=" + replay(endStr) +
                        ")");
            }
            return getString(source,
                    "(?<=" + replay(beginStr) + ")[\\s\\S]*?(?=" + replay(endStr) +
                    ")");
        }

        // Escapes regex metacharacters so that source is matched literally.
        // Backslashes go first, and every later step must build on result.
        private static String replay(String source) {
            String result = source.replace("\\", "\\\\");
            result = result.replace(".", "\\.");
            result = result.replace("(", "\\(");
            result = result.replace(")", "\\)");
            result = result.replace("[", "\\[");
            result = result.replace("]", "\\]");
            result = result.replace("{", "\\{");
            result = result.replace("}", "\\}");
            result = result.replace("$", "\\$");
            result = result.replace("?", "\\?");
            result = result.replace("&", "\\&");
            result = result.replace("*", "\\*");
            result = result.replace("!", "\\!");
            result = result.replace("^", "\\^");
            result = result.replace("+", "\\+");
            result = result.replace("#", "\\#");
            result = result.replace("|", "\\|");
            return result;
        }
    }
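As an alternative to escaping metacharacters one by one, the standard library's `Pattern.quote` wraps a string so that it is matched literally. The sketch below uses it for the "text between two delimiters" helper. It is not the author's code: `BetweenExtractor` and `between` are hypothetical names, and the `greedy` flag plays the role of `isLong` above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenExtractor {
    // Returns the first piece of text between begin and end, or "" if there is none.
    // greedy = true takes the longest possible span, false the shortest.
    static String between(String source, String begin, String end, boolean greedy) {
        String body = greedy ? "[\\s\\S]*" : "[\\s\\S]*?";
        // Pattern.quote turns the delimiters into literals, whatever characters they contain.
        Pattern p = Pattern.compile(
                "(?<=" + Pattern.quote(begin) + ")" + body + "(?=" + Pattern.quote(end) + ")");
        Matcher m = p.matcher(source);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String xml = "<Status><![CDATA[send success]]></Status>";
        // Delimiters full of metacharacters need no manual escaping.
        System.out.println(between(xml, "<Status><![CDATA[", "]]></Status>", false)); // send success
        System.out.println(between("[a][b]", "[", "]", false)); // a
        System.out.println(between("[a][b]", "[", "]", true));  // a][b
    }
}
```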
