MapReduce - Word Frequency Count

布满荆棘的人生 2024-04-17

This example counts the word frequencies in a text file.

    package Test01;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) {
            try {
                // Get the configuration object
                Configuration conf = new Configuration();
                // Get the job instance
                Job job = Job.getInstance(conf, "WordCount02");
                // Set the main class of the job
                job.setJarByClass(WordCount.class);
                // Configure the map side of the job
                job.setMapperClass(MyMapper.class);
                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(LongWritable.class);
                // Configure the reduce side of the job
                job.setReducerClass(MyReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                // Set the job's input and output paths
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                // Submit the job and exit with its status code
                int success = job.waitForCompletion(true) ? 0 : 1;
                System.exit(success);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // Custom Mapper class
        public static class MyMapper extends Mapper<Object, Text, Text, LongWritable> {
            public Text k = new Text();
            public LongWritable v = new LongWritable(1L);

            // map function: the core business logic of the map phase
            @Override
            protected void map(Object key, Text value, Context context) {
                try {
                    // Each call to map reads one line of input
                    String row = value.toString();
                    // Split the line on spaces
                    String[] words = row.split(" ");
                    for (String st : words) {
                        // Skip empty tokens produced by consecutive spaces;
                        // otherwise the charAt(st.length() - 1) below would throw
                        if (st.isEmpty()) {
                            continue;
                        }
                        String st2;
                        char last = st.charAt(st.length() - 1);
                        // Keep the word as-is if it ends with a letter;
                        // otherwise strip the single trailing punctuation character
                        if ((last >= 'a' && last <= 'z') || (last >= 'A' && last <= 'Z')) {
                            st2 = st;
                        } else {
                            st2 = st.substring(0, st.length() - 1);
                        }
                        k.set(st2);
                        context.write(k, v);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        // Custom Reducer class
        public static class MyReducer extends Reducer<Text, LongWritable, Text, IntWritable> {
            public IntWritable v = new IntWritable();

            // reduce function: the core business logic of the reduce phase
            @Override
            protected void reduce(Text key, Iterable<LongWritable> value, Context context) {
                try {
                    // Counter: sum the 1s emitted for this word
                    int cnt = 0;
                    for (LongWritable i : value) {
                        cnt += i.get();
                    }
                    v.set(cnt);
                    // Write the reduce output
                    context.write(key, v);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
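The packaging steps below assume the project's pom.xml already declares the Hadoop client dependency. A minimal sketch of that fragment (the version number is an assumption; match it to your cluster's Hadoop release):

    <!-- Assumed pom.xml fragment: pick the version that matches your cluster -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
        <!-- provided: the yarn runtime already supplies the Hadoop classes -->
        <scope>provided</scope>
    </dependency>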

(1) Open the Maven panel on the right side of IDEA, then click Lifecycle -> clean (right-click) -> Run Maven Build.
(2) Then click install (right-click) -> Run Maven Build.
(3) Copy the jar from the target directory to the Hadoop server.
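Before submitting the job, the input file must already exist in HDFS. Assuming t1.txt sits in the current directory on the server, it can be uploaded with the standard HDFS commands (paths chosen to match step (4)):

    hdfs dfs -mkdir -p /test/input
    hdfs dfs -put t1.txt /test/input/

Note that the output directory /test/output/01 must not exist before the run; FileOutputFormat refuses to overwrite an existing path.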
(4) Run the job with the following command:

    yarn jar MapReducePractice-1.0-SNAPSHOT.jar Test01.WordCount /test/input/t1.txt /test/output/01
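The original post does not show the contents of t1.txt, so the following is only a reconstruction; it is consistent with the counts in step (5) and includes trailing punctuation to exercise the mapper's trimming logic:

    Hello World! Hello Dream.
    I Love You. You are my hope.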

(5) View the result:

[hadoop@hadoop105 ~]$ hdfs dfs -cat /test/output/01/part-r-00000
Dream 1
Hello 2
I 1
Love 1
World 1
You 2
are 1
hope 1
my 1
