Java in Big Data Scenarios: A Hands-On Hadoop MapReduce Example
In big data scenarios, the Java language together with Apache Hadoop's MapReduce framework can process and analyze data at scale.
Below is a simple hands-on MapReduce example that counts how many times each word appears in a text file:
- Create the Mapper class: it turns each line of the file into key-value pairs, where the key is a word and the value is 1.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value,
                       Context context) throws IOException, InterruptedException {
        // Split the line into words on runs of whitespace
        String[] words = value.toString().split("\\s+");
        // For each word, emit a (word, 1) pair
        for (String wordElement : words) {
            if (wordElement.isEmpty()) {
                continue; // split("\\s+") yields an empty leading token on lines that start with whitespace
            }
            word.set(wordElement);
            context.write(word, one);
        }
    }
}
- Create the Reducer class: it aggregates the key-value pairs produced in the map phase by summing all occurrences of the same word. For example, the pairs (hello, 1), (hello, 1), (hello, 1) from the map phase reduce to (hello, 3).
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable count = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
                          Context context) throws IOException, InterruptedException {
        // Sum up the occurrences of this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit the word and its total count
        count.set(sum);
        context.write(key, count);
    }
}
- Write the Hadoop job configuration file: it defines the basic settings of the MapReduce job, including the input and output paths (a programmatic alternative using the Job API is sketched after the file).
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Input and output paths -->
  <property>
    <name>mapreduce.input.fileinputformat.inputdir</name>
    <value>/path/to/your/textfile</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.outputdir</name>
    <value>/path/to/output/textfile</value>
  </property>
  <!-- Use YARN as the resource manager -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- The number of map tasks is derived from the input splits and
       cannot be set directly; only the reduce count is configurable -->
  <property>
    <name>mapreduce.job.reduces</name>
    <value>1</value>
  </property>
</configuration>
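In practice, jobs are more commonly wired up in code with the Job API than through a standalone XML file; the XML above can then be merged in via the -conf generic option. Below is a minimal driver sketch under that assumption (the class name WordCountDriver and the combiner wiring are our additions, not part of the original steps):
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Summing is associative, so the reducer can also run as a combiner
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Command-line paths override any paths set in jobconf.xml
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options such as -conf before run() sees the rest
        System.exit(ToolRunner.run(new WordCountDriver(), args));
    }
}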
- Run the MapReduce job: on the Hadoop cluster, execute your packaged program with the
hadoop jar
command.
For example, assuming the three classes above are packaged as wordcount.jar:
hadoop jar wordcount.jar WordCountDriver -conf /path/to/your/jobconf.xml /path/to/your/textfile /path/to/output/textfile
This submits the WordCountDriver job with jobconf.xml merged into its configuration; the last two arguments are the input path and the output directory, which must not exist before the run.
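After the job completes, each reducer writes its results to a part file in the output directory; you can inspect them with the usual HDFS shell (the part file name below assumes a single reducer, as configured above):
hdfs dfs -cat /path/to/output/textfile/part-r-00000
Each output line is a word, a tab, and its count, e.g. hello	3.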