MapReduce Programming (2): Merging Files and Removing Duplicates
Published: 2019-06-29


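1. Problem Description

Merge several input files, remove the duplicate lines among them, and write the deduplicated result to a single output file.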

Contents of file1.txt:

20150101     x
20150102     y
20150103     x
20150104     y

Contents of file2.txt:

20150105     z
20150106     x
20150101     y
20150102     y

Contents of file3.txt:

20150103     x
20150104     z
20150105     y

2. The MapReduce Program

The MapReduce program follows. For the runtime environment, see my previous post.

package com.javacore.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Created by bee on 17/3/25.
 */
public class FileMerge {

    public static class Map extends Mapper<Object, Text, Text, Text> {

        private static final Text EMPTY = new Text("");

        // Emit the whole input line as the key; the value carries no information.
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, EMPTY);
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        // Identical lines arrive grouped under one key, so writing the key
        // once per reduce call removes the duplicates.
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        // Delete the output directory. FileUtil is not defined in this post;
        // since nothing imports it, it is presumably a helper class in this package.
        FileUtil.deleteDir("output");

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        // Input glob and output directory are hard-coded here.
        String[] otherArgs = new String[]{"input/filemerge/f*.txt", "output"};
        if (otherArgs.length != 2) {
            System.err.println("Usage: Merge and duplicate removal");
            System.exit(2);
        }

        // Pass conf to the job so that the fs.defaultFS setting above takes effect.
        Job job = Job.getInstance(conf);
        job.setJarByClass(FileMerge.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
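The job removes duplicates because the shuffle step groups identical keys before the reducer sees them, and the reducer writes each key only once. The following is a minimal plain-Java sketch of that data flow, with no Hadoop involved. The class name `FileMergeSketch` and its method names are hypothetical and chosen only for illustration, and a `TreeMap` stands in for the framework's sort-and-group step, which is only a rough approximation of what Hadoop actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the original program.
public class FileMergeSketch {

    // "Map" + "shuffle": every line becomes a key with an empty value,
    // and the TreeMap groups identical keys and keeps them sorted.
    static Map<String, List<String>> mapAndShuffle(List<List<String>> files) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (List<String> file : files) {
            for (String line : file) {
                grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("");
            }
        }
        return grouped;
    }

    // "Reduce": ignore the grouped values and emit each key exactly once.
    static List<String> reduce(Map<String, List<String>> grouped) {
        return new ArrayList<>(grouped.keySet());
    }

    static List<String> mergeAndDeduplicate(List<List<String>> files) {
        return reduce(mapAndShuffle(files));
    }

    public static void main(String[] args) {
        // The three sample files from the problem description.
        List<List<String>> files = Arrays.asList(
            Arrays.asList("20150101     x", "20150102     y",
                          "20150103     x", "20150104     y"),
            Arrays.asList("20150105     z", "20150106     x",
                          "20150101     y", "20150102     y"),
            Arrays.asList("20150103     x", "20150104     z",
                          "20150105     y"));
        for (String line : mergeAndDeduplicate(files)) {
            System.out.println(line);
        }
    }
}
```

Running `main` on the three sample files prints nine distinct lines in sorted order. The eleven input lines contain two duplicates: `20150102     y` and `20150103     x` each appear in two files.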

3. Output

20150101     x
20150101     y
20150102     y
20150103     x
20150104     y
20150104     z
20150105     y
20150105     z
20150106     x

Reposted from: http://lrtwm.baihongyu.com/
