1. A Brief Introduction to Inverted Indexes

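An inverted index (also called a postings file or inverted file) is an indexing method used in full-text search to store a mapping from each word to its locations in a document or a set of documents.

It is the most commonly used data structure in document retrieval systems.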

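Take English as an example. Here are the texts to be indexed:

T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"

From these we get the following inverted file index:

"a":      {2}
"banana": {2}
"is":     {0, 1, 2}
"it":     {0, 1, 2}
"what":   {0, 1}

A query for the terms "what", "is" and "it" corresponds to the intersection of their sets: {0, 1} & {0, 1, 2} & {0, 1, 2} = {0, 1}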
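The English example above can be reproduced with a small in-memory sketch. It is only an illustration: the class name InvertedIndexSketch and its methods are made up for this example, and whitespace splitting is assumed to be an adequate tokenizer for these three sentences.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class, not part of the MapReduce job in this post.
class InvertedIndexSketch {

    // Build term -> sorted set of document ids, splitting each document on whitespace.
    static Map<String, Set<Integer>> build(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // AND query: intersect the posting sets of all query terms.
    static Set<Integer> search(Map<String, Set<Integer>> index, String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> postings = index.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        Map<String, Set<Integer>> index = build(docs);
        // prints {a=[2], banana=[2], is=[0, 1, 2], it=[0, 1, 2], what=[0, 1]}
        System.out.println(index);
        // prints [0, 1]
        System.out.println(search(index, "what", "is", "it"));
    }
}
```

Sorted maps and sets are used only so that the printed output is stable; a hash-based structure would work equally well for lookups.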

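For Chinese word segmentation, an open-source Chinese tokenizer can be used; this post uses ik-analyzer.

Prepare a few text files and write some content into them for testing.

The content of file1.txt is as follows: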

其实我们发现，互联网裁员潮频现甚至要高于其它行业领域

The content of file2.txt is as follows:

面对寒冬，互联网企业不得不调整人员结构，优化雇员的投入产出

The content of file3.txt is as follows:

在互联网内部，因为内部竞争机制以及要与竞争对手拼进度

The content of file4.txt is as follows:

互联网大公司职员尽管能够从复杂性和专业分工中受益互联网企业不得不调整人员结构

2. Adding Dependencies

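Besides the core Hadoop jars, add lucene-analyzers-common and ik-analyzers for Chinese word segmentation:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>6.0.0</version>
</dependency>
<dependency>
    <groupId>cn.bestwu</groupId>
    <artifactId>ik-analyzers</artifactId>
    <version>5.1.0</version>
</dependency>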

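3. The MapReduce Program

For configuring IK word segmentation under Lucene 6.0, refer to the corresponding configuration notes. The MapReduce program is as follows.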

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

/**
 * Created by bee on 4/4/17.
 *
 * IKAnalyzer6x and HadoopUtil are not defined in this listing; they are
 * assumed to be helper classes available in the same package.
 */
public class InvertIndexIk {

    public static class InvertMapper extends Mapper<Object, Text, Text, Text> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Name of the file that the current input split belongs to.
            String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
            Text fname = new Text(filename);

            IKAnalyzer6x analyzer = new IKAnalyzer6x(true);
            String line = value.toString();
            StringReader reader = new StringReader(line);
            TokenStream tokenStream = analyzer.tokenStream(line, reader);
            tokenStream.reset();
            CharTermAttribute termAttribute = tokenStream.getAttribute(CharTermAttribute.class);

            // Emit one (term, filename) pair per token.
            while (tokenStream.incrementToken()) {
                Text word = new Text(termAttribute.toString());
                context.write(word, fname);
            }
            // Finish and release the stream, as the TokenStream contract requires.
            tokenStream.end();
            tokenStream.close();
        }
    }

    public static class InvertReducer extends Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Count how many times the term occurs in each file.
            Map<String, Integer> map = new HashMap<String, Integer>();
            for (Text val : values) {
                if (map.containsKey(val.toString())) {
                    map.put(val.toString(), map.get(val.toString()) + 1);
                } else {
                    map.put(val.toString(), 1);
                }
            }

            // Total frequency of the term across all files.
            int termFreq = 0;
            for (String mapKey : map.keySet()) {
                termFreq += map.get(mapKey);
            }
            context.write(key, new Text(map.toString() + " " + termFreq));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Remove any previous output directory so the job can be re-run.
        HadoopUtil.deleteDir("output");
        Configuration conf = new Configuration();
        String[] otherargs = new String[]{"input/InvertIndex", "output"};
        if (otherargs.length != 2) {
            System.err.println("Usage: InvertIndexIk <in> <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf);
        job.setJarByClass(InvertIndexIk.class);
        job.setMapperClass(InvertIndexIk.InvertMapper.class);
        job.setReducerClass(InvertIndexIk.InvertReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherargs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherargs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
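To see what the mapper and reducer compute without a Hadoop cluster, the three phases can be imitated in plain Java. This is a rough sketch under stated assumptions, not the job itself: the class InvertIndexLocalSketch and its method names are hypothetical, whitespace splitting stands in for the IK tokenizer, and a TreeMap replaces the HashMap so that the printed order is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local imitation of the map, shuffle and reduce phases.
class InvertIndexLocalSketch {

    // "Map" phase: emit one (term, fileName) pair per token.
    // Whitespace splitting is an assumption that replaces the IK tokenizer.
    static List<String[]> map(String fileName, String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String term : line.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                pairs.add(new String[]{term, fileName});
            }
        }
        return pairs;
    }

    // "Shuffle" phase: group the emitted file names by term.
    static Map<String, List<String>> shuffle(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], t -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // "Reduce" phase: per-file counts followed by the total frequency,
    // formatted the same way as the value written by the reducer.
    static String reduce(List<String> fileNames) {
        Map<String, Integer> perFile = new TreeMap<>();
        int total = 0;
        for (String fileName : fileNames) {
            perFile.merge(fileName, 1, Integer::sum);
            total++;
        }
        return perFile + " " + total;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(map("T0", "it is what it is"));
        pairs.addAll(map("T1", "what is it"));
        pairs.addAll(map("T2", "it is a banana"));
        for (Map.Entry<String, List<String>> e : shuffle(pairs).entrySet()) {
            // e.g. prints: it	{T0=2, T1=1, T2=1} 4
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```

In the real job the grouping step is done by the framework between the map and reduce tasks; only the first and last methods correspond to code the author wrote.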

4. Results

The output is as follows:

专业分工    {file4.txt=1}  1
中    {file4.txt=1}  1
其实    {file1.txt=1}  1
互联网    {file1.txt=1, file3.txt=1, file4.txt=2, file2.txt=1}  5
人员    {file4.txt=1, file2.txt=1}  2
企业    {file4.txt=1, file2.txt=1}  2
优化    {file2.txt=1}  1
内部    {file3.txt=2}  2
发现    {file1.txt=1}  1
受益    {file4.txt=1}  1
复杂性    {file4.txt=1}  1
大公司    {file4.txt=1}  1
寒冬    {file2.txt=1}  1
投入产出    {file2.txt=1}  1
拼    {file3.txt=1}  1
潮    {file1.txt=1}  1
现    {file1.txt=1}  1
竞争对手    {file3.txt=1}  1
竞争机制    {file3.txt=1}  1
结构    {file4.txt=1, file2.txt=1}  2
职员    {file4.txt=1}  1
行业    {file1.txt=1}  1
裁员    {file1.txt=1}  1
要与    {file3.txt=1}  1
调整    {file4.txt=1, file2.txt=1}  2
进度    {file3.txt=1}  1
雇员    {file2.txt=1}  1
面对    {file2.txt=1}  1
领域    {file1.txt=1}  1
频    {file1.txt=1}  1
高于    {file1.txt=1}  1

The output has three columns: the term, the term's frequency in each individual file, and its total frequency across all files.
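The relationship between the second and third columns can be checked by parsing one output value back into numbers. The sketch below is only an illustration: the class IndexLineParser is hypothetical, and it assumes file names contain no comma, equals sign or closing brace, which holds for the sample files here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for a value such as "{file4.txt=1, file2.txt=1}  2".
class IndexLineParser {

    // Extract the per-file counts from the braces, keeping their original order.
    static Map<String, Integer> parsePostings(String value) {
        Map<String, Integer> postings = new LinkedHashMap<>();
        String body = value.substring(value.indexOf('{') + 1, value.indexOf('}')).trim();
        if (body.isEmpty()) {
            return postings;
        }
        for (String entry : body.split(",\\s*")) {
            int eq = entry.lastIndexOf('=');
            postings.put(entry.substring(0, eq), Integer.parseInt(entry.substring(eq + 1).trim()));
        }
        return postings;
    }

    // Extract the total frequency that follows the closing brace.
    static int parseTotal(String value) {
        return Integer.parseInt(value.substring(value.indexOf('}') + 1).trim());
    }

    public static void main(String[] args) {
        String value = "{file1.txt=1, file3.txt=1, file4.txt=2, file2.txt=1}  5";
        Map<String, Integer> postings = parsePostings(value);
        int sum = 0;
        for (int count : postings.values()) {
            sum += count;
        }
        // The total in the third column should equal the sum of the per-file counts.
        System.out.println(sum == parseTotal(value)); // prints true
    }
}
```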