Comparing bigrams in a text against words from an input file

Date: 2022-09-13 09:49:06

I want to extract special bigrams of the form (not, word2) in every document and replace the two words with a single number (1) if word2 exists in the words.txt file; otherwise they should not be replaced.

Here is my data (data.txt):

fit perfectly clie . purchased not instructions install helpful . improvement battery life not hoped .

product returned not fit nor solve problem ordered . company honest credited account .

cable good not work . cable extremely hot not recognize devices .
...

And the words.txt file:

hoped
instructions
work
fit
...

I've tried:

    import org.apache.spark.{SparkConf, SparkContext}

    object test {
      def main(args: Array[String]): Unit = {
        val conf1 = new SparkConf().setAppName("test").setMaster("local")
        val sc = new SparkContext(conf1)
        val searchList = sc.textFile("data/words.txt")
        val searchBigram = searchList.map(word => ("not", word)).collect.toSet
        val sample1 = sc.textFile("data/data.txt")
        val sample2 = sample1.map(s => s.split("""\.""") // split on .
          .map(_.split(" ") // split on space
            .sliding(2) // take continuous pairs
            .map { case Array(a, b) => (a, b) }
          ).map(elem => if (searchBigram.contains(elem)) ("1", "1") else elem)
          .map { case (e1, e2) => e1 }.mkString(" "))
        sample2.foreach(println)
      }
    }

The expected output is:

fit perfectly clie . purchased 1 install helpful . improvement battery life 1 .

product returned 1 nor solve problem ordered . company honest credited account .

cable good 1 . cable extremely hot 1 devices . 
...

My code above is incomplete and does not work. Can anybody help me?

1 solution

#1


If you want to stick to the bigram approach, I think it would work better to make bigrams out of the search items as well.

val searchList  = sc.textFile("input_file")
// turn the search words into bigrams as well and collect them as a set,
// assuming that this list is relatively small and fits in memory
val searchBigram = searchList.map(word => ("not", word)).collect.toSet

Now, starting from the result of '.sliding(2)', we can transform the arrays into tuples:

val sample = "improvement battery life not hoped".split(" ")
// bigrams is an iterator of (improvement,battery), (battery,life), (life,not), (not,hoped)
val bigrams = sample.sliding(2).map{case Array(e1,e2) => (e1,e2)}

// Now we use our bigram search set to find/replace the matching bigrams
// -> (improvement,battery), (battery,life), (life,not), (1,1)
val replaced = bigrams.map(elem => if (searchBigram.contains(elem)) ("1", "1") else elem)

// We undo the tuples to obtain the modified string
val result = replaced.map{case (e1,e2) => e1}.mkString(" ")
// result:String = improvement battery life 1

Integrate that idea into the larger program and that should result in a working process.
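One caveat when integrating: keeping only `e1` from each tuple drops the last word of the input, and the word that follows a replaced pair still shows up as the `e1` of the next tuple (for example, "purchased not instructions install helpful" would come out as "purchased 1 instructions install"). A minimal sketch of one way around this, in plain Scala, is to walk the tokens left to right and consume both words of a matching pair. This swaps the tuple-based replacement for a token walk; the names `NotBigramReplace` and `replaceNegated` are hypothetical, and the word set is hard-coded from the visible part of words.txt purely for illustration.

```scala
object NotBigramReplace {
  // Replace every pair ("not", w) with "1" when w is in searchWords.
  // Tokens are consumed left to right, so a replaced pair uses up both words.
  def replaceNegated(line: String, searchWords: Set[String]): String = {
    @annotation.tailrec
    def loop(tokens: List[String], acc: List[String]): List[String] = tokens match {
      case "not" :: w :: rest if searchWords.contains(w) => loop(rest, "1" :: acc)
      case t :: rest                                     => loop(rest, t :: acc)
      case Nil                                           => acc.reverse
    }
    // Splitting on a space and dropping empty tokens normalizes repeated spaces.
    loop(line.split(" ").filter(_.nonEmpty).toList, Nil).mkString(" ")
  }

  def main(args: Array[String]): Unit = {
    // Illustrative word set; in the real program it would come from words.txt.
    val words = Set("hoped", "instructions", "work", "fit")
    val line = "fit perfectly clie . purchased not instructions install helpful . improvement battery life not hoped ."
    println(replaceNegated(line, words))
    // prints: fit perfectly clie . purchased 1 install helpful . improvement battery life 1 .
  }
}
```

In the Spark program from the question, this would presumably be applied per line, along the lines of `sample1.map(line => NotBigramReplace.replaceNegated(line, searchWords))`, where `searchWords` is the collected content of words.txt as a `Set[String]` (again assuming it is small enough to fit in memory).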
