如何从目录或文件夹中读取重复的单词计数

时间:2023-02-06 10:33:01

I got this below program from an coding site.

我从编码站点得到了以下程序。

The following code read text file and find duplicate words.

以下代码读取文本文件并查找重复的单词。

To read from each text files and display it's duplicate words count line by line. And how to call that files if it is not stored as String, I used buffered reader but I am not getting my output.

从每个文本文件中读取并显示它的重复单词逐行计数。如果它没有存储为String,如何调用该文件,我使用缓冲读取器,但我没有得到我的输出。

My questions:

我的问题:

  1. How can I make the program read multiple files from given folder?

    如何让程序从给定文件夹中读取多个文件?

  2. How to save the results in Excel file format?

    如何以Excel文件格式保存结果?

Any suggestions Welcomed.

任何建议欢迎。

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.Map.Entry;


public class MaxDuplicateWordCount {

    public Map<String, Integer> getWordCount(String fileName){

        FileInputStream fis = null;
        DataInputStream dis = null;
        BufferedReader br = null;
        Map<String, Integer> wordMap = new HashMap<String, Integer>();

        try {
            fis = new FileInputStream(fileName);
            dis = new DataInputStream(fis);
            br = new BufferedReader(new InputStreamReader(dis));
            String line = null; 
            while((line = br.readLine()) != null){
                StringTokenizer st = new StringTokenizer(line, " ");
                while(st.hasMoreTokens()){
                    String tmp = st.nextToken().toLowerCase();
                    if(wordMap.containsKey(tmp)){
                        wordMap.put(tmp, wordMap.get(tmp)+1);
                    } else {
                        wordMap.put(tmp, 1);
                    }
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally{
            try{if(br != null) br.close();}catch(Exception ex){}
        }
        return wordMap;
    }

    public List<Entry<String, Integer>> sortByValue(Map<String, Integer> wordMap){

        Set<Entry<String, Integer>> set = wordMap.entrySet();
        List<Entry<String, Integer>> list = new ArrayList<Entry<String, Integer>>(set);
        Collections.sort( list, new Comparator<Map.Entry<String, Integer>>()
        {
            public int compare( Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2 )
            {
                return (o2.getValue()).compareTo( o1.getValue() );
            }
        } );
        return list;
    }

    public static void main(String a[]){



        MaxDuplicateWordCount mdc = new MaxDuplicateWordCount();
        Map<String, Integer> wordMap = mdc.getWordCount("E:\\Blog 39.txt");

        List<Entry<String, Integer>> list = mdc.sortByValue(wordMap);
        for(Map.Entry<String, Integer> entry:list){
            System.out.println(entry.getKey()+" ="+entry.getValue());
        }
    }
}

2 个解决方案

#1


2  

Intro

After chatting with OP, here is briefly what OP requires:

在与OP聊天之后,简要介绍OP需要什么:

1- Read file/s from specific folder, files are typically Unicode as text files.
2- The files will be process in OP Algorithm in the Question, and the results of the Algorithm should be saved on Unicode file again (Later OP asked to be saved as Excel file (.XLS) because of Unicode compatibility with Excel)

1-从特定文件夹中读取文件,文件通常是Unicode作为文本文件。 2-文件将在问题中的OP算法中处理,并且算法的结果应该再次保存在Unicode文件中(由于Unicode与Excel的兼容性,后来OP要求保存为Excel文件(.XLS))

Solution

This can be solved in following steps:

这可以通过以下步骤解决:

step 1 We define (declare) our work-space
step 2 We create output folder in work-space if not exist
step 3 We read all existing files in work-space folder and process them in the Algorithm.
step 4 The results of each file will saved as Excel file in output folder.

步骤1我们定义(声明)我们的工作空间步骤2我们在工作空间中创建输出文件夹(如果不存在)步骤3我们读取工作空间文件夹中的所有现有文件并在算法中处理它们。步骤4每个文件的结果将保存为输出文件夹中的Excel文件。

The code

First of all you need to import POI package, this will allow you to create XLS sheet. I have downloaded this poi/poi-3.5-FINAL.jar.zip( 1,372 k) and the following imports should added to your code.

首先,您需要导入POI包,这将允许您创建XLS表。我已下载此poi / poi-3.5-FINAL.jar.zip(1,372 k),并且以下导入应添加到您的代码中。

import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hssf.usermodel.HSSFRow;

Next you added following code to your code, it is self explainable code:

接下来,您将以下代码添加到您的代码中,它是可自行解释的代码:

final static String WORKSPACE = "C:/testfolder/";

private static void createOutputFolder(String outputFolderName) {
    File outputDirectory = new File(WORKSPACE + outputFolderName);

    if (!outputDirectory.exists()) {
        try {
            outputDirectory.mkdir();
        } catch (Exception e) {
        }
    }
}

private static void exlCreator() {

    String outputFolder = "output/";
    String fileName, fileNameWPathInput;
    int serialNumber = 1;
    createOutputFolder(outputFolder);

    MaxDuplicateWordCount mdc = new MaxDuplicateWordCount();
    File folder = new File(WORKSPACE);
    File[] listOfFiles = folder.listFiles();

    for (int i = 0; i < listOfFiles.length; i++) {
        if (listOfFiles[i].isFile()) {
            fileName = listOfFiles[i].getName();
            fileNameWPathInput = WORKSPACE + fileName;
            Map<String, Integer> wordMap = mdc.getWordCount(fileNameWPathInput);
            List<Entry<String, Integer>> list = mdc.sortByValue(wordMap);
            String fileNameWPathOutput = WORKSPACE + outputFolder +
                    fileName.substring(0, fileName.length() - 4)
                    + "output.xls";
            try {
                HSSFWorkbook workbook = new HSSFWorkbook();
                HSSFSheet sheet = workbook.createSheet("ResultSheet");

                HSSFRow rowhead = sheet.createRow((short) 0);
                rowhead.createCell(0).setCellValue("Serial No.");
                rowhead.createCell(1).setCellValue("Word");
                rowhead.createCell(2).setCellValue("Count");

                for (Map.Entry<String, Integer> entry : list) {
                    HSSFRow row = sheet.createRow((short) serialNumber);
                    row.createCell(0).setCellValue(serialNumber);
                    row.createCell(1).setCellValue(entry.getKey());
                    row.createCell(2).setCellValue(entry.getValue());
                    serialNumber++;
                }
                FileOutputStream fileOut = new FileOutputStream(fileNameWPathOutput);
                workbook.write(fileOut);
                fileOut.close();
                serialNumber = 1;
                System.out.println(fileNameWPathOutput + " is created");

            } catch (Exception ex) {
                System.out.println(ex);
            }
        }
    }


}

public static void main(String [] args) throws IOException {
    exlCreator();
}

Finally

By manipulating the code, it is possible to create one output file but create each output results in work sheets. As you can see in the image below, the output file is opened in Excel showing Unicode text with out problem, as it was the issue in my first solution: 如何从目录或文件夹中读取重复的单词计数

通过操作代码,可以创建一个输出文件,但可以在工作表中创建每个输出结果。正如您在下图中看到的那样,输出文件在Excel中打开,显示没有问题的Unicode文本,因为这是我的第一个解决方案中的问题:

Links

Download POI
POI documentation
Unicode problem in CSV
More about CSV

下载POI POI文档CSV中的Unicode问题有关CSV的更多信息

Full code, requested from OP

import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.Map.Entry;
//for Excel ark
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hssf.usermodel.HSSFRow;

public class MaxDuplicateWordCount {

    public Map<String, Integer> getWordCount(String fileName) {

        FileInputStream fis = null;
        DataInputStream dis = null;
        BufferedReader br = null;
        Map<String, Integer> wordMap = new HashMap<String, Integer>();

        try {
            fis = new FileInputStream(fileName);
            dis = new DataInputStream(fis);
            br = new BufferedReader(new InputStreamReader(dis));
            String line = null;
            while ((line = br.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(line, " ");
                while (st.hasMoreTokens()) {
                    String tmp = st.nextToken().toLowerCase();
                    if (wordMap.containsKey(tmp)) {
                        wordMap.put(tmp, wordMap.get(tmp) + 1);
                    } else {
                        wordMap.put(tmp, 1);
                    }
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) br.close();
            } catch (Exception ex) {
            }
        }
        return wordMap;
    }

    public List<Entry<String, Integer>> sortByValue(Map<String, Integer> wordMap) {

        Set<Entry<String, Integer>> set = wordMap.entrySet();
        List<Entry<String, Integer>> list = new ArrayList<Entry<String, Integer>>(set);
        Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {

            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {

                return (o2.getValue()).compareTo(o1.getValue());
            }


        });
        return list;
    }

    final static String WORKSPACE = "C:/testfolder/";

    private static void createOutputFolder(String outputFolderName) {
        File outputDirectory = new File(WORKSPACE + outputFolderName);

        if (!outputDirectory.exists()) {
            try {
                outputDirectory.mkdir();
            } catch (Exception e) {
            }
        }
    }

    private static void exlCreator() {

        String outputFolder = "output/";
        String fileName, fileNameWPathInput;
        int serialNumber = 1;
        createOutputFolder(outputFolder);

        MaxDuplicateWordCount mdc = new MaxDuplicateWordCount();
        File folder = new File(WORKSPACE);
        File[] listOfFiles = folder.listFiles();

        for (int i = 0; i < listOfFiles.length; i++) {
            if (listOfFiles[i].isFile()) {
                fileName = listOfFiles[i].getName();
                fileNameWPathInput = WORKSPACE + fileName;
                Map<String, Integer> wordMap = mdc.getWordCount(fileNameWPathInput);
                List<Entry<String, Integer>> list = mdc.sortByValue(wordMap);
                String fileNameWPathOutput = WORKSPACE + outputFolder +
                        fileName.substring(0, fileName.length() - 4)
                        + "output.xls";
                try {
                    HSSFWorkbook workbook = new HSSFWorkbook();
                    HSSFSheet sheet = workbook.createSheet("ResultSheet");

                    HSSFRow rowhead = sheet.createRow((short) 0);
                    rowhead.createCell(0).setCellValue("Serial No.");
                    rowhead.createCell(1).setCellValue("Word");
                    rowhead.createCell(2).setCellValue("Count");

                    for (Map.Entry<String, Integer> entry : list) {
                        HSSFRow row = sheet.createRow((short) serialNumber);
                        row.createCell(0).setCellValue(serialNumber);
                        row.createCell(1).setCellValue(entry.getKey());
                        row.createCell(2).setCellValue(entry.getValue());
                        serialNumber++;
                    }
                    FileOutputStream fileOut = new FileOutputStream(fileNameWPathOutput);
                    workbook.write(fileOut);
                    fileOut.close();
                    serialNumber = 1;
                    System.out.println(fileNameWPathOutput + " is created");

                } catch (Exception ex) {
                    System.out.println(ex);
                }
            }
        }


    }

    public static void main(String[] args) throws IOException {
        exlCreator();
    }
}

#2


0  

Let say you have a directory with all the files you want to read from.

假设您有一个包含要读取的所有文件的目录。

File folder = new File("/Users/you/folder/");
File[] listOfFiles = folder.listFiles();

for (File file : listOfFiles) {

    if (file.isFile()) {
        /*
         * Here if your file is not a text file 
         * If I undersood you correct:
         *      "And how to call that files if it is not stored as String"
         * you can get it as byte[] and parse it to String
         */
        byte[] bytes = Files.readAllBytes(file.toPath());
        String decoded = new String(bytes, "UTF-8");
        String[] words = decoded.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            /*  You may want to check for a non-word character before blindly
             *  performing a replacement
             *  It may also be necessary to adjust the character class
             */
             words[i] = words[i].replaceAll("[^\\w]", "");
             //Here are all the words from a file. You can do whatever you want with them
         }
     }

}

#1


2  

Intro

After chatting with OP, here is briefly what OP requires:

在与OP聊天之后,简要介绍OP需要什么:

1- Read file/s from specific folder, files are typically Unicode as text files.
2- The files will be process in OP Algorithm in the Question, and the results of the Algorithm should be saved on Unicode file again (Later OP asked to be saved as Excel file (.XLS) because of Unicode compatibility with Excel)

1-从特定文件夹中读取文件,文件通常是Unicode作为文本文件。 2-文件将在问题中的OP算法中处理,并且算法的结果应该再次保存在Unicode文件中(由于Unicode与Excel的兼容性,后来OP要求保存为Excel文件(.XLS))

Solution

This can be solved in following steps:

这可以通过以下步骤解决:

step 1 We define (declare) our work-space
step 2 We create output folder in work-space if not exist
step 3 We read all existing files in work-space folder and process them in the Algorithm.
step 4 The results of each file will saved as Excel file in output folder.

步骤1我们定义(声明)我们的工作空间步骤2我们在工作空间中创建输出文件夹(如果不存在)步骤3我们读取工作空间文件夹中的所有现有文件并在算法中处理它们。步骤4每个文件的结果将保存为输出文件夹中的Excel文件。

The code

First of all you need to import POI package, this will allow you to create XLS sheet. I have downloaded this poi/poi-3.5-FINAL.jar.zip( 1,372 k) and the following imports should added to your code.

首先,您需要导入POI包,这将允许您创建XLS表。我已下载此poi / poi-3.5-FINAL.jar.zip(1,372 k),并且以下导入应添加到您的代码中。

import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hssf.usermodel.HSSFRow;

Next you added following code to your code, it is self explainable code:

接下来,您将以下代码添加到您的代码中,它是可自行解释的代码:

final static String WORKSPACE = "C:/testfolder/";

private static void createOutputFolder(String outputFolderName) {
    File outputDirectory = new File(WORKSPACE + outputFolderName);

    if (!outputDirectory.exists()) {
        try {
            outputDirectory.mkdir();
        } catch (Exception e) {
        }
    }
}

private static void exlCreator() {

    String outputFolder = "output/";
    String fileName, fileNameWPathInput;
    int serialNumber = 1;
    createOutputFolder(outputFolder);

    MaxDuplicateWordCount mdc = new MaxDuplicateWordCount();
    File folder = new File(WORKSPACE);
    File[] listOfFiles = folder.listFiles();

    for (int i = 0; i < listOfFiles.length; i++) {
        if (listOfFiles[i].isFile()) {
            fileName = listOfFiles[i].getName();
            fileNameWPathInput = WORKSPACE + fileName;
            Map<String, Integer> wordMap = mdc.getWordCount(fileNameWPathInput);
            List<Entry<String, Integer>> list = mdc.sortByValue(wordMap);
            String fileNameWPathOutput = WORKSPACE + outputFolder +
                    fileName.substring(0, fileName.length() - 4)
                    + "output.xls";
            try {
                HSSFWorkbook workbook = new HSSFWorkbook();
                HSSFSheet sheet = workbook.createSheet("ResultSheet");

                HSSFRow rowhead = sheet.createRow((short) 0);
                rowhead.createCell(0).setCellValue("Serial No.");
                rowhead.createCell(1).setCellValue("Word");
                rowhead.createCell(2).setCellValue("Count");

                for (Map.Entry<String, Integer> entry : list) {
                    HSSFRow row = sheet.createRow((short) serialNumber);
                    row.createCell(0).setCellValue(serialNumber);
                    row.createCell(1).setCellValue(entry.getKey());
                    row.createCell(2).setCellValue(entry.getValue());
                    serialNumber++;
                }
                FileOutputStream fileOut = new FileOutputStream(fileNameWPathOutput);
                workbook.write(fileOut);
                fileOut.close();
                serialNumber = 1;
                System.out.println(fileNameWPathOutput + " is created");

            } catch (Exception ex) {
                System.out.println(ex);
            }
        }
    }


}

public static void main(String [] args) throws IOException {
    exlCreator();
}

Finally

By manipulating the code, it is possible to create one output file but create each output results in work sheets. As you can see in the image below, the output file is opened in Excel showing Unicode text with out problem, as it was the issue in my first solution: 如何从目录或文件夹中读取重复的单词计数

通过操作代码,可以创建一个输出文件,但可以在工作表中创建每个输出结果。正如您在下图中看到的那样,输出文件在Excel中打开,显示没有问题的Unicode文本,因为这是我的第一个解决方案中的问题:

Links

Download POI
POI documentation
Unicode problem in CSV
More about CSV

下载POI POI文档CSV中的Unicode问题有关CSV的更多信息

Full code, requested from OP

import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.Map.Entry;
//for Excel ark
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hssf.usermodel.HSSFRow;

public class MaxDuplicateWordCount {

    public Map<String, Integer> getWordCount(String fileName) {

        FileInputStream fis = null;
        DataInputStream dis = null;
        BufferedReader br = null;
        Map<String, Integer> wordMap = new HashMap<String, Integer>();

        try {
            fis = new FileInputStream(fileName);
            dis = new DataInputStream(fis);
            br = new BufferedReader(new InputStreamReader(dis));
            String line = null;
            while ((line = br.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(line, " ");
                while (st.hasMoreTokens()) {
                    String tmp = st.nextToken().toLowerCase();
                    if (wordMap.containsKey(tmp)) {
                        wordMap.put(tmp, wordMap.get(tmp) + 1);
                    } else {
                        wordMap.put(tmp, 1);
                    }
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) br.close();
            } catch (Exception ex) {
            }
        }
        return wordMap;
    }

    public List<Entry<String, Integer>> sortByValue(Map<String, Integer> wordMap) {

        Set<Entry<String, Integer>> set = wordMap.entrySet();
        List<Entry<String, Integer>> list = new ArrayList<Entry<String, Integer>>(set);
        Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {

            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {

                return (o2.getValue()).compareTo(o1.getValue());
            }


        });
        return list;
    }

    final static String WORKSPACE = "C:/testfolder/";

    private static void createOutputFolder(String outputFolderName) {
        File outputDirectory = new File(WORKSPACE + outputFolderName);

        if (!outputDirectory.exists()) {
            try {
                outputDirectory.mkdir();
            } catch (Exception e) {
            }
        }
    }

    private static void exlCreator() {

        String outputFolder = "output/";
        String fileName, fileNameWPathInput;
        int serialNumber = 1;
        createOutputFolder(outputFolder);

        MaxDuplicateWordCount mdc = new MaxDuplicateWordCount();
        File folder = new File(WORKSPACE);
        File[] listOfFiles = folder.listFiles();

        for (int i = 0; i < listOfFiles.length; i++) {
            if (listOfFiles[i].isFile()) {
                fileName = listOfFiles[i].getName();
                fileNameWPathInput = WORKSPACE + fileName;
                Map<String, Integer> wordMap = mdc.getWordCount(fileNameWPathInput);
                List<Entry<String, Integer>> list = mdc.sortByValue(wordMap);
                String fileNameWPathOutput = WORKSPACE + outputFolder +
                        fileName.substring(0, fileName.length() - 4)
                        + "output.xls";
                try {
                    HSSFWorkbook workbook = new HSSFWorkbook();
                    HSSFSheet sheet = workbook.createSheet("ResultSheet");

                    HSSFRow rowhead = sheet.createRow((short) 0);
                    rowhead.createCell(0).setCellValue("Serial No.");
                    rowhead.createCell(1).setCellValue("Word");
                    rowhead.createCell(2).setCellValue("Count");

                    for (Map.Entry<String, Integer> entry : list) {
                        HSSFRow row = sheet.createRow((short) serialNumber);
                        row.createCell(0).setCellValue(serialNumber);
                        row.createCell(1).setCellValue(entry.getKey());
                        row.createCell(2).setCellValue(entry.getValue());
                        serialNumber++;
                    }
                    FileOutputStream fileOut = new FileOutputStream(fileNameWPathOutput);
                    workbook.write(fileOut);
                    fileOut.close();
                    serialNumber = 1;
                    System.out.println(fileNameWPathOutput + " is created");

                } catch (Exception ex) {
                    System.out.println(ex);
                }
            }
        }


    }

    public static void main(String[] args) throws IOException {
        exlCreator();
    }
}

#2


0  

Let say you have a directory with all the files you want to read from.

假设您有一个包含要读取的所有文件的目录。

File folder = new File("/Users/you/folder/");
File[] listOfFiles = folder.listFiles();

for (File file : listOfFiles) {

    if (file.isFile()) {
        /*
         * Here if your file is not a text file 
         * If I undersood you correct:
         *      "And how to call that files if it is not stored as String"
         * you can get it as byte[] and parse it to String
         */
        byte[] bytes = Files.readAllBytes(file.toPath());
        String decoded = new String(bytes, "UTF-8");
        String[] words = decoded.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            /*  You may want to check for a non-word character before blindly
             *  performing a replacement
             *  It may also be necessary to adjust the character class
             */
             words[i] = words[i].replaceAll("[^\\w]", "");
             //Here are all the words from a file. You can do whatever you want with them
         }
     }

}