Reading a .txt file line by line in Java

Time: 2022-11-20 20:20:29

I am trying to read a .txt file in Java and build a list of lists, where each line of the file becomes its own inner list. This worked for every file I tried, but for the facebook_combined.txt.gz file, which is at this link, it does not give the result I expect. Example:

If the first line of another .txt file is 52 99 45 61 70 45 and the second is 70 80 65 91, then my code should create a list of lists named lines that looks like this:

lines = [[52, 99, 45, 61, 70, 45], [70, 80, 65, 91]]

But for the facebook_combined.txt file, if we suppose that its first line is 0 10 20 30 40 50, the same code creates the list of lists lines like this:

lines = [[0, 1], [0, 2], [0, 3], [0, 4], [0, 5], [0, ...]]

The code I use is below:

ArrayList<ArrayList<String>> lines = new ArrayList<ArrayList<String>>();

//read the file
FileInputStream fstream = new FileInputStream("C:\\Users\\facebook_combined.txt");
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));

while (true)//while the file was read
{
    String line = br.readLine();//split the file into the lines
    if (line == null) 
    {
        break;//if there are no more lines left
    }

    Scanner tokenize = new Scanner(line);// split the lines into tokens and make into an arraylist
    ArrayList<String> tokens = new ArrayList<String>();

    while (tokenize.hasNext()) //while there are still more
    {
        tokens.add(tokenize.next());
    }
    lines.add(tokens);
}
br.close();

3 answers

#1


I downloaded the dataset and extracted the text file with 7-Zip, and it looks like your program is working. After extraction, the data looks something like this (viewed in Notepad++):

0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
...etc...

I also opened the file with regular Notepad, and the line breaks are not visible there, probably because the file uses Unix-style LF line endings, which older versions of Notepad do not render. That may have caused the confusion: in Notepad the data looks like 0 10 20 30 40... because each line runs straight into the next.
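One way to settle what the file really contains, rather than trusting an editor's rendering, is to count the line terminators directly. The following is a minimal sketch; the class name LineEndingCheck and the method countLineEndings are made up for illustration, and the sample text in main only stands in for the real file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class LineEndingCheck {

    /** Returns {crlfCount, loneLfCount, loneCrCount} for everything the reader yields. */
    static int[] countLineEndings(Reader reader) throws IOException {
        int crlf = 0, lf = 0, cr = 0;
        int prev = -1;
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '\n') {
                if (prev == '\r') {
                    // The preceding CR belongs to a CRLF pair, so move it over.
                    crlf++;
                    cr--;
                } else {
                    lf++;
                }
            } else if (c == '\r') {
                cr++;
            }
            prev = c;
        }
        return new int[] {crlf, lf, cr};
    }

    public static void main(String[] args) throws IOException {
        // Sample text with LF-only endings (an assumption about the real file).
        int[] counts = countLineEndings(new StringReader("0 1\n0 2\n0 3\n"));
        System.out.println("CRLF=" + counts[0] + " LF=" + counts[1] + " CR=" + counts[2]);
        // prints: CRLF=0 LF=3 CR=0
    }
}
```

For a real file, pass a new BufferedReader(new FileReader(path)) instead of the StringReader. A non-zero LF count with a zero CRLF count would explain why the lines appear fused in older Notepad.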


EDIT: Updated explanation

In response to the OP:

"You are right about how the data looks in Notepad++, but the right version is 0 10 20 30"

I am not sure that is correct. Remember Occam's razor: you are assuming the data should be parsed as 0 10 20 30 even though the file contains very explicit line breaks. If the file was not supposed to have line breaks, it would not have them. Nor does it look like a formatting error in the file, since the format is consistently a pair of numbers followed by a line break. Nothing points to the data being meant as 0 10 20 30 40 . . .

The file facebook_combined.txt looks to be a list of edges in a graph, where each edge is a friendship between two people.
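If the file is indeed an edge list, a natural way to load it is into an adjacency map rather than a list of token lists. Here is a minimal sketch under two assumptions that the file itself does not state: each line holds exactly two whitespace-separated integer ids, and friendships are undirected. The names EdgeListReader and readEdges are illustrative only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class EdgeListReader {

    /** Reads "a b" pairs, one per line, into a sorted adjacency map. */
    static Map<Integer, Set<Integer>> readEdges(Reader source) throws IOException {
        Map<Integer, Set<Integer>> adjacency = new TreeMap<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] parts = line.split("\\s+");
                int a = Integer.parseInt(parts[0]);
                int b = Integer.parseInt(parts[1]);
                // Assumption: the edge is undirected, so record it in both directions.
                adjacency.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
                adjacency.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
            }
        }
        return adjacency;
    }

    public static void main(String[] args) throws IOException {
        // Sample data in the same shape as the extracted file.
        Map<Integer, Set<Integer>> graph = readEdges(new StringReader("0 1\n0 2\n0 3\n"));
        System.out.println(graph);
        // prints: {0=[1, 2, 3], 1=[0], 2=[0], 3=[0]}
    }
}
```

With this layout, graph.get(0) gives all friends of person 0 in one collection, which may be closer to what the question was after than one two-element list per line.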

It looks like you are trying to read the "circles" of friends, where a circle is a list of numbers. If you download the other archive, "facebook.tar", it contains several files with the extension .circles. Here is a snippet from one of those files:

circle0 71  215 54  61  298 229 81  253 193 97  264 29  132 110 163 259 183 334 245 222
circle1 173
circle2 155 99  327 140 116 147 144 150 270
circle3 51  83  237
circle4 125 344 295 257 55  122 223 59  268 280 84  156 258 236 250 239 69
circle5 23
circle6 337 289 93  17  111 52  137 343 192 35  326 310 214 32  115 321 209 312 41  20

These .circles files seem to be in the format you are expecting (a list of lists of numbers).
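A .circles file like the snippet above could be parsed along these lines. This is only a sketch: it assumes the first token on each line is the circle name and the remaining tokens are integer ids separated by any whitespace (the snippet's spacing suggests tabs, but that is a guess). CirclesReader and readCircles are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CirclesReader {

    /** Maps each circle name to its member ids, keeping the file order. */
    static Map<String, List<Integer>> readCircles(Reader source) throws IOException {
        Map<String, List<Integer>> circles = new LinkedHashMap<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] parts = line.split("\\s+");
                List<Integer> members = new ArrayList<>();
                // parts[0] is the circle name; everything after it is a member id.
                for (int i = 1; i < parts.length; i++) {
                    members.add(Integer.parseInt(parts[i]));
                }
                circles.put(parts[0], members);
            }
        }
        return circles;
    }

    public static void main(String[] args) throws IOException {
        // Two lines taken from the snippet above, with tabs assumed as separators.
        String sample = "circle1\t173\ncircle3\t51\t83\t237\n";
        System.out.println(readCircles(new StringReader(sample)));
        // prints: {circle1=[173], circle3=[51, 83, 237]}
    }
}
```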

#2


I think your code is slightly off. I don't usually use Scanner, but maybe you can use .split() instead.

I don't like while (true) loops, so I recommend changing that to this:

String line;
while ((line = br.readLine()) != null) {
    // tokenize line here
}

And remove this:

String line = br.readLine();//split the file into the lines
if (line == null) 
{
    break;//if there are no more lines left
}

Then try using split, something like this:

// split on any run of whitespace (spaces or tabs)
String[] parts = line.trim().split("\\s+");
ArrayList<String> tokens = new ArrayList<String>();
for (String part : parts) {
    tokens.add(part);
}
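Putting the two suggestions together, a complete version might look like the sketch below. It is not the original poster's code: the class name LineTokenizer and the method readTokens are made up, the reader is closed with try-with-resources instead of an explicit close(), and a blank line is assumed to produce an empty inner list, which is what the Scanner-based loop in the question would also produce.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineTokenizer {

    /** Returns one list of whitespace-separated tokens per input line. */
    static List<List<String>> readTokens(Reader source) throws IOException {
        List<List<String>> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                String trimmed = line.trim();
                // split("\\s+") on an empty string would yield one empty token,
                // so blank lines are handled separately.
                List<String> tokens = trimmed.isEmpty()
                        ? new ArrayList<>()
                        : new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
                lines.add(tokens);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // The two sample lines from the question.
        String sample = "52 99 45 61 70 45\n70 80 65 91\n";
        System.out.println(readTokens(new StringReader(sample)));
        // prints: [[52, 99, 45, 61, 70, 45], [70, 80, 65, 91]]
    }
}
```

To read from disk, pass a new FileReader(path) as the source. Note that this behaves the same as the question's code on the extracted facebook_combined.txt: each line there holds two numbers, so each inner list has two tokens.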

#3


Well, you are saying that the .txt file actually looks like

0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8

but you need it to look like

   0 10 20 30 40 50

So I think you would need to read the whole file and then replace the line breaks.
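What "replacing" the line breaks produces depends on the replacement. The sketch below contrasts deleting them with turning them into spaces. The class and method names are hypothetical, and it uses the \R linebreak matcher, which needs Java 8 or later.

```java
public class LineBreakReplace {

    /** Deletes every line break, so adjacent lines are fused together. */
    static String stripLineBreaks(String text) {
        return text.replaceAll("\\R", "");
    }

    /** Replaces every line break with a single space and trims the ends. */
    static String lineBreaksToSpaces(String text) {
        return text.replaceAll("\\R", " ").trim();
    }

    public static void main(String[] args) {
        // Sample data in the same shape as the extracted file.
        String sample = "0 1\n0 2\n0 3\n";
        System.out.println(stripLineBreaks(sample));
        // prints: 0 10 20 3
        System.out.println(lineBreaksToSpaces(sample));
        // prints: 0 1 0 2 0 3
    }
}
```

Deleting the line breaks reproduces the 0 10 20 3... text seen in Notepad, but only by gluing the 1 of one line onto the 0 of the next, so the resulting numbers are not values that exist in the file.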
