How do I read a very large (> 1GB) tar.gz file in Node.js?

Date: 2021-02-13 22:52:04

I have never had to do this before, so this is probably something really basic, but I thought I'd ask anyway.

What is the right way to read a very large file in Node.js? Say the file is just too large to read all at once. Also say the file could come in .zip or .tar.gz format.

First question, is it best to decompress the file first and save it to disk (I'm using Stuffit on the Mac to do this now), and then work with that file? Or can you read the IO stream straight from the compressed .zip or .tar.gz version? I guess you'd need to know the format of the content in the compressed file, so you probably have to decompress (just found out this .tar.gz file is actually a .dat file)...

Then the main issue is, how do I read this large file in Node.js? Say it's a 1GB XML file; where should I look to get started in parsing it? (Not how to parse XML, but if you're reading the large file line by line, how do you parse something like XML, which needs the context of previous lines?)

I have seen fs.createReadStream, but I'm afraid to mess around with it... don't want to explode my computer. Just looking for some pointers in the right direction.

2 Answers

#1 (9 votes)

There is a built-in zlib module for streaming decompression, and the sax package for streaming XML parsing:

var fs = require('fs');
var zlib = require('zlib');
var sax = require('sax');

var saxStream = sax.createStream();
// add your XML handlers here, e.g. saxStream.on('opentag', ...)

// stream the gzipped file from disk, decompress it on the fly, and feed
// the XML into the sax parser without ever buffering the whole file
fs.createReadStream('large.xml.gz').pipe(zlib.createUnzip()).pipe(saxStream);
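
For context, here is a minimal sketch of what those handlers might look like. The event names (opentag, text, closetag, error) are sax's streaming API; the <item> element is a hypothetical record type standing in for whatever the real document contains:

var sax = require('sax');

// strict mode: malformed XML raises an 'error' event instead of being
// silently patched up
var saxStream = sax.createStream(true);

var currentTag = null;

saxStream.on('opentag', function (node) {
    // fires once per opening tag; node.name and node.attributes are available
    currentTag = node.name;
});

saxStream.on('text', function (text) {
    // text can arrive in several chunks, so accumulate it if values are large
    if (currentTag === 'item') {
        process.stdout.write(text);
    }
});

saxStream.on('closetag', function (tagName) {
    currentTag = null;
});

saxStream.on('error', function (err) {
    console.error('parse error:', err.message);
    // clear the error and resume so one bad record does not kill the stream
    this._parser.error = null;
    this._parser.resume();
});

Because everything is event-driven, memory use stays roughly constant no matter how large the file is. One caveat: zlib.createUnzip() only strips the gzip layer, so this pipeline works on a plain .xml.gz file; an actual .tar.gz still has a tar container around the content (see the spawn-based sketch under answer #2).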

#2 (1 vote)

We can also compress the directory with something like the following (note that tar czf produces a .tar.gz, not a .zip):

var spawn = require('child_process').spawn;
var pathToArchive = './very_large_folder.tar.gz';
var pathToFolder = './very_large_folder';

// c = create, z = gzip, f = archive file; tar runs as a separate process,
// so Node itself never holds the folder's contents in memory
var tar = spawn('tar', ['czf', pathToArchive, pathToFolder]);
tar.on('exit', function (code) {
        if (code === 0) {
                console.log('completed successfully');
        } else {
                console.log('error');
        }
});

This worked nicely :)
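
The same spawn approach can also be turned around for the original reading problem. Below is a hedged sketch, not a drop-in solution: it assumes the tar CLI is on the PATH and that you know the member's path inside the archive (data/large.xml here is a placeholder). The -O flag makes tar write the extracted member to stdout, so the decompressed XML streams straight into the parser without being written to disk first:

var spawn = require('child_process').spawn;
var sax = require('sax');

var saxStream = sax.createStream();
// add your XML handlers here, as in answer #1

// x = extract, z = gunzip, O = write member contents to stdout, f = archive;
// the final argument names which member of the archive to extract
var tar = spawn('tar', ['xzOf', './very_large_folder.tar.gz', 'data/large.xml']);
tar.stdout.pipe(saxStream);

tar.on('exit', function (code) {
    if (code !== 0) {
        console.error('tar exited with code ' + code);
    }
});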
