Extract the first three columns from all tsv files in a folder

Time: 2023-01-14 11:13:04

I have several tsv files in a folder which add up to over 50 GB in total. To make it easier on memory when loading these files into R, I want to extract only the first 3 columns of these files.

How can the columns be extracted from all of the files at once in the terminal? I am running Ubuntu 16.04.

5 Answers

#1


5  

Something like the following should work:

#!/bin/bash
# Iterate over the glob directly so filenames containing spaces are handled correctly.
for f in /path/to/*
do
    # Do something for each file. In our case, just print the first three tab-separated fields:
    cut -f1-3 < "$f"
done

(See this webpage for more info on iterating over files in bash.)

The answer by M. Becerra contains a one-liner that achieves the same thing using the find command. My own answer can thus be considered more complicated than necessary, unless you want to do additional processing for each file (e.g., gathering some statistics while iterating over the files).

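As an illustration of that kind of per-file processing, here is a minimal sketch of the same loop that also reports a line count per file and in total. The /path/to/*.tsv glob is a placeholder; the extracted columns still go to standard output:

#!/bin/bash
# Same loop as above, but also gather a simple statistic (line counts) while iterating.
total=0
for f in /path/to/*.tsv
do
    lines=$(wc -l < "$f")                      # per-file statistic
    printf '%s: %s lines\n' "$f" "$lines" >&2  # report on stderr so stdout stays clean
    total=$((total + lines))
    cut -f1-3 < "$f"                           # the extracted columns, as before
done
printf 'total: %s lines\n' "$total" >&2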

EDIT: If you want to overwrite the actual files, you can use something like the following script instead:

#!/bin/bash
# Iterate over the glob directly so filenames containing spaces are handled correctly.
for f in /path/to/*
do
    # Do something for each file. In our case, write the first three fields to a temporary
    # file, then replace the original file with it:
    cut -f1-3 < "$f" > "$f.tmp"
    rm "$f"
    mv "$f.tmp" "$f"
done

The cut line writes its output to the original filename with .tmp appended; the following two lines remove the original file and rename the new file to the original filename.

#2


4  

Do it directly in R--this will save time, disk space, and code:

library(data.table)  # fread() is provided by the data.table package
fread("foo.tsv", sep = "\t", select = c("f1", "f2", "f3"))

#3


4  

This looks like a perfect use case for the cut utility.

You can use it as follows:

cut -d$'\t' -f 1-3 folder/*

Here -d specifies the field delimiter (in this case a tab, which is actually cut's default, so -d could even be omitted), -f specifies the fields to extract, and folder/* is a glob matching all files to be processed.

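Note that cut simply concatenates the extracted columns of every matching file onto standard output. If you want the result in a single file to load into R, you could redirect that stream; in this sketch the narrower folder/*.tsv glob and the output name first3cols.tsv are just example choices:

cut -d$'\t' -f 1-3 folder/*.tsv > first3cols.tsv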

#4


3  

You can do:

find ./ -type f -name "*.tsv" -exec awk -F'\t' -v OFS='\t' '{ print $1,$2,$3 }' {} \;

You can run it from the directory where you have the files, or just add the absolute path instead.

If you want to save the result to a file, you can redirect the output of awk:

find ./ -type f -name "*.tsv" -exec awk -F'\t' -v OFS='\t' '{ print $1,$2,$3 }' {} \; >> someOtherFile
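With 50 GB of input, starting one awk process per file via \; can be slow. If your find supports it (GNU findutils on Ubuntu does), a variant using {} + passes many files to each awk invocation instead; someOtherFile is again just the example output name:

find ./ -type f -name "*.tsv" -exec awk -F'\t' -v OFS='\t' '{ print $1,$2,$3 }' {} + > someOtherFile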

#5


0  

There are a couple ways to do this directly in R, depending on what packages are installed. These methods all keep memory use to a minimum.

With the base (default) package (creates a data.frame):

> df1 = read.table(pipe("cut -f 1-3 *.tsv"), sep="\t", header=FALSE, quote="")

Using the tidyverse/readr package (creates a tibble):

> df2 = read_tsv(pipe("cut -f 1-3 *.tsv"))

Using data.table (creates a data.table, or optionally, a data.frame):

> df3 = fread("cut -f 1-3 *.tsv")

Each of these techniques invokes a Unix shell command and reads its output, which keeps memory use to a minimum. An arbitrary shell pipeline can be used, so other commands can be combined. For example, to get a random sample of 10,000 lines:

> df4 = fread("cut -f 1,3 *.tsv | shuf -n 10000")

Each of these methods has a full array of options for customizing the input process.
