Storing novel features with scikit-learn

I'm working on bioinformatics sequence research that involves extracting various sequence-based features from many sets of sequences (FASTA files). I generate the features for each sequence individually, and I'll be working with tens of thousands of sequences. I'm a novice at programming and handling data.

What would be the BEST way to store and output (i.e., save to a matrix in a CSV file) the generated features?

The names of the features are important to me, so I'll want to keep them, and I need the order in which they are output to be consistent for each separate sequence.

I planned to store the features (per sequence) in a dictionary, since I understand that scikit-learn's DictVectorizer might work for this. However, a dict is unordered, and I'll be extracting the features for each sequence individually, writing them out, then extracting from the next sequence. Will it keep the same order when writing out? (All the features are numerical and continuous, but many may have a value of 0, and some would have a vector as output, e.g. 400 frequency counts of overlapping bigrams.)

Thanks!

(I'm mainly concerned with the I/O and with not getting the output of the features mixed up.)

1 Solution

#1

So the best way to do this would be either pandas or dill. I think pandas would be your better option but you can use both.

import pandas as pd
import dill

If you're trying to store matrices with labels, you're talking about DataFrames. You can easily make a DataFrame in pandas from an array or dictionary with DF = pd.DataFrame(array_object). Set the column names with DF.columns = ["label1", "label2", ...] and the index values the same way with DF.index = ["A", "B", ...]. Then store it as a csv/tsv with DF.to_csv("filename.tsv", sep="\t") (I usually use tab-separated because commas are weird). That writes a spreadsheet format that you can easily retrieve later with pd.read_table("filename.tsv", sep="\t", index_col=0); without index_col=0 the row labels come back as an ordinary first column.

Another way is dill, which serializes a Python object to a file that you can load later (I do this for more complicated objects, but it works for DataFrames, matrices, or whatever you want). If you have that DataFrame from the matrix/array, you can do dill.dump(DF, open("DF.obj", "wb")) to store the object and then retrieve it later with dill.load(open("DF.obj", "rb")).
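As a rough sketch of the pandas route described above: the feature names, sequence IDs, and values below are made up for illustration, and an in-memory buffer stands in for "filename.tsv". It assumes each sequence yields a dict of named numeric features, that an absent feature should count as 0, and that sorting the column names is an acceptable way to fix the order.

```python
import io

import pandas as pd

# Hypothetical per-sequence feature dicts (names and values are invented).
# The second dict lists its keys in a different order and lacks one feature.
sequence_ids = ["seq1", "seq2"]
records = [
    {"length": 120, "gc_content": 0.42, "bigram_AA": 3},
    {"gc_content": 0.55, "length": 98},
]

# One row per sequence, one labelled column per feature name.
df = pd.DataFrame(records, index=sequence_ids)

# Pin the column order explicitly and treat absent features as 0.
df = df.reindex(columns=sorted(df.columns)).fillna(0)

# Write tab-separated text; a StringIO buffer stands in for a real file path.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t")

# Read it back, using the first column as the row labels again.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored)
```

Because values are matched to columns by feature name rather than by position, the order of keys inside each dict does not affect the output.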

I would go with the pandas.DataFrame.to_csv() method, since you can open the result in Excel (biologists love Excel). If they are just matrices, then using dill to store the objects is overkill. It's a good tool to have in your repertoire if you're making stuff for yourself, but passing stored dill objects along to other people can get nasty.
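If the features are written out one sequence at a time, as the question describes, one possible alternative to building a whole DataFrame first is the standard library's csv.DictWriter. This is not what the answer above proposes, only a sketch under assumptions: the feature list FEATURE_NAMES and the helper write_feature_rows are hypothetical, and the full set of feature names is assumed to be known before writing starts.

```python
import csv
import io

# Hypothetical, fixed list of feature names; this list alone decides column order.
FEATURE_NAMES = ["length", "gc_content", "bigram_AA", "bigram_AC"]


def write_feature_rows(handle, rows):
    """Write (sequence_id, feature_dict) pairs as tab-separated rows.

    Features missing from a dict are written as 0 (restval); a feature name
    that is not in FEATURE_NAMES raises ValueError, DictWriter's default.
    """
    writer = csv.DictWriter(
        handle,
        fieldnames=["sequence_id"] + FEATURE_NAMES,
        restval=0,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for sequence_id, features in rows:
        writer.writerow({"sequence_id": sequence_id, **features})


# Usage with made-up values; a StringIO buffer stands in for an open file.
out = io.StringIO()
write_feature_rows(
    out,
    [
        ("seq1", {"gc_content": 0.5, "length": 10}),
        ("seq2", {"bigram_AA": 2}),
    ],
)
print(out.getvalue())
```

Each row is laid out according to fieldnames, so the order in which a dict happens to store its keys never reaches the file.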
