Where can I download gene expression data?

Time: 2022-08-11 16:59:25

I want to download gene expression data generated by microarray experiments. I do not know much about this subject, but as I understand it, rows usually correspond to genes and columns correspond to samples. Ideally, I expect a matrix of gene expression values.

I have been searching the internet, and although there seem to be many places to download such data, what I actually get when I download it is not a gene expression matrix. Could someone please tell me where, or how, to download gene expression data in the format described above?

Any help is appreciated.

2 Answers

#1


6  

If you look at, e.g., this entry in the Gene Expression Omnibus (GEO), one of the file formats offered is "TXT". After some metadata, it contains a matrix like the one you are asking for.
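
To illustrate the "matrix after some metadata" layout, here is a minimal Python sketch of reading such a TXT file. It assumes the GEO series matrix convention as I understand it: metadata lines start with `!`, and the tab-separated table sits between `!series_matrix_table_begin` and `!series_matrix_table_end`. The function name `parse_series_matrix`, the sample IDs and all values are made up for illustration. A real file would be read from disk rather than from a string.

```python
import csv
import io

# Toy file content; sample IDs, the second probeset ID and all values are invented.
SAMPLE = (
    '!Series_title\t"Toy example"\n'
    '!Sample_source_name_ch1\t"liver"\t"kidney"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM0001"\t"GSM0002"\n'
    '"9890_at"\t3.0\t4.75\n'
    '"9891_at"\t5.0\t5.25\n'
    '!series_matrix_table_end\n'
)


def parse_series_matrix(text):
    """Split series-matrix-style text into metadata, sample IDs and rows."""
    metadata = {}   # metadata key -> list of value lists (keys can repeat)
    samples = None  # column headers of the table, i.e. sample IDs
    rows = {}       # probeset ID -> list of floats, one per sample
    in_table = False
    for raw in io.StringIO(text):
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line == "!series_matrix_table_begin":
            in_table = True
            continue
        if line == "!series_matrix_table_end":
            in_table = False
            continue
        # csv handles the tab separation and strips the double quotes.
        fields = next(csv.reader([line], delimiter="\t"))
        if in_table:
            if samples is None:
                samples = fields[1:]  # first cell is the "ID_REF" label
            else:
                rows[fields[0]] = [
                    float(v) if v else float("nan") for v in fields[1:]
                ]
        elif line.startswith("!"):
            metadata.setdefault(fields[0].lstrip("!"), []).append(fields[1:])
    return metadata, samples, rows


metadata, samples, rows = parse_series_matrix(SAMPLE)
print(samples)
for probe_id, values in rows.items():
    print(probe_id, values)
```

The rows here are keyed by probeset ID, not by gene, so a mapping step is still needed before the table can be read as "genes by samples".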

#2


5  

In principle, microarray data can be expressed (please pardon the pun) as a matrix with genes as rows and samples as columns. In practice, it is a good bit more complicated to derive such a representation from the raw data of an experiment. If you just take a pre-processed dataset, you have little guarantee that the raw data were processed in a way that makes them comparable to other experiments, or that the underlying raw data were of sufficiently high quality.

You will also need high-quality metadata to derive any meaning from the data matrix. What were the biological conditions and sources from which the samples were derived? Which genes do the probes on the particular array correspond to? (Note that 9890_at is a "probeset ID", a unique identifier for a molecular probe of a particular sequence design, which then needs to be mapped to a gene. Different probes for the same gene won't give exactly the same response.)
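
As a rough sketch of that mapping step, the Python snippet below collapses probeset rows into one row per gene by averaging the probesets annotated to the same gene. Averaging is only one possible summarisation (others keep the most variable or the highest-intensity probeset), and it is chosen here purely for simplicity. The gene name `GENE_A`, the extra probeset IDs and all values are hypothetical; real annotation comes from the platform description of the specific array.

```python
from collections import defaultdict
from statistics import mean


def collapse_to_genes(probe_rows, probe_to_gene):
    """Average probeset rows that map to the same gene.

    probe_rows:    probeset ID -> list of values, one per sample
    probe_to_gene: probeset ID -> gene symbol (probesets without an
                   entry are dropped)
    """
    grouped = defaultdict(list)
    for probe_id, values in probe_rows.items():
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            continue  # unannotated probeset, skip it
        grouped[gene].append(values)
    # zip(*rows) walks the rows sample by sample (column-wise).
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Hypothetical data: two probesets for one gene, one unannotated probeset.
probe_rows = {
    "9890_at": [3.0, 4.75],
    "9891_at": [5.0, 5.25],
    "9999_at": [1.0, 1.0],
}
probe_to_gene = {"9890_at": "GENE_A", "9891_at": "GENE_A"}

gene_rows = collapse_to_genes(probe_rows, probe_to_gene)
print(gene_rows)
```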

The public microarray databases therefore provide a lot of information in addition to a processed data matrix. Besides GEO, which has already been mentioned, I would recommend ArrayExpress, which in my opinion has the better search interface.

For many, the tool of choice for working with microarray data is Bioconductor, a suite of software for the statistical programming language R.

Bioconductor provides APIs to download raw data with accompanying metadata from both repositories; see the GEO bioc package and the ArrayExpress bioc package.

Both packages, in common with most Bioconductor software, come with excellent "vignettes" that introduce the software: the GEO bioc vignette and the ArrayExpress bioc vignette.

Those vignettes should also give you examples of deriving "Esets" (expression sets) from the raw data. At that point you can access the gene expression matrix in the Bioconductor Eset object, and you have an object and APIs to interrogate the necessary metadata.
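
To make the idea of "a matrix plus the metadata needed to interpret it" concrete, here is a toy Python container loosely modelled on that idea. It is not the Bioconductor API (in R you would use accessors such as `exprs()` and `pData()` on the real object). The class `ToyExpressionSet`, its fields and methods, and all the data are hypothetical. The point is only that the expression values and the sample annotations are kept together and checked for consistency.

```python
from dataclasses import dataclass


@dataclass
class ToyExpressionSet:
    exprs: dict    # probeset ID -> list of values, one per sample
    samples: list  # sample IDs, in the same order as the value lists
    pheno: dict    # sample ID -> dict of sample annotations

    def __post_init__(self):
        # Every row must have one value per sample.
        for probe_id, values in self.exprs.items():
            if len(values) != len(self.samples):
                raise ValueError(f"row {probe_id} does not match the samples")
        # Every sample must have annotations.
        missing = [s for s in self.samples if s not in self.pheno]
        if missing:
            raise ValueError(f"samples without annotations: {missing}")

    def column(self, sample_id):
        """Return the expression values of one sample, keyed by probeset ID."""
        idx = self.samples.index(sample_id)
        return {probe_id: values[idx] for probe_id, values in self.exprs.items()}

    def samples_where(self, key, value):
        """Return the sample IDs whose annotation `key` equals `value`."""
        return [s for s in self.samples if self.pheno[s].get(key) == value]


# Invented example data.
eset = ToyExpressionSet(
    exprs={"9890_at": [3.0, 4.75]},
    samples=["GSM0001", "GSM0002"],
    pheno={"GSM0001": {"tissue": "liver"}, "GSM0002": {"tissue": "kidney"}},
)
print(eset.samples_where("tissue", "liver"))
print(eset.column("GSM0002"))
```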

Note that there are different types of microarray. I would recommend starting with data from Affymetrix arrays, as they probably have the most straightforward analysis APIs.
