Parsing SEC Edgar XML files with Ruby and Nokogiri

I'm having problems parsing the SEC Edgar files.

Here is an example of such a file.

In the end, I want the content between <XML> and </XML> in a format I can access.

Here is my code so far, which doesn't work:

scud = open("http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt")
full = scud.read
full.match(/<XML>(.*)<\/XML>/)

3 solutions

#1

OK, there are a couple of things wrong:

sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt is NOT XML, so Nokogiri will be of no use to you unless you strip all the garbage from the top of the file, down to where the true XML starts, and then trim off the trailing tags so that what remains is well-formed XML. You need to attack that problem first.
You don't say what you want from the file. Without that information we can't recommend a real solution. You need to take more time to define the question better.

Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:

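require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
raw = URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt').read

# Drop everything up to and including the opening <XML> line, then
# everything from the closing </XML> tag onward.
doc = Nokogiri::XML(raw.gsub(/\A.+<xml>\n/im, '').gsub(/<\/xml>.+/mi, ''))
puts doc.at('//schemaVersion').text
# >> X0603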
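
As a side note, the `match` call in the question fails for a simpler reason: in Ruby, `.` does not match a newline unless the regex carries the `m` flag, and the payload spans many lines. Below is a minimal sketch of a non-greedy, case-insensitive extraction. It runs against a made-up inline sample rather than the real filing, and both `SAMPLE` and the `extract_xml` helper are hypothetical names used only for illustration.

```ruby
# Hypothetical stand-in for the downloaded filing: header junk, an <XML>
# wrapper, the real XML payload, then trailing SGML-style tags.
SAMPLE = <<~FILING
  <SEC-DOCUMENT>0001475481-09-000001.txt
  <DOCUMENT>
  <TYPE>D
  <TEXT>
  <XML>
  <?xml version="1.0"?>
  <edgarSubmission>
    <schemaVersion>X0603</schemaVersion>
  </edgarSubmission>
  </XML>
  </TEXT>
  </DOCUMENT>
FILING

# Returns the text between <XML> and </XML>, or nil when no wrapper exists.
# /m lets "." cross newlines, /i ignores tag case, and ".*?" stops at the
# first closing tag instead of the last one.
def extract_xml(text)
  match = text.match(%r{<XML>\s*(.*?)\s*</XML>}mi)
  match && match[1]
end

xml = extract_xml(SAMPLE)
puts xml
```

The returned string could then be handed to `Nokogiri::XML` in place of the two `gsub` calls.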

#2

I recommend practicing in IRB and reading the docs for Nokogiri:

> require 'nokogiri'
=> true
> require 'open-uri'
=> true
> doc = Nokogiri::HTML(open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt'))
> doc.xpath('//firstname')
=> [#<Nokogiri::XML::Element:0x80c18290 name="firstname" children=[#<Nokogiri::XML::Text:0x80c18010 "Joshua">]>, #<Nokogiri::XML::Element:0x80c14d48 name="firstname" children=[#<Nokogiri::XML::Text:0x80c14ac8 "Patrick">]>, #<Nokogiri::XML::Element:0x80c11fd0 name="firstname" children=[#<Nokogiri::XML::Text:0x80c11d50 "Brian">]>]

That should get you going.

#3

Given that this was asked a year ago, the answer is probably OBE (overtaken by events), but what the asker should do is look at all of the documents on the site and notice that the actual filing details can be found at:

http://sec.gov/Archives/edgar/data/1475481/000147548109000001/0001475481-09-000001-index.htm

Within this, you will see that the XML document you are after is already parsed out, ready for further manipulation, at:

http://sec.gov/Archives/edgar/data/1475481/000147548109000001/primary_doc.xml

Be warned, however, the actual file name at the end is determined by the submitter of the document, not by the SEC. Therefore, you cannot depend on the document always being 'primary_doc.xml'.
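
Since the file name varies, one option is to start from the index page and look for the `.xml` link there. As a rough sketch, the index URL appears to follow from the CIK and the accession number, judging only by the two URLs quoted in this answer. That layout is an assumption rather than documented SEC behavior, and `edgar_index_url` is a hypothetical helper.

```ruby
# Hypothetical helper: builds the filing index URL from a CIK and an
# accession number. The layout is inferred from the URLs quoted above
# (folder = accession number without dashes), so treat it as an assumption.
def edgar_index_url(cik, accession)
  folder = accession.delete('-')
  "http://sec.gov/Archives/edgar/data/#{cik}/#{folder}/#{accession}-index.htm"
end

puts edgar_index_url('1475481', '0001475481-09-000001')
```

For the filing discussed here, this reproduces the index URL given above.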
