How to ensure a file is an XML file

I don't know much about files and their related security. I have a lot of data in XML files, which I am planning to parse and put into a database. I get these XML files from third parties, at a minimum of around 1000 files per day, so I will write a script to parse them and enter the data into our database. I have several questions about this.

1. I know how to parse a single file, and I can extend that logic to multiple files in a single loop. But is there a better way? How can I use multi-threaded programming to parse many files simultaneously? There will be a script which, given a file, parses that single file and writes the output to the database. How can I use this script to parse in multiple threads or with parallel processing?
2. As I said, the files come from a third-party site, so how can I be sure that there are no security loopholes? I don't know much about file security. What are the minimum, common, basic security checks I need to perform (in the way that SQL injection and XSS checks are very basic in web programming)?
3. Again security related: how can I ensure that an incoming XML file really is XML? I can check the extension, but is it possible to inject scripts that run when I parse these files? And what steps should I take while parsing individual files?

2 Answers

#1

You want to validate the XML. This does two things:

Make sure it is "well-formed": a syntactically correct XML document.
Make sure it is "valid": it follows a schema, DTD, or other definition, so it has the elements you expect to parse.

In PHP 5, the syntax for validating XML documents is:

$dom = new DOMDocument();
$dom->load('articles.xml');             // hypothetical input file
$dom->validate();                       // takes no argument; uses the DTD named in the document's DOCTYPE
$dom->relaxNGValidate('articles.rng');  // RELAX NG schema
$dom->schemaValidate('articles.xsd');   // XML Schema (XSD)

Of course you need an XSD (XML Schema) or DTD (Document Type Definition) to validate against.
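
As a minimal sketch of the schema check, the function below validates a string against an inline XSD with DOMDocument::schemaValidateSource(). The <articles> format, the ARTICLES_XSD constant, and the isValidArticlesXml() name are assumptions made up for illustration; a real feed would need its own schema.

```php
<?php
// Hypothetical schema: an <articles> root holding one or more <article> strings.
const ARTICLES_XSD = <<<'XSD'
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="articles">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="article" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
XSD;

// Returns true only if $xml is well-formed AND matches the schema above.
function isValidArticlesXml(string $xml): bool
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The empty-string guard avoids the error loadXML() raises on empty input.
    $ok = $xml !== ''
        && $dom->loadXML($xml)
        && $dom->schemaValidateSource(ARTICLES_XSD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}
```

A file that fails this check can be rejected before any of its content reaches the database code.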

#2

I can't speak to point 1, but it sounds fairly simple - each file can be parsed completely independently.
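
Because the files are independent, one possible approach is to run the existing single-file script in several worker processes at once. The sketch below uses separate processes rather than threads, and assumes PHP 7.4 or later (for array commands in proc_open()). The runInParallel() name is hypothetical.

```php
<?php
// Runs each command as its own process, at most $maxWorkers at a time.
// Returns an array of exit codes keyed like $commands (-1 if a process failed to start).
function runInParallel(array $commands, int $maxWorkers = 4): array
{
    $pending = $commands;
    $running = [];
    $exitCodes = [];

    while ($pending || $running) {
        // Start new workers while there is spare capacity.
        while ($pending && count($running) < $maxWorkers) {
            $key = array_key_first($pending);
            $command = $pending[$key];
            unset($pending[$key]);
            $process = proc_open($command, [], $pipes);
            if ($process === false) {
                $exitCodes[$key] = -1;
                continue;
            }
            $running[$key] = $process;
        }
        // Collect workers that have finished.
        foreach ($running as $key => $process) {
            $status = proc_get_status($process);
            if (!$status['running']) {
                $exitCodes[$key] = $status['exitcode'];
                proc_close($process);
                unset($running[$key]);
            }
        }
        usleep(10000); // brief pause so the loop does not spin
    }
    return $exitCodes;
}
```

For example, building one command per file such as [PHP_BINARY, 'parse_one.php', $file] for every path returned by glob('incoming/*.xml') would process the day's batch four files at a time; parse_one.php and the incoming/ directory are placeholders for the real script and location.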

Points 2 and 3 are effectively about the contents of the file. Simply put, you can check that it's valid XML by parsing it and asking the parser to validate as it goes, and that's all you need to do. If you're expecting it to follow a particular DTD, you can validate it against that. (There are multiple levels of validation, depending on what your data is.)
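
The "check it by parsing it" step can be sketched as follows: parse the input and collect whatever problems libxml reports, so a bad file can be logged and skipped. The xmlParseProblems() name is hypothetical.

```php
<?php
// Returns the problems libxml reported while parsing $xml.
// An empty list means the parser reported nothing, i.e. the input parsed cleanly.
function xmlParseProblems(string $xml): array
{
    if ($xml === '') {
        return ['empty input'];
    }
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadXML($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}
```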

XML files are just data, in and of themselves. While "processing instructions" are available in XML, they are not instructions in the same sense as bits of script to be executed directly, and there should be no harm in just parsing the file. Two potential things a malicious file could do:

Try to launch a denial-of-service attack by referring to a huge external DTD, which will make the parser use large amounts of bandwidth. You can probably disable external DTD resolution if you want to guard against this.
Try to take up significant resources just by being very large. You could always limit the maximum file size your script will handle.
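
Both guards can be combined in a small loader. This is a sketch under stated assumptions: the 5 MB limit, the MAX_XML_BYTES constant, and the loadUntrustedXml() name are hypothetical and should be adapted to the real feed.

```php
<?php
// Hypothetical limit; pick a value that fits the real feed.
const MAX_XML_BYTES = 5 * 1024 * 1024;

// Loads a file defensively; returns a DOMDocument, or null if the file is rejected.
function loadUntrustedXml(string $path, int $maxBytes = MAX_XML_BYTES): ?DOMDocument
{
    $size = @filesize($path);
    if ($size === false || $size === 0 || $size > $maxBytes) {
        return null; // missing, empty, or too large
    }

    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // LIBXML_NONET disables network access while the document is loaded.
    $ok = $dom->load($path, LIBXML_NONET);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $dom : null;
}
```

Checking the size before parsing keeps an oversized file from ever reaching the parser.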