What characters should I escape/sanitize for file names?

Time: 2022-09-20 10:51:06

I need to sanitize some data which will be used in file names. Some of the data contains spaces and ampersand characters. Is there a function which will escape or sanitize data suitable for using in a file name (or path)? I couldn't find one in the 'Filesystem Function' section of the PHP manual.

So, assuming I have to write my own function, which characters do I need to escape (or change)?

7 solutions

#1


If you have the opportunity to store the original name in a database I would simply create a file with a random hash (mt_rand()/md5/sha1). The benefit would be that you don't rely on the underlying OS (characters/path length), the value or the length of the user input and additionally it is really hard to guess/forge a file name. Maybe even a base64 encoding is an option.
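
A minimal sketch of that idea, assuming PHP 7 or later. The function name `randomStorageName` is made up for illustration, and it uses `random_bytes()` rather than the `mt_rand()`/md5 combination mentioned above, because `mt_rand()` is not designed to produce unguessable values. Storing the mapping back to the original name is left out.

```php
<?php
// Hypothetical helper: build a storage name that does not depend on user input.
function randomStorageName(int $bytes = 16): string
{
    // random_bytes() reads from the system CSPRNG; bin2hex() yields only [0-9a-f],
    // so the result is valid on any common filesystem.
    return bin2hex(random_bytes($bytes));
}

$storedName = randomStorageName(); // 32 hex characters for the default of 16 bytes
// Save $storedName next to the original file name in the database (not shown).
```

The name has a fixed length and a fixed alphabet, so neither the OS nor the length of the user's input matters.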

#2


For Windows:

/ \ : * ? " < > |

For Unix, technically only / and the NUL byte are forbidden, but in practice the same list as Windows would be sensible.

There's nothing wrong with spaces or ampersands as long as you're prepared to use quotes on command lines when you're manipulating the files.

(BTW, I got that list by trying to rename a file on Windows to something including a colon, and copying from the error message.)
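
A rough sketch of a check against that list. `WINDOWS_FORBIDDEN` and `isWindowsSafeName` are names invented here, and rejecting the empty string is an extra assumption.

```php
<?php
// The nine characters from the list above: / \ : * ? " < > |
const WINDOWS_FORBIDDEN = '/\\:*?"<>|';

// Hypothetical helper: true when $name contains none of the listed characters.
function isWindowsSafeName(string $name): bool
{
    // strpbrk() returns false when no character from the list occurs in $name.
    return $name !== '' && strpbrk($name, WINDOWS_FORBIDDEN) === false;
}
```

With this check, a name such as `M & M.txt` passes, since spaces and ampersands are not on the list, while `a:b.txt` is rejected.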

#3


Instead of filtering out characters why not just allow [a-z0-9- !@#$%^()]? It is certainly easier than trying to guess every character that could potentially cause problems.

Your users shouldn't need a file with any other characters anyways, right?
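
One way the allow-list could look as a validation step. This is a sketch, not a complete validator: `isAllowedName` is a hypothetical name, and the case-insensitive flag is an assumption on top of the character class given above.

```php
<?php
// Hypothetical helper: accept a name only if every character is on the allow-list.
function isAllowedName(string $name): bool
{
    // \A and \z anchor the whole string; the hyphen is escaped so it is read literally.
    // The "i" flag (also accept upper case) is an assumption, not part of the answer.
    return preg_match('/\A[a-z0-9\- !@#$%^()]+\z/i', $name) === 1;
}
```

Note that the class as written contains no dot, so a name like `notes.txt` is rejected unless `.` is added to it.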

#4


It might be a good idea to remove everything outside [a-z0-9_\-.]. It's not necessary to be this strict, but it's comfortable to have a directory listing without any surprises. If you're working with non-ASCII character sets, you may want to convert the text to plain ASCII before removing the offending characters (or you might end up deleting everything) ...

At least that's how I do it :-)
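
Roughly, that could look like the sketch below. `stripToPlainName` is a made-up name, lowercasing before the filter is an assumption (otherwise upper-case letters would be dropped), and the `iconv` transliteration step is best effort only, since its output depends on the platform's iconv library and locale.

```php
<?php
// Hypothetical helper: reduce a name to [a-z0-9_.-], transliterating first where possible.
function stripToPlainName(string $name): string
{
    // Best-effort conversion to ASCII; fall back to the raw input if iconv is
    // missing or refuses the conversion.
    $ascii = function_exists('iconv')
        ? @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $name)
        : false;
    if ($ascii === false) {
        $ascii = $name;
    }

    // Assumption: fold to lower case so upper-case letters survive the filter.
    $ascii = strtolower($ascii);

    // Drop every character outside the allow-list.
    return preg_replace('/[^a-z0-9_\-.]/', '', $ascii) ?? '';
}
```

The result can still be empty, or just dots, so the caller needs a fallback name for that case.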

#5


When sanitizing strings for filenames, we filter out all characters below 0x20, as well as <, >, :, ", /, \, |, ?, and *
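
As a sketch, that filter fits in one `preg_replace()` call. `filterReservedChars` is a hypothetical name, and removing the characters (rather than substituting something like an underscore) is just one reading of "filter out".

```php
<?php
// Hypothetical helper: remove control characters (below 0x20) and < > : " / \ | ? *
function filterReservedChars(string $name): string
{
    // \x00-\x1F covers everything below 0x20; the four backslashes in the PHP
    // string reach the regex engine as \\, which matches one literal backslash.
    return preg_replace('/[\x00-\x1F<>:"\/\\\\|?*]/', '', $name) ?? '';
}
```

Spaces and ampersands pass through unchanged, since neither is on the list.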

#6


For Windows, add "&" to the list, if you don't want -any- side-effects. This is the character which says "the next character is my hotkey" in some displays of data. (Most common in old Windows, but still pops up here and there.) So instead of "M & M" you'd see "M _M" ... the character following the ampersand (a space) is a "hotkey", and thus underlined.

#7


Implementation of @merkuro's answer:

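// $id and $filename must be passed in; the original signature omitted them.
function getSafeFilesystemFileName($id, $filename) {
    return (
        // Hash of record ID and original name: fixed length, hex characters only.
        md5($id . '-' . $filename) .
        '.' . pathinfo($filename, PATHINFO_EXTENSION)
    );
}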

Where:

  • $id is the record ID from the database
  • $filename is the original upload's filename (also stored in the record)

One important thing: append the original extension onto the generated file. If you ever need to give the file to a tool that cares about the extension, it will be much easier to have it available than to have to create a temporary file with the extension.
