What is the character least commonly used by web users?

Date: 2022-07-04 03:05:11

I need a character that I can use as a delimiter.

Does anyone know of statistics on this?

7 solutions

#1


Pick any character, then pick a mechanism to escape that character to handle the case where the user wants to type it. For example, in comma-delimited files the comma is the separator:

1,2,fred,john

If the data itself contains a comma, you quote that field:

1,2,"Bloggs, Fred",john

And if you need to use a quote:

1,2,"Bloggs, Fred","Jean-Luc \"Make it so\" Picard"
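If Python happens to be available, its standard `csv` module already implements this kind of quoting, so there is little reason to hand-roll it. A minimal sketch follows. Note that the module's default dialect escapes an embedded quote by doubling it (`""`) rather than with the backslash shown above, and that the helper names `to_csv_line` and `from_csv_line` are made up for this illustration.

```python
import csv
import io


def to_csv_line(fields):
    """Join a list of strings into one CSV line, quoting where needed."""
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue()


def from_csv_line(line):
    """Split one CSV line back into its list of strings."""
    return next(csv.reader(io.StringIO(line)))


row = ["1", "2", "Bloggs, Fred", 'Jean-Luc "Make it so" Picard']
line = to_csv_line(row)

# Fields containing the delimiter or a quote come back unchanged.
assert from_csv_line(line) == row
```

The point is the round trip: the comma can stay the delimiter because the writer and reader agree on how to protect it.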

#2


I don't think it matters what character you use; you shouldn't just hope that no one will type your delimiter. Use a comma and handle users adding their own commas.

#3


You could prefix whatever data you have with its length. That's how HTTP chunked encoding sends things across the web.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html
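To make the length-prefix idea concrete, here is a rough Python sketch. It is not HTTP chunked encoding itself (which writes each chunk size in hexadecimal followed by CRLF); the layout used here, a decimal byte count, a colon, then the raw data, is an assumption borrowed from the way bencoded strings look. The function names `encode_items` and `decode_items` are invented for the example.

```python
def encode_items(items):
    """Concatenate strings as <byte length>:<utf-8 data> with no delimiter."""
    out = bytearray()
    for item in items:
        data = item.encode("utf-8")
        out += str(len(data)).encode("ascii") + b":" + data
    return bytes(out)


def decode_items(blob):
    """Reverse encode_items; raises ValueError on malformed input."""
    items = []
    pos = 0
    while pos < len(blob):
        colon = blob.index(b":", pos)           # end of the length field
        length = int(blob[pos:colon].decode("ascii"))
        if length < 0:
            raise ValueError("negative length")
        start = colon + 1
        end = start + length                    # one past the last data byte
        if end > len(blob):
            raise ValueError("truncated item")
        items.append(blob[start:end].decode("utf-8"))
        pos = end
    return items


original = ["a,b", "", "4:spam", "naïve"]
assert decode_items(encode_items(original)) == original
```

Because the decoder never looks inside the data, items may contain commas, colons, or anything else, and no character has to be reserved.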

#4


It sounds like you're trying to convert a list of strings into a single string in such a way that you can later turn it back into a list of strings.

There are several traditional approaches to this, most of them already mentioned in this thread:

  • Use an unusual character as a delimiter, and simply don't allow it in your input, either by rejecting input containing the delimiter or by replacing the delimiter with "?" or "." or similar. For this, I agree with the person who suggested the vertical bar (|).
    • Advantage: dirt simple to code, in a wide variety of languages.
    • Disadvantage: you lose some expressiveness and room for future expansion by eliminating the possibility of input containing your delimiter.

  • Use a delimiter, and an escape mechanism for when the delimiter appears in input. There are actually a few variants of this:
    • The "just like C code" method, where you prepend an escape character to every occurrence in your data of your delimiter or your escape character. For example, the string «Greetings,Hey,Hello\,World,Hello \\ Backslash» contains four elements, using , as the delimiter and \ as the escape character. (The last element originally contains a single backslash.)
      • This is actually a royal pain to code and implement correctly in many languages.
      • Even once you do implement it, it's generally much slower than the other methods.

    • The "like URL parameters" method, where your escape mechanism converts your delimiter into a multi-character sequence that does not contain your delimiter. You then also need to convert the first character of that sequence into its own multi-character sequence. For example, if you use , as your delimiter and represent , as «\1» and \ as «\2», you could write the last example as: «Greetings,Hey,Hello\1World,Hello \2 Backslash»
      • This is usually not too hard to implement. The advantage is that you can do the "splitting" and "unescaping" parts of going from string to list of strings in separate steps. The unescaping might be a tiny bit tricky, since you have to do it as a scan of each string.

    • Like CSV files, with quotes around items that contain your delimiter, and the quotes escaped according to some obscure mechanism (such as by doubling).
      • Avoid this unless you can just throw it at a pre-existing library.
      • This has all the disadvantages of the "just like C code" method, plus extra confusing state to screw up when implementing it.

    • One of the above methods, but with a multi-character delimiter. This is harder than you'd think; the extra characters significantly complicate the logic of what exactly should be escaped.

  • Prefix each item with its length, then include the item unchanged.
    • This is used by HTTP in its "chunked" encoding, by bencoding (the wire format BitTorrent uses), and by Google's protocol buffers.
    • Implementing this can be a tiny bit tricky, and is very prone to off-by-one errors. I still think it's easier to implement than the "just like C code" method, especially in low-level languages.
    • Once you do implement it correctly, it's generally much faster than the other schemes, even the lossy scheme that just forbids input containing the delimiter. (The exception is if you're working in a high-level language that has a built-in "split" routine.)
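Going back to the "like URL parameters" variant in the list above, here is one way it might look in Python, using the same mapping as the example (, becomes «\1» and \ becomes «\2»). This is only a sketch: the function names are invented, and the regular-expression scan is one possible way to do the unescaping step, not the only one.

```python
import re

DELIM = ","
ESC = "\\"


def escape_item(item):
    # Escape the escape character first, so the backslashes introduced
    # for delimiters are not escaped a second time.
    return item.replace(ESC, ESC + "2").replace(DELIM, ESC + "1")


def unescape_item(item):
    # Single left-to-right scan: every backslash starts a two-character code.
    return re.sub(
        r"\\([12])",
        lambda m: DELIM if m.group(1) == "1" else ESC,
        item,
    )


def join_items(items):
    return DELIM.join(escape_item(i) for i in items)


def split_items(text):
    # Splitting is safe because escaped items never contain the delimiter.
    return [unescape_item(part) for part in text.split(DELIM)]


items = ["Greetings", "Hey", "Hello,World", "Hello \\ Backslash"]
encoded = join_items(items)
assert encoded == "Greetings,Hey,Hello\\1World,Hello \\2 Backslash"
assert split_items(encoded) == items
```

One wrinkle this sketch does not solve: an empty list and a list holding a single empty string both encode to the empty string, so the caller has to decide which of the two that means.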

#5


What about using a string of characters as the delimiter?

#6


In such cases, I like to use the vertical bar (|) character.

  • It's easy to spot when looking at a text file.
  • It clearly marks a separation.
  • It's rarely used.
  • And, since it has no intrinsic meaning in English grammar, it is easy to either just disallow it or blindly change it to something else (like a dash) if it appears in the column text.
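The two options in the last point, disallowing the bar or blindly swapping it for a dash, could be sketched like this in Python. The function names and the choice of "-" as the replacement are assumptions made for the example.

```python
DELIM = "|"


def join_strict(fields):
    """Reject any field that contains the delimiter."""
    for field in fields:
        if DELIM in field:
            raise ValueError(f"field contains {DELIM!r}: {field!r}")
    return DELIM.join(fields)


def join_lossy(fields, replacement="-"):
    """Replace the delimiter inside fields; the original text is not recoverable."""
    return DELIM.join(field.replace(DELIM, replacement) for field in fields)


line = join_lossy(["Smith | Sons", "42"])
assert line == "Smith - Sons|42"
assert line.split(DELIM) == ["Smith - Sons", "42"]
```

Both are lossy in their own way: the strict version refuses some input, and the replacing version silently changes it.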

#7


I'm sure there are tons of strange Unicode characters that don't get used much, but that's probably not what you're looking for.

Why do you want something "rare" for a delimiter? How will it be used?
