Split a string into chunks of a maximum character count without breaking words

Date: 2022-01-31 02:49:56

I want to split a string into chunks, each no longer than a maximum character count (say 2000), without splitting any word.

I have tried doing as below:

text.chars.each_slice(2000).map(&:join)

but this sometimes splits words. I have also tried a regex:

text.scan(/.{1,2000}\b|.{1,2000}/).map(&:strip)

taken from another question, but I don't quite understand how it works, and it behaves erratically, sometimes producing chunks that contain only periods.
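The period-only chunks can be reproduced on a small input. The sketch below is only an illustration: the sample string is made up and a limit of 7 stands in for 2000. \b matches at any boundary between a word character and a non-word character, so a chunk may end right before a period, and the period then starts the next chunk on its own.

```ruby
sample = "aaa bbb. cccccc dd"   # made-up input; no word is longer than the limit

chunks = sample.scan(/.{1,7}\b|.{1,7}/).map(&:strip)
# → ["aaa bbb", ".", "cccccc", "dd"]
# "aaa bbb" ends at the boundary between "b" and ".", so the next match
# starts with ". " and can only grow to the next boundary before "cccccc".
```

Note also that . does not match "\n" without the m flag, so scan silently skips line breaks in the input.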

Any pointers will be greatly appreciated.

3 Solutions

#1


1  

You could do a Notepad-style word wrap.
Just build the regex with the quantifier range {1,N}, where N is the maximum number of characters per line.

The example below uses 32 max per line.

https://regex101.com/r/8vAkOX/1

Update: To include linebreaks within the range, add the dot-all modifier (?s).
Otherwise, stand-alone linebreaks are filtered out.

(?s)(?:((?>.{1,32}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,32})(?:\r?\n)?|(?:\r?\n|$))

Each chunk is captured in group 1 ($1), and you could replace every match with $1\r\n to get
output that looks wrapped.

Explained

 (?s) # Span line breaks
 (?:
      # -- Words/Characters 
      (                       # (1 start)
           (?>                     # Atomic Group - Match words with valid breaks
                .{1,32}                 #  1-N characters
                                        #  Followed by one of 4 prioritized, non-linebreak whitespace
                (?:                     #  break types:
                     (?<= [^\S\r\n] )        # 1. - Behind a non-linebreak whitespace
                     [^\S\r\n]?              #      ( optionally accept an extra non-linebreak whitespace )
                  |  (?= \r? \n )            # 2. - Ahead a linebreak
                  |  $                       # 3. - EOS
                  |  [^\S\r\n]               # 4. - Accept an extra non-linebreak whitespace
                )
           )                       # End atomic group
        |  
           .{1,32}                 # No valid word breaks, just break on the N'th character
      )                       # (1 end)
      (?: \r? \n )?           # Optional linebreak after Words/Characters
   |  
      # -- Or, Linebreak
      (?: \r? \n | $ )        # Stand alone linebreak or at EOS
 )
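The pattern above is written in PCRE style. A rough Ruby adaptation might look like the sketch below, under a few assumptions that are not part of the answer: the helper name wrap_chunks is hypothetical, the m flag replaces (?s) (which Ruby's engine, as far as I know, does not accept), and \z replaces $ because in Ruby $ matches at the end of every line rather than only at the end of the string.

```ruby
# Hypothetical helper: returns the group-1 chunks for a limit of n characters.
def wrap_chunks(text, n)
  pattern = /(?:((?>.{1,#{n}}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|\z|[^\S\r\n]))|.{1,#{n}})(?:\r?\n)?|(?:\r?\n|\z))/m
  # scan yields one [group1] array per match; matches that are only a
  # linebreak (or the empty match at the end) have a nil group 1.
  text.scan(pattern).map(&:first).compact
end

chunks = wrap_chunks("Now is the time for all good people to party", 10)
# → ["Now is the ", "time for ", "all good ", "people to ", "party"]
```

A chunk keeps the whitespace it broke on, so it can be one character longer than n; calling strip on each chunk removes it.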

#2


1  

Code

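def max_groups(str, n)
  arr = []
  pos = 0
  # \G anchors each match at pos, so no characters are ever skipped.
  # Alternatives, in order: the whole remainder if it fits in n characters;
  # exactly n characters followed by a space; or up to n-1 characters plus
  # the space that ends them.
  regex = /\G(?:.{1,#{n}}\z|.{#{n}}(?=[ ])|.{,#{n - 1}}[ ])/
  loop do
    break arr if pos == str.size
    m = str.match(regex, pos)
    return nil if m.nil?   # a word longer than n characters cannot be placed
    arr << m[0]
    pos += m[0].size
  end
end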

Examples

str = "Now is the time for all good people to party"
  #    12345678901234567890123456789012345678901234
  #    0         1         2         3         4

max_groups(str, 5)
  #=> nil
max_groups(str, 6)
  #=> ["Now is", " the ", "time ", "for ", "all ", "good ", "people", " to 
max_groups(str, 10)
  #=> ["Now is the", " time for ", "all good ", "people to ", "party"]
max_groups(str, 14)
  #=> ["Now is the ", "time for all ", "good people to", " party"]
max_groups(str, 15)
  #=> ["Now is the time", " for all good ", "people to party"]
max_groups(str, 29)
  #=> ["Now is the time for all good ", "people to party"]
max_groups(str, 43)
  #=> ["Now is the time for all good people to ", "party"]
max_groups(str, 44)
  #=> ["Now is the time for all good people to party"]

str = "How        you do?"
  #    123456789012345678
  #    0         1

max_groups(str, 4)
  #=> ["How ", "    ", "   ", "you ", "do?"]

#3


0  

This is what worked for me (thanks to @StefanPochmann's comments):

text = "Some really long string\nwith some line breaks"

The following first collapses every run of whitespace (including line breaks) into a single space, then breaks the string up.

text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)
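To see what the one-liner does on a small input, here is a minimal sketch. The helper name chunk_words and the limit of 10 are assumptions made for illustration; the regex is the same one with the limit interpolated.

```ruby
# Hypothetical helper wrapping the one-liner above.
def chunk_words(text, limit)
  # Collapse whitespace runs (including "\n") into single spaces.
  collapsed = text.gsub(/\s+/, ' ')
  # Take up to `limit` characters that are followed by a space or the end.
  collapsed.scan(/.{1,#{limit}}(?: |$)/).map(&:strip)
end

chunks = chunk_words("Now is the time\nfor all good people", 10)
# → ["Now is the", "time for", "all good", "people"]
```

One caveat: a word longer than the limit is not handled. scan moves ahead until the pattern fits again, so the beginning of such a word is dropped without any error.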

The resulting chunks lose all the line breaks (\n) from the original string. If you need to keep them, replace each one with a placeholder that does not otherwise occur in the text, for example (br), before applying the regex, and use it to restore the line breaks later. Like this:

text = "Some really long string\nwith some line breaks".gsub("\n", "(br)")

After we run the regex, we can restore the line breaks for the new chunks by replacing all occurrences of (br) with \n like this:

chunks = text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)
chunks.each{|chunk| chunk.gsub!('(br)', "\n")}
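Put together, the placeholder round trip might look like the following sketch. The helper name chunk_keeping_breaks, its placeholder: keyword and the limit of 20 are assumptions for illustration only.

```ruby
# Hypothetical helper; the placeholder is assumed not to occur in the text.
def chunk_keeping_breaks(text, limit, placeholder: "(br)")
  marked = text.gsub("\n", placeholder)
  chunks = marked.gsub(/\s+/, ' ').scan(/.{1,#{limit}}(?: |$)/).map(&:strip)
  # Put the line breaks back into each chunk.
  chunks.map { |chunk| chunk.gsub(placeholder, "\n") }
end

restored = chunk_keeping_breaks("Some really long string\nwith some line breaks", 20)
# → ["Some really long", "string\nwith some", "line breaks"]
```

The placeholder counts toward the limit, and the words on either side of a line break are treated as one word ("string(br)with"), so chunks can break a little earlier than strictly necessary.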

Looks like a long process but it worked for me.
