How to split text into words?
Example text:
'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'
The words in that line are:
- Oh
- you
- can't
- help
- that
- said
- the
- Cat
- we're
- all
- mad
- here
- I'm
- mad
- You're
- mad
6 answers
#1
32
Split text on whitespace, then trim punctuation.
var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
var punctuation = text.Where(Char.IsPunctuation).Distinct().ToArray();
var words = text.Split().Select(x => x.Trim(punctuation));
This agrees exactly with the example.
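As a rough sketch of how this might be packaged for reuse: the helper below applies the same split-then-trim idea. The names WordSplitter and SplitWords are made up for illustration. It additionally drops empty tokens, which the one-liner above produces for consecutive spaces or for tokens made only of punctuation. That extra filtering is an assumption about what you want, not part of the answer itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordSplitter
{
    // Hypothetical helper: split on whitespace, then trim punctuation from each token.
    public static List<string> SplitWords(string text)
    {
        // Collect every punctuation character that actually occurs in the input.
        char[] punctuation = text.Where(char.IsPunctuation).Distinct().ToArray();

        return text
            // A null separator array makes Split use whitespace as the delimiter.
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(token => token.Trim(punctuation))
            // Tokens that consisted only of punctuation are now empty; skip them.
            .Where(word => word.Length > 0)
            .ToList();
    }
}
```

Because only the ends of each token are trimmed, apostrophes inside contractions such as can't are left alone.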
#2
22
First, remove all special characters:
var fixedInput = Regex.Replace(input, "[^a-zA-Z0-9% ._]", string.Empty);
// This regex doesn't support apostrophe so the extension method is better
Then split it:
var splitted = fixedInput.Split(' ');
For a simpler C# solution for removing special characters (one that you can easily change), add this extension method (I added support for the apostrophe):
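// Extension methods must be declared inside a static class.
public static string RemoveSpecialCharacters(this string str) {
    StringBuilder sb = new StringBuilder();
    foreach (char c in str) {
        // Keep digits, ASCII letters, apostrophes, and spaces (spaces are needed for the Split that follows).
        if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'' || c == ' ') {
            sb.Append(c);
        }
    }
    return sb.ToString();
}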
Then use it like so:
var words = input.RemoveSpecialCharacters().Split(' ');
You may be surprised to know that this extension method is very efficient (it should be considerably more efficient than the regex), so I suggest you use it ;)
Update
I agree that this is an English-only approach, but to make it Unicode compatible all you have to do is replace:
(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
With:
char.IsLetter(c)
This supports Unicode. .NET also offers char.IsSymbol and char.IsLetterOrDigit for other cases.
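For illustration, here is a minimal sketch of what the Unicode-aware variant might look like once that replacement is made. The class name StringExtensions and the method name RemoveSpecialCharactersUnicode are hypothetical. The sketch also keeps spaces so that the result can still be split into words, which is an assumption added here rather than something the method above does.

```csharp
using System.Text;

public static class StringExtensions
{
    // Hypothetical Unicode-aware variant of RemoveSpecialCharacters.
    public static string RemoveSpecialCharactersUnicode(this string str)
    {
        var sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            // char.IsLetterOrDigit covers letters and decimal digits from any script.
            // Apostrophes are kept for contractions; spaces are kept (assumption)
            // so the cleaned string can still be split on ' '.
            if (char.IsLetterOrDigit(c) || c == '\'' || c == ' ')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

It would be used the same way as the original method, for example input.RemoveSpecialCharactersUnicode().Split(' '). Note that the check runs per UTF-16 char, so combining marks and characters outside the Basic Multilingual Plane are dropped by this sketch.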
#3
6
As a variation on @Adam Fridental's answer, which is very good, you could try this regex:
var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
var matches = Regex.Matches(text, @"\w+[^\s]*\w+|\w");
foreach (Match match in matches) {
    var word = match.Value;
}
I believe this is the shortest regex that will get all the words:
\w+[^\s]*\w+|\w
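One thing to be aware of is that [^\s]* accepts any non-whitespace characters between two word characters, so something like mad--here would come back as a single match. If that matters, a stricter pattern that only allows apostrophes inside a word is one option. The sketch below uses \w+(?:'\w+)*, which is a swapped-in pattern rather than the one above, and the names RegexWordSplitter and MatchWords are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWordSplitter
{
    // Hypothetical helper: a word is a run of word characters, optionally
    // followed by one or more groups of an apostrophe plus more word characters.
    public static List<string> MatchWords(string text)
    {
        return Regex.Matches(text, @"\w+(?:'\w+)*")
            .Cast<Match>()              // MatchCollection is non-generic on older frameworks
            .Select(match => match.Value)
            .ToList();
    }
}
```

With this pattern, quotes at the start or end of a token are never part of a match, while contractions such as we're stay intact.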
#4
1
If you don't want to use a Regex object, you could do something like...
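string mystring = "Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.";
// Split(' ') takes a char separator, which works on every .NET version; ToList() needs System.Linq.
List<string> words = mystring.Replace(",", "").Replace(":", "").Replace(".", "").Split(' ').ToList();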
You'll still have to handle the stray single quotes, such as the trailing one in "that,'" and the leading one in "'we're".
#5
1
This is one possible solution; I don't use any helper class or method. Note that it treats only ASCII letters as word characters, so a contraction such as can't is split into can and t.
public static List<string> ExtractChars(string inputString) {
    var result = new List<string>();
    int startIndex = -1; // start of the current word, or -1 when outside a word
    for (int i = 0; i < inputString.Length; i++) {
        var character = inputString[i];
        if ((character >= 'a' && character <= 'z') ||
            (character >= 'A' && character <= 'Z')) {
            if (startIndex == -1) {
                startIndex = i;
            }
            // A word that runs to the end of the input is closed here.
            if (i == inputString.Length - 1) {
                result.Add(GetString(inputString, startIndex, i));
            }
            continue;
        }
        // A non-letter ends the current word, if there is one.
        if (startIndex != -1) {
            result.Add(GetString(inputString, startIndex, i - 1));
            startIndex = -1;
        }
    }
    return result;
}
// Copies the characters from startIndex to endIndex (inclusive) into a new string.
public static string GetString(string inputString, int startIndex, int endIndex) {
    string result = "";
    for (int i = startIndex; i <= endIndex; i++) {
        result += inputString[i];
    }
    return result;
}
#6
0
You could try using a regex to remove the apostrophes that aren't surrounded by letters (i.e. single quotes) and then using the Char static methods to strip all the other characters. By calling the regex first you can keep the contraction apostrophes (e.g. can't) but remove the single quotes like in 'Oh.
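string myText = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
// Remove quote characters that are not both preceded and followed by a word character.
// A verbatim string is used so the backslashes reach the regex engine unchanged.
Regex reg = new Regex(@"(?<!\w)[""']|[""'](?!\w)");
myText = reg.Replace(myText, "");
string[] listOfWords = RemoveCharacters(myText);

public string[] RemoveCharacters(string input)
{
    StringBuilder sb = new StringBuilder();
    foreach (char c in input)
    {
        // Keep letters, whitespace, and the remaining (contraction) apostrophes.
        if (Char.IsLetter(c) || Char.IsWhiteSpace(c) || c == '\'')
            sb.Append(c);
    }
    return sb.ToString().Split(' ');
}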