更有效的方法来确定字符串是否以一组令牌中的令牌开头?

时间:2022-07-17 09:35:22

I'm currently doing something like this in some code I'm working on right now:

我现在正在做一些我现在正在做的代码:

public CommandType GetCommandTypeFromCommandString(String command)
{
   if(command.StartsWith(CommandConstants.Acknowledge))
      return CommandType.Acknowledge;
   else if (command.StartsWith(CommandConstants.Status))
      return CommandType.Status;
   else if (command.StartsWith(CommandConstants.Echo))
      return CommandType.Echo;
   else if (command.StartsWith(CommandConstants.Warning))
     return CommandType.Warning;
     // and so on

   return CommandType.None;
}

I'd like to know if there is a more efficient way of doing this in C#. This code needs to execute many, many times a second, and I'm not too happy with the time it takes to do all of those string comparisons. Any suggestions? :)

我想知道在C#中是否有更有效的方法。这段代码需要每秒执行很多次,而且我对完成所有这些字符串比较所花费的时间不太满意。有什么建议? :)

11 个解决方案

#1


One optimization would be to use the StringComparison enum to specify that you only want ordinal comparison. Like this:

一种优化方法是使用StringComparison枚举来指定您只需要进行序数比较。像这样:

if(command.StartsWith(CommandConstants.Acknowledge, StringComparison.Ordinal))
    return CommandType.Acknowledge;

If you don't specify a string comparison method the current culture will be used for comparison and that slows things down a bit.

如果您没有指定字符串比较方法,那么当前文化将用于比较,这会使事情变慢。

I did some (really really naive) benchmarking:

我做了一些(真的很天真)基准测试:

var a = "foo bar foo";
var b = "foo";

int numTimes = 1000000;

Benchmark.Time(() => a.StartsWith(b, StringComparison.Ordinal), "ordinal", numTimes);
Benchmark.Time(() => a.StartsWith(b), "culture sensitive", numTimes);

Which produced the following results:

其中产生了以下结果:

ordinal ran 1000000 times in 35.6033 ms
culture sensitive ran 1000000 times in 175.5859 ms

You should also order your comparisons so that the most likely tokens are being compared first (happy path).

您还应该对比较进行排序,以便首先比较最可能的标记(快乐路径)。

Theese optimizations are a simple way of making your current implementation perform better but if performance really is critical (and I mean really critical) you should be looking towards implementing some sort of state machine.

Theese优化是一种简单的方法,可以使您当前的实现更好,但如果性能真的很关键(我的意思是非常关键),那么您应该考虑实现某种状态机。

#2


Similar in concept to the FSM answer by Vojislav, you could try putting the constants into a trie. You can then replace your sequence of comparisons with a single traversal of the trie.

在概念上与Vojislav的FSM答案类似,你可以尝试将常数放入trie中。然后,您可以使用trie的单次遍历替换您的比较序列。

There is a C# trie implementation described here.

这里描述了一个C#trie实现。

#3


It's quite a bit of work, but you could construct an FSM based on each of the tokens. The FSM would accept characters from the command string one by one; it would have one final state for each token and an additional final state to which to go when the command doesn't start with any of the tokens.

这是相当多的工作,但你可以根据每个令牌构建一个FSM。 FSM将逐个接受命令字符串中的字符;对于每个令牌,它将具有一个最终状态,并且当命令不以任何令牌开始时,将具有另外的最终状态。

#4


I think you could do better with a regular expression and a Dictionary:

我认为你可以用正则表达式和词典做得更好:

static Regex reCommands = new Regex("^(cmd1|cmd2|cmd3|cmd4)", RegexOptions.Compiled);
static Dictionary<string, CommandType> Commands = new Dictionary<string, CommandType>();
private static InitDictionary()
{
    Commands.Add("cmd1", cmdType1);
    Commands.Add("cmd2", cmdType2);
    Commands.Add("cmd3", cmdType3);
    Commands.Add("cmd4", cmdType4);
}
public CommandType GetCommandTypeFromCommandString(String command)
{
    Match m = reCommands.Match(command);
    if (m.Success)
    {
        return Commands[m.Groups[1].Value];
    }
    return CommandType.None; // no command
}

#5


I think you should go for more readability than worry about efficiency. These operations are pretty fast. I second Serge, it's unlikely this part of the code is the bottleneck. I would do something like this:

我认为你应该考虑更多的可读性而不是担心效率。这些操作非常快。我是第二个塞尔,这部分代码不太可能成为瓶颈。我会做这样的事情:

public CommandType GetCommandTypeFromCommandString(String command)
{
   for(String currentCommand : allCommands) {
       if(command.StartsWith(currentCommand))
           return currentCommand;
   }
   return CommandType.None;
}

EDIT: As an afterthought, if you know which commands are used most frequently, you could order your array so those commands are at the start... you can also do this with if statements if you keep them.

编辑:作为事后的想法,如果您知道哪些命令最常使用,您可以订购您的阵列,以便这些命令在开始...如果您保留它们,也可以使用if语句执行此操作。

#6


Edit: In light of a misunderstanding of one of the caveats of StopWatch, my original answer doesn't perform so well as StartsWith in combination with StringComparison.Ordinal. Even if you compile the regex with all the correct options it is a little slower, with performance comparable to using StartsWith without any StringComparison settings. However, the regular expression route does give you more flexibility to match patterns whereas StartsWith doesn't, so I've left my original answer for posterity...

编辑:鉴于对StopWatch的一个警告的误解,我的原始答案与StringComparison.Ordinal结合使用时表现不如StartsWith。即使您使用所有正确的选项编译正则表达式,它也会慢一点,其性能可与使用StartsWith而不使用任何StringComparison设置相媲美。但是,正则表达式路径确实可以让您更灵活地匹配模式,而StartsWith却没有,所以我为后代留下了我的原始答案......

Original Answer:

I have to admit, I'm not sure exactly what you're looking for - however, it seems to me that this kind of code is generally parsing a log file of some description looking to pull out useful information. To save you doing all the string comparisons long hand you could generate a regular expression of the values in your enumeration and then parse the correct enum item using the matched item:

我不得不承认,我不确定你到底在想什么 - 但是,在我看来,这种代码通常会解析一些描述的日志文件,以寻找有用的信息。为了节省您长时间进行所有字符串比较,您可以在枚举中生成值的正则表达式,然后使用匹配的项解析正确的枚举项:

public enum CommandType
{
    Acknowledge,
    Status,
    Echo,
    Warning
}
static public CommandType? GetCommandTypeFromString(String command)
{
    var CommandTypes = String.Join("|", Enum.GetNames(typeof(CommandType)));
    var PassedConstant = Regex.Match(command, String.Format("(?i:^({0}))", CommandTypes)).Value;
    if (PassedConstant != String.Empty)
        return (CommandType)Enum.Parse(typeof(CommandType), PassedConstant, true);
    return null;
}

static void Main(string[] args)
{
    Console.WriteLine(GetCommandTypeFromString("Acknowledge that I am great!").ToString());
}

This would pull out CommandType.Acknowledge from the beginning of my string, but only if it existed at the start of the string... it would also pull out the other correct CommandTypes.

这将从我的字符串的开头拉出CommandType.Acknowledge,但只有它存在于字符串的开头...它还会拉出其他正确的CommandTypes。

Doing similar benchmarking to the accepted answer, I got about a 40% performance increase. I ran the accepted code over a million iterations in 10421ms, but mine ran in only 6459ms.

对接受的答案进行类似的基准测试,我的性能提升了大约40%。我在10421ms内运行了接受的代码超过一百万次迭代,但我的运行时间只有6459ms。

Of course, while the if statement you're using may not look as efficient as you'd like, it's still easier to read than using the regex form...

当然,虽然您使用的if语句可能看起来不如您所希望的那样高效,但它仍然比使用正则表达式更容易阅读...

#7


Create an

IDictionary<string,CommandType> 

and populate it with the values.

并用值填充它。

You don't need to compare everything...just look it up in the table.

你不需要比较所有东西......只需在表格中查找即可。

You'll also need to define your command syntax a little better. Require a space between the command and the rest of the line for example...

您还需要更好地定义命令语法。在命令和行的其余部分之间需要一个空格,例如......

#8


How many times is "many many"? I seriously doubt this part of the code is the bottleneck. You could probably optimize it a tiny tiny little bit with a switch statement based on the first letter of each command, assuming they are all different.

“多少次”多少次?我严重怀疑这部分代码是瓶颈。你可以用一个基于每个命令的第一个字母的switch语句来优化它,假设它们都是不同的。

But then again, it is really useful? I wouldn't bet much on it.

但话说回来,这真的很有用吗?我不会打赌它太多了。

#9


I did something similar as an extension method:

我做了类似的扩展方法:

public static bool StartsWith(this string s, params string[] candidate)
{
    string match = candidate.FirstOrDefault(t => s.StartsWith(t));
    return match != default(string);
}

Now this is just a predicate that returns whether or not a string in an array starts with a given string, but you could modify it somewhat:

现在这只是一个谓词,它返回数组中的字符串是否以给定字符串开头,但您可以稍微修改它:

public static int PrefixIndex(this string s, params string[] candidate)
{
    int index = -1;
    string match = candidate.FirstOrDefault(t => { index++; return s.StartsWith(t); });
    return match == default(string) ? -1 : index;
}

and in usage it would be:

在使用中它将是:

int index = command.PrefixIndex(tokenStrings);

if (index >= 0) {
    // convert to your enum
}

In a comment I saw that you wanted to do 30 string compares in 1/40th of a second. My friend, you should be able to do that on a 1 MHz machine. It should be no sweat to do thousands of string compares in 1/40th of a second.

在评论中,我看到你想在1/40秒内进行30次字符串比较。我的朋友,你应该可以在1 MHz的机器上做到这一点。在1/40秒内进行数千次字符串比较应该没有汗水。

#10


A trie's almost certainly going to be the fastest approach. If the prefixes were the same length, you might be able to come up with a faster implementation by hashing the prefix to get an index into an array, but that breaks down if you don't know how many bytes to hash.

特里肯定会成为最快的方法。如果前缀长度相同,那么您可以通过散列前缀来获得更快的实现来获取数组的索引,但如果您不知道要散列多少字节,则会出现故障。

#11


NOTE: I'm not demonstrating using Exception handling to control program flow. Enum.Parse will throw an exception if the string name of the Enum doesn't exist. The Catch clause just returns the default CommandType of None like the questioner's sample code does.

注意:我没有演示如何使用异常处理来控制程序流。如果Enum的字符串名称不存在,Enum.Parse将抛出异常。 Catch子句只返回默认的CommandType为None,就像提问者的示例代码一样。

If the object is just to return the actual Enum object given the string name couldn't you use:

如果对象只是为了返回实际的Enum对象给定字符串名称,则不能使用:

try
{
    return (CommandType)Enum.Parse(typeof(CommandType), strCmdName, true);
}
catch (Exception)
{
    return CommandType.None;
}

#1


One optimization would be to use the StringComparison enum to specify that you only want ordinal comparison. Like this:

一种优化方法是使用StringComparison枚举来指定您只需要进行序数比较。像这样:

if(command.StartsWith(CommandConstants.Acknowledge, StringComparison.Ordinal))
    return CommandType.Acknowledge;

If you don't specify a string comparison method the current culture will be used for comparison and that slows things down a bit.

如果您没有指定字符串比较方法,那么当前文化将用于比较,这会使事情变慢。

I did some (really really naive) benchmarking:

我做了一些(真的很天真)基准测试:

var a = "foo bar foo";
var b = "foo";

int numTimes = 1000000;

Benchmark.Time(() => a.StartsWith(b, StringComparison.Ordinal), "ordinal", numTimes);
Benchmark.Time(() => a.StartsWith(b), "culture sensitive", numTimes);

Which produced the following results:

其中产生了以下结果:

ordinal ran 1000000 times in 35.6033 ms
culture sensitive ran 1000000 times in 175.5859 ms

You should also order your comparisons so that the most likely tokens are being compared first (happy path).

您还应该对比较进行排序,以便首先比较最可能的标记(快乐路径)。

Theese optimizations are a simple way of making your current implementation perform better but if performance really is critical (and I mean really critical) you should be looking towards implementing some sort of state machine.

Theese优化是一种简单的方法,可以使您当前的实现更好,但如果性能真的很关键(我的意思是非常关键),那么您应该考虑实现某种状态机。

#2


Similar in concept to the FSM answer by Vojislav, you could try putting the constants into a trie. You can then replace your sequence of comparisons with a single traversal of the trie.

在概念上与Vojislav的FSM答案类似,你可以尝试将常数放入trie中。然后,您可以使用trie的单次遍历替换您的比较序列。

There is a C# trie implementation described here.

这里描述了一个C#trie实现。

#3


It's quite a bit of work, but you could construct an FSM based on each of the tokens. The FSM would accept characters from the command string one by one; it would have one final state for each token and an additional final state to which to go when the command doesn't start with any of the tokens.

这是相当多的工作,但你可以根据每个令牌构建一个FSM。 FSM将逐个接受命令字符串中的字符;对于每个令牌,它将具有一个最终状态,并且当命令不以任何令牌开始时,将具有另外的最终状态。

#4


I think you could do better with a regular expression and a Dictionary:

我认为你可以用正则表达式和词典做得更好:

static Regex reCommands = new Regex("^(cmd1|cmd2|cmd3|cmd4)", RegexOptions.Compiled);
static Dictionary<string, CommandType> Commands = new Dictionary<string, CommandType>();
private static InitDictionary()
{
    Commands.Add("cmd1", cmdType1);
    Commands.Add("cmd2", cmdType2);
    Commands.Add("cmd3", cmdType3);
    Commands.Add("cmd4", cmdType4);
}
public CommandType GetCommandTypeFromCommandString(String command)
{
    Match m = reCommands.Match(command);
    if (m.Success)
    {
        return Commands[m.Groups[1].Value];
    }
    return CommandType.None; // no command
}

#5


I think you should go for more readability than worry about efficiency. These operations are pretty fast. I second Serge, it's unlikely this part of the code is the bottleneck. I would do something like this:

我认为你应该考虑更多的可读性而不是担心效率。这些操作非常快。我是第二个塞尔,这部分代码不太可能成为瓶颈。我会做这样的事情:

public CommandType GetCommandTypeFromCommandString(String command)
{
   for(String currentCommand : allCommands) {
       if(command.StartsWith(currentCommand))
           return currentCommand;
   }
   return CommandType.None;
}

EDIT: As an afterthought, if you know which commands are used most frequently, you could order your array so those commands are at the start... you can also do this with if statements if you keep them.

编辑:作为事后的想法,如果您知道哪些命令最常使用,您可以订购您的阵列,以便这些命令在开始...如果您保留它们,也可以使用if语句执行此操作。

#6


Edit: In light of a misunderstanding of one of the caveats of StopWatch, my original answer doesn't perform so well as StartsWith in combination with StringComparison.Ordinal. Even if you compile the regex with all the correct options it is a little slower, with performance comparable to using StartsWith without any StringComparison settings. However, the regular expression route does give you more flexibility to match patterns whereas StartsWith doesn't, so I've left my original answer for posterity...

编辑:鉴于对StopWatch的一个警告的误解,我的原始答案与StringComparison.Ordinal结合使用时表现不如StartsWith。即使您使用所有正确的选项编译正则表达式,它也会慢一点,其性能可与使用StartsWith而不使用任何StringComparison设置相媲美。但是,正则表达式路径确实可以让您更灵活地匹配模式,而StartsWith却没有,所以我为后代留下了我的原始答案......

Original Answer:

I have to admit, I'm not sure exactly what you're looking for - however, it seems to me that this kind of code is generally parsing a log file of some description looking to pull out useful information. To save you doing all the string comparisons long hand you could generate a regular expression of the values in your enumeration and then parse the correct enum item using the matched item:

我不得不承认,我不确定你到底在想什么 - 但是,在我看来,这种代码通常会解析一些描述的日志文件,以寻找有用的信息。为了节省您长时间进行所有字符串比较,您可以在枚举中生成值的正则表达式,然后使用匹配的项解析正确的枚举项:

public enum CommandType
{
    Acknowledge,
    Status,
    Echo,
    Warning
}
static public CommandType? GetCommandTypeFromString(String command)
{
    var CommandTypes = String.Join("|", Enum.GetNames(typeof(CommandType)));
    var PassedConstant = Regex.Match(command, String.Format("(?i:^({0}))", CommandTypes)).Value;
    if (PassedConstant != String.Empty)
        return (CommandType)Enum.Parse(typeof(CommandType), PassedConstant, true);
    return null;
}

static void Main(string[] args)
{
    Console.WriteLine(GetCommandTypeFromString("Acknowledge that I am great!").ToString());
}

This would pull out CommandType.Acknowledge from the beginning of my string, but only if it existed at the start of the string... it would also pull out the other correct CommandTypes.

这将从我的字符串的开头拉出CommandType.Acknowledge,但只有它存在于字符串的开头...它还会拉出其他正确的CommandTypes。

Doing similar benchmarking to the accepted answer, I got about a 40% performance increase. I ran the accepted code over a million iterations in 10421ms, but mine ran in only 6459ms.

对接受的答案进行类似的基准测试,我的性能提升了大约40%。我在10421ms内运行了接受的代码超过一百万次迭代,但我的运行时间只有6459ms。

Of course, while the if statement you're using may not look as efficient as you'd like, it's still easier to read than using the regex form...

当然,虽然您使用的if语句可能看起来不如您所希望的那样高效,但它仍然比使用正则表达式更容易阅读...

#7


Create an

IDictionary<string,CommandType> 

and populate it with the values.

并用值填充它。

You don't need to compare everything...just look it up in the table.

你不需要比较所有东西......只需在表格中查找即可。

You'll also need to define your command syntax a little better. Require a space between the command and the rest of the line for example...

您还需要更好地定义命令语法。在命令和行的其余部分之间需要一个空格,例如......

#8


How many times is "many many"? I seriously doubt this part of the code is the bottleneck. You could probably optimize it a tiny tiny little bit with a switch statement based on the first letter of each command, assuming they are all different.

“多少次”多少次?我严重怀疑这部分代码是瓶颈。你可以用一个基于每个命令的第一个字母的switch语句来优化它,假设它们都是不同的。

But then again, it is really useful? I wouldn't bet much on it.

但话说回来,这真的很有用吗?我不会打赌它太多了。

#9


I did something similar as an extension method:

我做了类似的扩展方法:

public static bool StartsWith(this string s, params string[] candidate)
{
    string match = candidate.FirstOrDefault(t => s.StartsWith(t));
    return match != default(string);
}

Now this is just a predicate that returns whether or not a string in an array starts with a given string, but you could modify it somewhat:

现在这只是一个谓词,它返回数组中的字符串是否以给定字符串开头,但您可以稍微修改它:

public static int PrefixIndex(this string s, params string[] candidate)
{
    int index = -1;
    string match = candidate.FirstOrDefault(t => { index++; return s.StartsWith(t); });
    return match == default(string) ? -1 : index;
}

and in usage it would be:

在使用中它将是:

int index = command.PrefixIndex(tokenStrings);

if (index >= 0) {
    // convert to your enum
}

In a comment I saw that you wanted to do 30 string compares in 1/40th of a second. My friend, you should be able to do that on a 1 MHz machine. It should be no sweat to do thousands of string compares in 1/40th of a second.

在评论中,我看到你想在1/40秒内进行30次字符串比较。我的朋友,你应该可以在1 MHz的机器上做到这一点。在1/40秒内进行数千次字符串比较应该没有汗水。

#10


A trie's almost certainly going to be the fastest approach. If the prefixes were the same length, you might be able to come up with a faster implementation by hashing the prefix to get an index into an array, but that breaks down if you don't know how many bytes to hash.

特里肯定会成为最快的方法。如果前缀长度相同,那么您可以通过散列前缀来获得更快的实现来获取数组的索引,但如果您不知道要散列多少字节,则会出现故障。

#11


NOTE: I'm not demonstrating using Exception handling to control program flow. Enum.Parse will throw an exception if the string name of the Enum doesn't exist. The Catch clause just returns the default CommandType of None like the questioner's sample code does.

注意:我没有演示如何使用异常处理来控制程序流。如果Enum的字符串名称不存在,Enum.Parse将抛出异常。 Catch子句只返回默认的CommandType为None,就像提问者的示例代码一样。

If the object is just to return the actual Enum object given the string name couldn't you use:

如果对象只是为了返回实际的Enum对象给定字符串名称,则不能使用:

try
{
    return (CommandType)Enum.Parse(typeof(CommandType), strCmdName, true);
}
catch (Exception)
{
    return CommandType.None;
}