Best way to parse a string into a dictionary of terms

Date: 2023-02-05 14:06:02

Input - string: "TAG1xxxTAG2yyyTAG3zzzTAG1tttTAG1bbb"

Expected result: the pairs TAG1 = {xxx, ttt, bbb}, TAG2 = {yyy}, TAG3 = {zzz}.

I did it with a regex, but I'm uneasy about calling Regex.Replace and discarding its return value. How can this code be improved?

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace TermsTest
{
    class Program
    {
        static void Main(string[] args)
        {
            string[] tags = { "TAG1", "TAG2", "TAG3", "TAG4", "TAG5", "TAG6", "TAG7", "TAG8" };
            string file = "TAG2jjfjfjndbfdjTAG1qqqqqqqTAG3uytygh fhdjdfTAG5hgjdhfghTAG6trgfmxc hdfhdTAG2jfksksdhjskTAG3kdjbjvbsjTAG2jskjdjdvjvbxjkvbjdTAG2jkxcndjcjbkjn";

            string tag = "(" + string.Join("|", tags) + ")";

            var dictionary = new Dictionary<string, List<string>>(tags.Length);
            Regex.Replace(file, string.Format(@"({0})(.+?)(?={0}|$)", tag), match =>
            {
                string key = match.Groups[1].Value, value = match.Groups[3].Value;
                if (dictionary.ContainsKey(key))
                    dictionary[key].Add(value);
                else
                    dictionary[key] = new List<string> {value};
                return "";
            });
            foreach (var pair in dictionary)
            {
                Console.Write(pair.Key + " =\t");
                foreach (var entry in pair.Value)
                {
                    Console.Write(entry + " ");
                }
                Console.WriteLine();
                Console.WriteLine();
            }
        }
    }
}

4 Solutions

#1


3  

string input = "TAG1xxxTAG2yyyTAG3zzzTAG1tttTAG1bbb";
var lookup = Regex.Matches(input, @"(TAG\d)(.+?)(?=TAG|$)")
                    .Cast<Match>()
                    .ToLookup(m => m.Groups[1].Value, m => m.Groups[2].Value);

foreach (var kv in lookup)
{
    Console.WriteLine(kv.Key + " => " + String.Join(", ", kv));
}

OUTPUT:

TAG1 => xxx, ttt, bbb
TAG2 => yyy
TAG3 => zzz
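
The pattern above hardcodes TAG\d, while the question starts from a tags array and wants a dictionary. A minimal sketch of that variant, assuming the tag names come from the caller and should be matched literally; the TagParser class and its Parse method are hypothetical names used only for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class TagParser
{
    // Hypothetical helper: collects the text that follows each tag under that tag's name.
    public static Dictionary<string, List<string>> Parse(string input, IEnumerable<string> tags)
    {
        // Longer tags go first so that "TAG10" is not matched as "TAG1" followed by "0".
        // Regex.Escape makes any regex metacharacters in a tag name literal.
        string alternation = string.Join("|",
            tags.OrderByDescending(t => t.Length).Select(t => Regex.Escape(t)));

        // Named groups avoid counting parentheses in the generated pattern.
        string pattern = "(?<tag>" + alternation + ")(?<val>.*?)(?=" + alternation + "|$)";

        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .GroupBy(m => m.Groups["tag"].Value, m => m.Groups["val"].Value)
                    .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

Unlike the (.+?) in the original pattern, this sketch uses .*? so that a tag immediately followed by another tag yields an empty value; whether that is the desired behavior is an assumption.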

#2


0  

What you are trying to do is simply group the values that share a tag, so it should be easier to use the GroupBy method:

string input = "TAG1xxxTAG2yyyTAG3zzzTAG1tttTAG1bbb";
var list = Regex.Matches(input, @"(TAG\d+)(.+?)(?=TAG\d+|$)")
                .Cast<Match>()
                .GroupBy(m => m.Groups[1].Value,
                         (key, values) => string.Format("{0} = {{{1}}}", 
                                             key, 
                                             string.Join(", ", 
                                                values.Select(v => v.Groups[2].Value))));
var output = string.Join(", ", list);

This produces the output string "TAG1 = {xxx, ttt, bbb}, TAG2 = {yyy}, TAG3 = {zzz}".

#3


0  

I'm not sure I'm aware of all the assumptions and conventions in this problem, but this gave me a similar result:

var tagColl = string.Join("|", tags);
var tagGroup = string.Format("(?<tag>{0})(?<val>[a-z]*)", tagColl);

var result = from x in Regex.Matches(file, tagGroup).Cast<Match>()
                where x.Success
                let pair = new { fst = x.Groups["tag"].Value, snd = x.Groups["val"].Value }
                group pair by pair.fst into g
                select g;

And a simple test would be:

Console.WriteLine(string.Join("\r\n", from g in result
                                        let coll = string.Join(", ", from item in g select item.snd)
                                        select string.Format("{0}: {{{1}}}", g.Key, coll)));
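
One thing to watch with the [a-z]* value pattern: it stops at the first character that is not a lowercase letter, so a value such as "uytygh fhdjdf" from the question's sample would be cut at the space. A possible alternative, sketched here under the assumption that a value runs until the next tag begins, is to consume one character at a time behind a negative lookahead. The ValuePatternDemo class and Pairs method are hypothetical names:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class ValuePatternDemo
{
    // Hypothetical helper: returns "tag=value" strings in the order the tags occur.
    public static string[] Pairs(string file, string[] tags)
    {
        var tagColl = string.Join("|", tags.Select(t => Regex.Escape(t)));

        // (?:(?!tags).)* takes a character only if no tag starts at that position,
        // so spaces, digits, and uppercase letters stay inside the value.
        var tagGroup = string.Format("(?<tag>{0})(?<val>(?:(?!{0}).)*)", tagColl);

        return (from Match x in Regex.Matches(file, tagGroup)
                select x.Groups["tag"].Value + "=" + x.Groups["val"].Value).ToArray();
    }
}
```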

#4


0  

This is a perfect job for the .NET CaptureCollection object, a unique .NET feature that lets the same capture group record several captures.

Use this regex with Matches to create a MatchCollection:

(?:TAG1(.*?(?=TAG|$)))?(?:TAG2(.*?(?=TAG|$)))?(?:TAG3(.*?(?=TAG|$)))?

Then inspect the captures of each match:

  • Groups[1].Captures will contain the TAG1 values
  • Groups[2].Captures will contain the TAG2 values
  • Groups[3].Captures will contain the TAG3 values

From there it's a short step to your final data structure.

To reduce the potential for backtracking, you can make the tokens atomic:

(?>(?:TAG1(.*?(?=TAG|$)))?)(?>(?:TAG2(.*?(?=TAG|$)))?)(?>(?:TAG3(.*?(?=TAG|$)))?)

For details about how this works, see Capture Groups that can be Quantified.
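
As far as I can tell, each optional group in the patterns above matches at most once per Match, so with input like the question's the TAG1 values end up spread over several matches. A sketch of a variant that gathers everything in a single Match, assuming it is acceptable to wrap the three alternatives in one repeated group; the CaptureDemo class and CapturesFor method are hypothetical names:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class CaptureDemo
{
    // Hypothetical helper: returns every capture recorded for the given group number.
    public static string[] CapturesFor(string input, int groupNumber)
    {
        // One alternative per tag. The trailing + repeats the whole alternation,
        // so each numbered group records a new capture every time its tag occurs.
        const string pattern =
            @"(?:TAG1(.*?(?=TAG|$))|TAG2(.*?(?=TAG|$))|TAG3(.*?(?=TAG|$)))+";

        Match match = Regex.Match(input, pattern);
        return match.Groups[groupNumber].Captures
                    .Cast<Capture>()
                    .Select(c => c.Value)
                    .ToArray();
    }
}
```

Calling CaptureDemo.CapturesFor(input, 1) on the sample string from the question should then give the TAG1 values in the order they appear, and groups 2 and 3 the TAG2 and TAG3 values.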
