Using regex to scan and swap string values in ANSI C

Time: 2022-09-06 13:05:00

I want to transform a given input in my C program, for example:

foo_bar_something-like_this

into this:

thissomethingbarfoolike

Explanation:

Every time I get a _, the following text up to, but not including, the next _ or - (or the end of the line) needs to go to the beginning (and the preceding _ needs to be removed). Every time I get a -, the following text up to, but not including, the next _ or - (or the end of the line) needs to be appended to the end (with the - removed).

If possible, I would like to use regular expressions in order to achieve this. If there is a way to do this directly from stdin, it would be optimal.

Note that it is not necessary to do it in a single regular expression. I can do some kind of loop to do this. In this case I believe I would have to capture the data in a variable first and then do my algorithm.

I have to do this operation for every line in my input, each of which ends with \n.

EDIT: I had already written code for this that does not use regexes; I should have posted it in the first place, my apologies. I know scanf should be avoided to prevent buffer overflows, but the strings are already validated before being used in the program. The code is the following:

#include <stdio.h>
#include <stdlib.h>
#define MAX_LENGTH 100001 //A fixed maximum amount of characters per line
int main(){
  char c=0;
  /*
  *home: 1 (append to the start), 0 (append to the end)
  *str: array of words appended to the beginning
  *strlen: length of str
  *line: string of words appended to the end
  *linelen: length of line
  *word: word between a combination of symbols - and _
  *wordlen: length of the actual word
  */
  int home,strlen,linelen,wordlen;
  char **str,*line,*word;
  str=(char**)malloc(MAX_LENGTH*sizeof(char*));
  while(c!=EOF && scanf("%c",&c)!=EOF){
    line=(char*)malloc(MAX_LENGTH);
    word=(char*)malloc(MAX_LENGTH);
    line[0]=word[0]='\0';
    home=strlen=linelen=wordlen=0;
    while(c!='\n'){
      if(c=='-'){ //store the current word in str; the following text goes to the end
        home=0;
        str[strlen++]=word;
        word=(char*)malloc(MAX_LENGTH);
        wordlen=0;
        word[0]='\0';
      }else if(c=='_'){ //store the current word in str; the following text goes to the start
        home=1;
        str[strlen++]=word;
        word=(char*)malloc(MAX_LENGTH);
        wordlen=0;
        word[0]='\0';
      }else if(home){ //append the c to word
        word[wordlen++]=c;
        word[wordlen]='\0';
      }else{ //append c to line
        line[linelen++]=c;
        line[linelen]='\0';
      }
      if(scanf("%c",&c)!=1){c=EOF;break;} //scan the next character; stop if the input ends without a newline
    }
    printf("%s",word); //print the last word
    free(word);
    while(strlen--){ //print each word stored in the array
      printf("%s",str[strlen]);
      free(str[strlen]);
    }
    printf("%s\n",line); //print the text appended to the end
    free(line);
  }
  free(str); //release the array of word pointers
  return 0;
}

2 solutions

#1


I do not think regex can do what you are asking for, so I wrote a simple state machine solution in C.

//
//Description: This program takes a string of character input and parses it,
//using underscore and hyphen as cues to send data to either
//the beginning or the end of the output.
//
//Date: 11/18/2017
//
//Author: Elizabeth Harasymiw
//

#include <stdio.h>
#include <string.h>
#define MAX_SIZE 100

typedef enum{ AppendEnd, AppendBegin } State; //Tracks whether we are writing to the beginning or the end of the output

int main(int argc,char**argv){
        int ch;                    //Holds the character currently being examined (int so that EOF can be detected)
        State state=AppendEnd;     //Current state
        char Buffer[MAX_SIZE]={0}; //Current Output
        char Word[MAX_SIZE]={0};   //Pending data to the Buffer
        char *c;                   //Used to index and clear Word
        while((ch = getc(stdin)) != EOF){
                if(ch=='\n')continue;
                switch(state){
                        case AppendEnd:
                                if( ch == '-' )
                                        break;
                                if( ch == '_'){
                                        state = AppendBegin;     //Change State
                                        strcat(Buffer, Word);    //Add Word to end of Output
                                        for(c=Word;*c;c++)*c=0;  //Clear Word
                                        break;
                                }
                                {
                                        int postion = -1;
                                        while(Word[++postion]);  //Find end of Word
                                        Word[postion] = ch;      //Add Character to end of Word
                                }
                                break;
                        case AppendBegin:
                                if( ch == '-' ){
                                        state = AppendEnd;       //Change State
                                        strcat(Word, Buffer);    //Add Output to end of Word
                                        strcpy(Buffer, Word);    //Move Output from Word back to Output
                                        for(c=Word;*c;c++)*c=0;  //Clear Word
                                        break;
                                }
                                if( ch == '_'){
                                        strcat(Word, Buffer);    //Add Output to end of Word
                                        strcpy(Buffer, Word);    //Move Output from Word back to Output
                                        for(c=Word;*c;c++)*c=0;  //Clear Word
                                        break;
                                }
                                {
                                        int postion = -1;
                                        while(Word[++postion]);  //Find end of Word
                                        Word[postion] = ch;      //Add Character to end of Word
                                }
                                break;

                }
        }
        switch(state){ //Finish adding the Last Word Buffer to Output
                case AppendEnd:
                        strcat(Buffer, Word); //Add Word to end of Output
                        break;
                case AppendBegin:
                        strcat(Word, Buffer); //Add Output to end of Word
                        strcpy(Buffer, Word); //Move Output from Word back to Output
                        break;
        }

        printf("%s\n", Buffer);
        return 0;
}
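
The same rule can also be sketched as a function that works on one line at a time, which is what the question asks for. This is only a minimal sketch, not part of the answer above: the name rearrange_line and its parameters are hypothetical, and it assumes that `in` and `out` do not overlap. It splits the line at each _ or -, moves a segment introduced by _ to the front, and appends every other segment to the end.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: rearranges one line according to the rule in the
 * question. A trailing '\n' in `in` is ignored.
 * Returns 0 on success, -1 if `out` (of size `outsz`) is too small. */
int rearrange_line(const char *in, char *out, size_t outsz)
{
    size_t len = strcspn(in, "\n");   /* length without the newline */
    const char *p = in;
    const char *end = in + len;
    size_t used = 0;                  /* bytes of `out` filled so far */
    char sep = '-';                   /* the first segment is appended, like '-' */

    if (len + 1 > outsz)
        return -1;

    for (;;) {
        size_t n = 0;

        /* measure the segment up to the next separator or the end */
        while (p + n < end && p[n] != '_' && p[n] != '-')
            n++;

        if (sep == '_') {
            memmove(out + n, out, used);  /* make room at the front */
            memcpy(out, p, n);
        } else {
            memcpy(out + used, p, n);     /* append to the end */
        }
        used += n;
        p += n;

        if (p == end)
            break;
        sep = *p++;                       /* remember the separator, then skip it */
    }
    out[used] = '\0';
    return 0;
}
```

A caller could read each line with fgets into a buffer, pass it to rearrange_line together with an output buffer of the same size, and print the result followed by a newline.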

#2


This can be done with regexes using loops, assuming you aren't strictly restricted to ANSI. The following uses PCRE.

(Note that this answer deliberately does not show the C code. It is only meant to guide the OP by showing a possible technique for using regexes, as it is not obvious how to do so.)

Method A

Uses two different regexes.

Part 1/2 (Demo)

Regex: ([^_\n]*)_([^_\n]*)(_.*)? Substitution: $2--$1$3

This moves the text following the next underscore to the beginning, appending -- to it. It also removes the underscore. You need to repeat this substitution in a loop until no more matches are found.

For your example, this leads to the following string:

this--something-like--bar--foo

Part 2/2 (Demo):

Regex: (.*)(?<!-)-(?!-)(\w+)(.*) Substitution: $1$3--$2

This moves the text following the next single hyphen to the end, prepending -- to it. It also removes the hyphen. You need to repeat this substitution in a loop until no more matches are found.

For your example, this leads to the following string:

this--something--bar--foo--like

Remove the hyphens from the string to get your result.


Note that the first regex can be simplified to the following and will still work:

([^_]*)_([^_]*)(_.*)?

The \ns were only required to show the intermediate loop results in the demos.

The following are the reasons for using -- as a new separator:

  • A separator is required so that the regex in part 2 can find the correct end of hyphen-prefixed text;
  • An underscore can't be used as it would interfere with the regex in part 1, causing an infinite loop;
  • A hyphen can't be used as it would cause the regex in part 2 to find extraneous text;
  • Although any single-character delimiter which can never exist in the input would work and lead to a simpler part 2 regexp, -- is one of the delimiters which allows any and every character* in the input;
  • \n is actually the perfect* delimiter, but can't be used in this answer as it would not allow the demo to show the intermediate results. (Hint: it should be the actual delimiter used by you.)
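
As a rough illustration of Method A in C, the sketch below uses the POSIX <regex.h> interface instead of PCRE, with \n as the delimiter as the hint suggests. Several things here are assumptions rather than part of the answer: <regex.h> is POSIX, not ANSI C; the names regex_rearrange, loop_substitute, put_group and WORK_SIZE are hypothetical; and the part 2 pattern is not the one shown above. POSIX regexes have no lookarounds, so part 2 is swapped for a simpler pattern, which the \n delimiter makes possible. That pattern takes the leftmost hyphen on each pass, so several hyphen-introduced segments keep their input order.

```c
#include <regex.h>   /* POSIX, not part of ANSI C */
#include <stddef.h>
#include <string.h>

#define WORK_SIZE 1024   /* hypothetical per-line limit for this sketch */

/* Append capture group g of match m (taken over src) to dst.
 * A group that did not take part in the match has rm_so == -1. */
static char *put_group(char *dst, const char *src, const regmatch_t *m, int g)
{
    if (m[g].rm_so >= 0) {
        size_t n = (size_t)(m[g].rm_eo - m[g].rm_so);
        memcpy(dst, src + m[g].rm_so, n);
        dst += n;
    }
    return dst;
}

/* Repeat one substitution on s until the pattern no longer matches.
 * layout lists the capture groups in output order; 0 stands for the
 * '\n' delimiter. Each pass swaps one '_' or '-' for one '\n', so the
 * length of s never changes. Returns 0 on success, -1 on a bad pattern. */
static int loop_substitute(const char *pattern, const int layout[4], char *s)
{
    regex_t re;
    regmatch_t m[4];
    char tmp[WORK_SIZE];

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    while (regexec(&re, s, 4, m, 0) == 0) {
        char *d = tmp;
        int i;
        for (i = 0; i < 4; i++) {
            if (layout[i] == 0)
                *d++ = '\n';
            else
                d = put_group(d, s, m, layout[i]);
        }
        *d = '\0';
        strcpy(s, tmp);
    }
    regfree(&re);
    return 0;
}

/* Rearranges one line; a trailing '\n' in `line` is ignored.
 * Returns 0 on success, -1 if the line does not fit or a pattern fails. */
int regex_rearrange(const char *line, char *out, size_t outsz)
{
    static const int to_front[4] = {2, 0, 1, 3};  /* $2 \n $1 $3 */
    static const int to_back[4]  = {1, 3, 0, 2};  /* $1 $3 \n $2 */
    char work[WORK_SIZE];
    size_t len = strcspn(line, "\n");
    char *r, *w;

    if (len >= sizeof work || len >= outsz)
        return -1;
    memcpy(work, line, len);
    work[len] = '\0';

    /* Part 1 moves '_' segments to the front, part 2 moves '-' segments
     * to the end. Without REG_NEWLINE, '.' and [^_] also match '\n'. */
    if (loop_substitute("^([^_]*)_([^_]*)(_.*)?$", to_front, work) != 0 ||
        loop_substitute("^([^-]*)-([^\n-]*)(.*)$", to_back, work) != 0)
        return -1;

    /* Finally drop the '\n' delimiters. */
    for (r = work, w = out; *r != '\0'; r++)
        if (*r != '\n')
            *w++ = *r;
    *w = '\0';
    return 0;
}
```

Compiling the patterns once outside the loop over input lines would avoid repeating regcomp for every line; the sketch keeps them inside for brevity.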

Method B

Combines the two regexes.

(Demo)

Regex: ([^_\n]*)_([^_\n]*)(_.*)?|(.*)(?<!-)-(?!-)(\w+)(.*) Substitution: $2--$1$3$4$6--$5

For your example, this leads to the following string:

----this------something--bar--foo----like

As before, remove all the hyphens from the string to get your result.

Also as before, the regex can be simplified to the following and will still work:

([^_]*)_([^_]*)(_.*)?|(.*)(?<!-)-(?!-)(\w+)(.*)

This combined regex works because capturing groups 1, 2 & 3 are mutually exclusive with groups 4, 5 & 6. There is a side effect of extra hyphens, however.

Caveat:

* Using -- as a delimiter fails if the input contains consecutive hyphens. All the other "good" delimiters have a similar failure edge case. Only \n is guaranteed not to exist in the input and thus is failsafe.

