Regular expression to reorder strings in a field

Time: 2022-10-19 09:59:27

I am trying to write a program that uses regular expressions to clean up some data. Say I have room names that may contain a letter and a number. The final output should follow the pattern "the full string (excluding letter & number) + letter + number", as in the examples below. With the regular expressions I've written so far, however, I get badly garbled results, shown at the bottom of this message. For some reason, letters and numbers appear on some rows even though the input data has none there. Thank you.

EDITED: I made edits to the input data. I would like to generalize the code to take any number of character strings, not just the single word "ROOM".

# the pattern should be "the full string (excluding letter & number) + letter + number". For example:
ATLANTA ROOM
ATLANTA ROOM 3
NEW YORK ROOM A 2
ROOM A 4
THE BIG AWESOME ROOM B
ROOM B 4
GEORGETOWN ROOM B 2
NEW YORK ROOM C 2
NEW YORK ROOM C
LOS ANGELES ROOM E 2

# program to clean with regular expressions. there could be multiple spaces between words
dd <- c("ATLANTA ROOM ",
    " ATLANTA ROOM  3",
    "NEW YORK A ROOM   2",
    "4 ROOM A",
    "THE BIG AWESOME ROOM B",
    " ROOM 4 B",
    "GEORGETOWN B 2 ROOM ",
    " C NEW YORK ROOM 2",
    "NEW YORK ROOM C",
    "LOS ANGELES ROOM 2  E")

m_char_num <- regexpr("(\\<A|B|C|D|E|1|2|3|4\\>)", dd)
m_char <- regexpr("(\\<A|B|C|D|E\\>)", dd)
m_num <- regexpr("(\\<1|2|3|4\\>)", dd)

(dd2 <- paste(gsub("( +)", " ",
                   gsub("(^ +)|( +$)", "",
                        gsub("(\\<A|B|C|D|E|1|2|3|4\\>)", "", dd))),
              regmatches(dd, m_char), regmatches(dd, m_num), sep = " "))

# actual output from the program
"TLANTA ROOMA3",
"TLANTA ROOMA2",
"NW YORK ROOMA4",
"ROOMA4", 
"TH IG WSOM ROOME2",
"ROOMB2",
"GORGTOWN ROOMB2",
"NW YORK ROOMC3", 
"NW YORK ROOMC2",
"LOS NGLS ROOMA4"

3 solutions

#1


4  

Here's an attempt:

sub(' $', '', # clean up spaces at the end
    gsub(' +', ' ', # clean up double spaces
         # rearrange letter and numbers
         sub('^([A-Z]?)([0-9]*)([A-Z]?)$', 'ROOM \\1\\3 \\2',
             gsub(' |ROOM', '', dd)    # remove spaces and ROOM
            )
        )
   )
#[1] "ROOM"     "ROOM 3"   "ROOM A 2" "ROOM A 4" "ROOM B"   "ROOM B 4" "ROOM B 2"
#[8] "ROOM C 2" "ROOM C"   "ROOM E 2"

And here's the same logic for the edited OP and the comment below (assuming every word in a room name has at least 3 letters and the room designation has at most 2 letters):

gsub('(^ | $)', '', # clean up spaces in front or end
     gsub(' +', ' ', # clean up double spaces
          # extract room name and put it in front of the letter and number
          paste(gsub('\\b([A-Z][A-Z]?|[0-9]+)\\b', '', dd, perl = TRUE),
                sub('^([A-Z]+)?([0-9]*)([A-Z]+)?$', '\\1\\3 \\2',
                    gsub(' |\\w\\w\\w+', '', dd)    # remove spaces and words
                   )
               )
         )
    )
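
The nested call above can be hard to follow, so here is a sketch that unrolls it into named steps. The helper name reorder_room() and the variables name and code are hypothetical, introduced only for illustration. It uses trimws() where the answer uses a final gsub(), and it keeps the same assumptions (name words of 3+ letters, designations of at most 2 letters).

```r
# Hypothetical helper: same logic as the nested call, one step per line
reorder_room <- function(x) {
  # room name: drop 1-2 letter words and standalone numbers
  name <- gsub("\\b([A-Z][A-Z]?|[0-9]+)\\b", "", x, perl = TRUE)
  # designation: drop spaces and words of 3 or more characters
  code <- gsub(" |\\w\\w\\w+", "", x)
  # put the letter(s) before the number, whichever order they arrived in
  code <- sub("^([A-Z]+)?([0-9]*)([A-Z]+)?$", "\\1\\3 \\2", code)
  # join, collapse repeated spaces, trim both ends
  trimws(gsub(" +", " ", paste(name, code)))
}

reorder_room(c(" ROOM 4 B", "GEORGETOWN B 2 ROOM "))
# [1] "ROOM B 4"            "GEORGETOWN ROOM B 2"
```

Splitting the steps makes it easier to print name and code separately when a row comes out wrong.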

#2


2  

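What's happening is that regmatches() returns only the matches it actually finds (e.g. only 8 letters for your 10 rows), so instead of inserting "" or NA for the rows without a match, paste() recycles the shorter vectors.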
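
To make the recycling visible, here is a minimal sketch on a toy vector. The input x and the names m, found and aligned are made up for illustration and are not part of the question's data.

```r
# Toy input (hypothetical): only two of the three strings contain a letter A-E
x <- c("ROOM A 2", "ROOM 3", "ROOM B")
m <- regexpr("\\<[A-E]\\>", x)

# regmatches() drops the non-matching element, so the result has length 2
found <- regmatches(x, m)
length(found)
# [1] 2

# paste() silently recycles the shorter vector, so "ROOM 3" picks up a "B"
paste(x, found)
# [1] "ROOM A 2 A" "ROOM 3 B"   "ROOM B A"

# Filling a pre-sized vector keeps each match aligned with its own row
aligned <- rep("", length(x))
aligned[m > 0] <- found
aligned
# [1] "A" ""  "B"
```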

Here is a fix:

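# a character class keeps both word boundaries attached to every alternative
m_char <- regexpr("\\<[A-E]\\>", dd)
m_num <- regexpr("\\<[1-4]\\>", dd)

# start from empty strings so rows without a match stay empty
numbers <- rep("", length(dd))
numbers[m_num > 0] <- regmatches(dd, m_num)

# named room_letters to avoid masking the built-in `letters` constant
room_letters <- rep("", length(dd))
room_letters[m_char > 0] <- regmatches(dd, m_char)

# collapse repeated spaces, then trim; trimws() is base R (3.2.0+)
output <- trimws(gsub(" +", " ", paste("ROOM", room_letters, numbers)))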

[1] "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B" "ROOM B 4" "ROOM B 2" "ROOM C 2" "ROOM C"
[10] "ROOM E 2"
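
A separate point that may explain the stripped letters in the question's output (for example "TLANTA" and "TH IG"): alternation binds more loosely than the word-boundary anchors, so in an ungrouped pattern \\< applies only to the first alternative and \\> only to the last. The sketch below uses a shortened pattern and a single toy string x, both chosen for illustration.

```r
x <- "THE BIG AWESOME ROOM B"

# Ungrouped: parsed as "\\<A" OR "B" OR "E\\>", so B is removed anywhere,
# A only at the start of a word, and E only at the end of a word
gsub("\\<A|B|E\\>", "", x)
# [1] "TH IG WESOM ROOM "

# Grouped: both word boundaries wrap the whole alternation,
# so only a standalone A, B or E is removed
gsub("\\<(A|B|E)\\>", "", x)
# [1] "THE BIG AWESOME ROOM "
```

A character class such as \\<[A-E]\\> has the same effect as the grouped form and is shorter to write.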

#3


0  

Try this:

library(gsubfn)

# extract numbers (num) and room letters (char)
num <- sapply(strapplyc(dd, "\\d|$"), paste, collapse = "")
char <- sapply(strapplyc(dd, "[A-D]|$"), paste, collapse = "")

# put back together and sort
out <- sort(paste("ROOM", char, num))

# trim spaces (optional)
out <- gsub(" +", " ", sub(" *$", "", out))

> out
 [1] "ROOM"     "ROOM 2"   "ROOM 3"   "ROOM A 2" "ROOM A 4" "ROOM B"  
 [7] "ROOM B 2" "ROOM B 4" "ROOM C"   "ROOM C 2"

UPDATE: minor improvements
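
If installing gsubfn is not an option, a similar extraction can be sketched in base R with gregexpr() and regmatches(). This is an alternative, not the answer's own method. The toy input x is hypothetical, and the letter pattern \\<[A-E]\\> is an assumption that differs from the [A-D] used above.

```r
# Toy input (hypothetical), in the original "ROOM letter number" shape
x <- c("ROOM", "ROOM 3", "ROOM A 2")

# gregexpr() finds every match; an element with no match yields character(0),
# which paste(collapse = "") turns into ""
num  <- sapply(regmatches(x, gregexpr("[0-9]", x)), paste, collapse = "")
char <- sapply(regmatches(x, gregexpr("\\<[A-E]\\>", x)), paste, collapse = "")

# put back together, collapse repeated spaces, trim
trimws(gsub(" +", " ", paste("ROOM", char, num)))
# [1] "ROOM"     "ROOM 3"   "ROOM A 2"
```

Because every input element produces exactly one output element, the results stay aligned with their rows and nothing is recycled.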
