Regex to return a number range (e.g. a salary) from unstructured text

Date: 2023-01-16 16:40:39

I'm trying to extract salary information from emails and job adverts.

I need a regex that returns the first salary figure or salary range in the string, without matching phone numbers that might occur later in the same string, e.g.

"blah blah £500 blah".match(regex)
=> "£500"
"balh blah £500-650 blah".match(regex)
=> "£500-650"
"£50 per hour".match(regex) 
=> "£50" (or "£50 per hour" for an advanced version)
"blah blah £50k blah".match(regex)
=> "£50k"
"bblah blah 50-60k".match(regex)
=> "50-60k"
"blah blah 50000 blahblahblah".match(regex)  
=> "50000"
"blah 50000 - 60000 blablahblah 0207-123-4567".match(regex) 
=> "50000 - 60000"
#"blah 350 to 425 blah".match(regex) 
#=> "350 to 425" Can forget this last one as it's a bit of an edge case

I've got this far.

/(£|$)?[0-9][0-9]{0,5}[0-9,k]?(-| - | to )?((£|$)?[0-9][0-9]{0,5}[0,5,k]?)?/

Or a slightly enhanced:

/(£|$)?[0-9][0-9]{0,5}[0-9,k]?(-| - | to )?((£|$)?[0-9][0-9][0,5,k]?)? ?(ph|pw|pa| per|per)? ?(hour|annum|week|month)?/

This sort of works, but it doesn't return just the whole number range; the result also lists matches for the individual groups.

e.g.

"bblah blah 50-60k".match(regex)
=> #<MatchData "50-60k" 1:nil 2:"-" 3:"60k" 4:nil>

I want it to just say

"50-60k"

What am I missing, and is there a more elegant way to do this?

2 solutions

#1


I concluded it would be best to do this in two steps. First, let's remove phone numbers:

r0 = /
     \d+        # Match one or more digits
     (?:        # Begin a non-capture group
       \s*\-\s* # Match a hyphen optionally surrounded with spaces
       \d+      # Match one or more digits
     ){2}       # Close non-capture and perform it twice
    /x          # Extended/free-spacing mode

arr0 = ["blah blah £500 blah",
        "balh blah £500-650 blah",
        "£50 per hour",
        "blah blah £50k blah",
        "bblah blah 50-60k",
        "blah blah 50000 blahblahblah",
        "blah 50000 - £60000 blablahblah 0207-123-4567",
        "bblah blah 50k-£60k"
       ]

arr1 = arr0.map { |str| str.gsub(r0,'') }
  #=> ["blah blah £500 blah",
  #    "balh blah £500-650 blah",
  #    "£50 per hour",
  #    "blah blah £50k blah",
  #    "bblah blah 50-60k",
  #    "blah blah 50000 blahblahblah",
  #    "blah 50000 - £60000 blablahblah ",
  #    "bblah blah 50k-£60k"] 

I've assumed that all phone numbers are three strings of digits separated by hyphens that are optionally surrounded with spaces. If that assumption is not correct you would of course have to modify r0 as appropriate.
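
To make that assumption concrete, here is a small sketch of what r0 does and does not strip. The three sample strings are hypothetical (they are not from the question), and r0 is written on one line only for brevity.

```ruby
# Same pattern as r0 above, written on one line.
r0 = /\d+(?:\s*-\s*\d+){2}/

# Hypothetical inputs chosen for illustration.
hyphenated = "call 0207-123-4567"
spaced     = "call 020 7123 4567"
salary     = "pay 50000 - 60000"

stripped = [hyphenated, spaced, salary].map { |s| s.gsub(r0, '') }
  #=> ["call ", "call 020 7123 4567", "pay 50000 - 60000"]
```

A phone number written with spaces survives, so r0 would need extending for that format. A salary range with a single hyphen is left alone because r0 requires two hyphens.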

Now extract the values desired from the elements of arr1:

r1 = /
     £?         # Optionally begin with a pound sign
     \d+k?      # Match one or more digits optionally followed by k
     (?:        # Begin non-capture group
       \s*\-\s* # Match a hyphen optionally surrounded with spaces
       \d+k?    # Match one or more digits optionally followed by k
     )?         # End non-capture group and make the match optional
     \b         # Word boundary
     /x         # Extended/free-spacing mode

arr1.map { |s| s[r1] }
  #=> ["£500", "£500-650", "£50", "£50k", "50-60k", "50000", "50000", "50k"] 
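
Two things about that output may be worth spelling out. First, the last two results stop at the lower bound ("50000", "50k") because r1 does not allow a pound sign before the second number. Second, s[r1] returns a plain String, whereas String#match returns a MatchData object, which is what the question was seeing. The sketch below uses a hypothetical variant, r1_ext, that is my own assumption and not part of the answer: it also accepts "£" on the upper bound and the word "to" as a separator.

```ruby
# r1_ext is a hypothetical variant of r1 (an assumption for illustration).
r1_ext = /
  £?\d+k?            # Lower bound: optional pound sign, digits, optional k
  (?:
    \s*(?:-|to)\s*   # Hyphen or the word to, optionally surrounded by spaces
    £?\d+k?          # Upper bound, again with optional pound sign and k
  )?
  \b                 # Word boundary
/x

samples = ["blah 50000 - £60000 blablahblah ",
           "bblah blah 50k-£60k",
           "blah 350 to 425 blah",
           "blah blah £500 blah"]

extracted = samples.map { |s| s[r1_ext] }
  #=> ["50000 - £60000", "50k-£60k", "350 to 425", "£500"]

# String#match returns MatchData; element 0 is the whole match. Because the
# pattern uses only non-capturing groups, there are no captures to list.
md = "bblah blah 50-60k".match(r1_ext)
whole    = md[0]        #=> "50-60k"
captures = md.captures  #=> []
```

Accepting "to" makes the pattern greedier, so text such as "5 to 10 years" would also match; whether that trade-off is acceptable depends on the input.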

#2


Here is the new update.

((?:£)\d{1,}\-(?:\d{1,}(?:\-|\w))|\d{1,} - \d{1,}|\d{1,} to \d{1,}|(?<=\s)\d{1,}\-\d{1,}(?:k|(?=\s))|(?:£)\d{1,}(?:\w)|(?<=\s)\d{1,}(?=\s))

This should work for all of the examples in the question.

Here is a version that also matches "per hour":

((?:£)\d{1,}\-(?:\d{1,}(?:\-|\w))|\d{1,} - \d{1,}|\d{1,} to \d{1,}|(?<=\s)\d{1,}\-\d{1,}(?:k|(?=\s))|(?:£)\d{1,}(?:\w| per hour)|(?<=\s)\d{1,}(?=\s))
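
One way to check this pattern against the question's examples in Ruby is sketched below. The constant name SALARY_RE and the use of String#[] to pull out the whole match are my own choices, not part of the answer.

```ruby
# The "per hour" pattern from above as a Ruby literal.
# SALARY_RE is a hypothetical name chosen for this sketch.
SALARY_RE = /((?:£)\d{1,}\-(?:\d{1,}(?:\-|\w))|\d{1,} - \d{1,}|\d{1,} to \d{1,}|(?<=\s)\d{1,}\-\d{1,}(?:k|(?=\s))|(?:£)\d{1,}(?:\w| per hour)|(?<=\s)\d{1,}(?=\s))/

examples = ["blah blah £500 blah",
            "balh blah £500-650 blah",
            "£50 per hour",
            "blah blah £50k blah",
            "bblah blah 50-60k",
            "blah blah 50000 blahblahblah",
            "blah 50000 - 60000 blablahblah 0207-123-4567",
            "blah 350 to 425 blah"]

# String#[] with a Regexp returns the whole matched substring (or nil).
results = examples.map { |s| s[SALARY_RE] }
  #=> ["£500", "£500-650", "£50 per hour", "£50k", "50-60k",
  #    "50000", "50000 - 60000", "350 to 425"]

# A bare number needs whitespace on both sides, so a number at the very
# start of the string is not matched.
at_start = "50000 pa"[SALARY_RE]  #=> nil
```

The last line shows a limitation outside the question's examples: a bare number only matches when whitespace precedes and follows it.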
