Retrieving a string using REGEX in Python 2.7.2

Time: 2021-08-04 18:16:27

I have the following code snippet from a page's source:

var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); 

The string 'PDFObject(' is unique on the page. I want to retrieve the value of url using a regex. In this case I need to get

http://www.site.com/doc55.pdf

Please help.

7 Solutions

#1


0  

To find "something that appears on the line after something else", the pattern has to match across the newline. For this you use the DOTALL modifier, a flag passed when the pattern is compiled.

Thus the following code works:

import re
r = re.compile(r'(?<=PDFObject).*?url:.*?(http.*?)"', re.DOTALL)
s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); '''

print r.findall(s)

Explanation:

r = re.compile(         compile the regular expression
    r'                  raw string literal, so backslashes reach the regex engine unchanged
    (?<=PDFObject)      lookbehind: the match starts right after PDFObject
    .*?                 then as few characters as possible...
    url:                followed by the literal string url:
    .*?                 then as few characters as possible (`?` makes `*` non-greedy) up to
    (http.*?)"          the string starting with http, captured up to (but not including) the next "
    ',                  end of the regex string, but there's more...
    re.DOTALL)          set the DOTALL flag - this means the dot matches all characters
                        including newlines. This allows the match to continue from one line
                        to the next in the .*? right after the lookbehind
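
To see what the DOTALL flag changes, here is a minimal sketch that runs the same pattern with and without it. The names page, pattern, without_dotall and with_dotall are made up for illustration, and the sample input is shortened. The print calls use parentheses so the snippet runs on both Python 2.7 and Python 3.

```python
import re

# Hypothetical, shortened sample of the page source from the question.
page = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

pattern = r'(?<=PDFObject).*?url:.*?(http.*?)"'

# Without DOTALL, '.' stops at the newline after 'PDFObject({',
# so the pattern can never reach 'url:' on the next line.
without_dotall = re.findall(pattern, page)

# With DOTALL, '.' also matches newlines, so the match can span lines.
with_dotall = re.findall(pattern, page, re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # ['http://www.site.com/doc55.pdf']
```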

#2


3  

Here is an alternative for solving your problem without using regex:

url,in_object = None, False
with open('input') as f:
    for line in f:
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            url = line.split('"')[1]
            break
print url
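
If the page source is already in a string rather than in a file, the same line-by-line idea could be wrapped in a function. This is only a sketch: the function name find_pdf_url and the sample string page are assumptions for illustration, and it returns None instead of raising when no complete quoted value is found.

```python
def find_pdf_url(source):
    """Return the first quoted value on a 'url:' line at or after 'PDFObject('."""
    in_object = False
    for line in source.splitlines():
        # Start paying attention once the unique marker has been seen.
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            parts = line.split('"')
            if len(parts) >= 3:  # the line holds a complete "quoted" value
                return parts[1]
    return None


# Hypothetical, shortened sample of the page source from the question.
page = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

print(find_pdf_url(page))
```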

#3


0  

Using a combination of look-behind and look-ahead assertions (s holds the page source from the question):

>>> import re
>>> re.search(r'(?<=url:).*?(?=",)', s).group().strip('" ')
'http://www.site.com/doc55.pdf'

#4


0  

This works:

import re

src='''\
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
URL: "http://www.site.com/doc52.PDF",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); '''   

print [m.group(1).strip('"') for m in 
        re.finditer(r'^url:\s*(.*)[\W]$',
        re.search(r'PDFObject\(\{(.*)',src,re.M | re.S | re.I).group(1),re.M|re.I)]

prints:

['http://www.site.com/doc55.pdf', 'http://www.site.com/doc52.PDF']

#5


0  

Regex

new\s+PDFObject\(\{\s*url:\s*"[^"]+"

Extract url only
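
A minimal sketch of extracting only the URL, assuming a capturing group is placed around the quoted value in the regex above. The names page, URL_RE, match and url are made up for illustration. No DOTALL flag is needed here because \s already matches the newline between the brace and url:.

```python
import re

# Hypothetical, shortened sample of the page source from the question.
page = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

# Same regex as above, with parentheses around the part between the quotes.
URL_RE = re.compile(r'new\s+PDFObject\(\{\s*url:\s*"([^"]+)"')

match = URL_RE.search(page)
# search() returns None when nothing matches, so guard before calling group().
url = match.group(1) if match else None
print(url)
```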

#6


0  

If 'PDFObject(' is the unique identifier in the page, you only have to match the first quoted string that follows it.

Using the DOTALL flag (re.DOTALL or re.S) and the non-greedy star (*?), you can write:

import re

snippet = '''                                    
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder");
'''

# First version using unnamed groups
RE_UNNAMED = re.compile(r'PDFObject\(.*?"(.*?)"', re.S)

# Second version using named groups
RE_NAMED = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)

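# A compiled pattern already carries its flags. The second argument of
# pattern.search() is a start position, so re.S must not be passed here.
RE_UNNAMED.search(snippet).group(1)
RE_NAMED.search(snippet).group('url')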
# result for both: 'http://www.site.com/doc55.pdf'

If you don't want to compile your regex because it's used only once, simply use this syntax:

re.search(r'PDFObject\(.*?"(.*?)"', snippet, re.S).group(1)
re.search(r'PDFObject\(.*?"(?P<url>.*?)"', snippet, re.S).group('url')

Four choices; one should match your needs and taste!
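
All of these variants call .group() directly on the result of search(), which raises AttributeError when the page contains no PDFObject( at all. A small wrapper can guard against that. This is a sketch under assumptions: the names PDF_URL_RE and extract_pdf_url and the default parameter are not from the answer above.

```python
import re

# Named-group version of the pattern; re.S lets '.' cross line breaks.
PDF_URL_RE = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)


def extract_pdf_url(page_source, default=None):
    """Return the first quoted string after 'PDFObject(' or default."""
    match = PDF_URL_RE.search(page_source)
    if match is None:
        # search() found nothing, so there is no group to read.
        return default
    return match.group('url')


print(extract_pdf_url('new PDFObject({\nurl: "http://www.site.com/doc55.pdf",\n})'))
print(extract_pdf_url('<html></html>'))
```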

#7


0  

Although the other answers may appear to work, most do not take into account that the only unique thing on the page is 'PDFObject('. A much better regular expression would be the following:

PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",

It anchors on 'PDFObject(', which is unique on the page, and includes some basic URL validation.

Below is an example of how this regex could be used in Python:

>>> import re
>>> strs = """var myPDF = new PDFObject({
... url: "http://www.site.com/doc55.pdf",
...   id: "pdfObjectContainer",
...   width: "100%",
...   height: "700px",
...   pdfOpenParams: {
...     navpanes: 0,
...     statusbar: 1,
...     toolbar: 1,
...     view: "FitH"
...   }
... }).embed("pdf_placeholder");"""
>>> re.search(r'PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",',strs).group(1)
'http://www.site.com/doc55.pdf'

A pure Python (no regex) alternative would be:

>>> unique = 'PDFObject({\nurl: "'
>>> start = strs.find(unique) + len(unique)
>>> end = start + strs[start:].find('"')
>>> strs[start:end]
'http://www.site.com/doc55.pdf'

A no-regex one-liner:

>>> (lambda u:(lambda s:(lambda e:strs[s:e])(s+strs[s:].find('"')))(strs.find(u)+len(u)))('PDFObject({\nurl: "')
'http://www.site.com/doc55.pdf'
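
The find()-based versions above assume the marker is present and is followed by exactly a newline and url: ". When the marker is missing, find() returns -1 and the slice silently yields the wrong text. A more forgiving sketch using str.partition is shown below. The function name url_after_marker and its marker parameter are assumptions for illustration, and it takes the first quoted string after the marker instead of checking for url: specifically.

```python
def url_after_marker(text, marker='PDFObject('):
    """Return the first double-quoted string after marker, or None."""
    # partition() returns (before, marker, after); the middle element
    # is empty when the marker does not occur in the text.
    _, found, rest = text.partition(marker)
    if not found:
        return None
    # Split on at most two quotes: [before quote, quoted value, remainder].
    pieces = rest.split('"', 2)
    return pieces[1] if len(pieces) == 3 else None


# Hypothetical, shortened sample of the page source from the question.
strs = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

print(url_after_marker(strs))
```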
