Parsing JSON objects in Python without the JSON library (using only regex)

I'm currently building a small application using the Instagram API which replies with JSON "objects" for the GET operations. To get the response I'm currently using urllib2.

This is part of an assignment for one of the courses I'm currently attending, and the biggest challenge is that we are not allowed to use the JSON library to parse and retrieve the information from the Instagram response. We are required to use the regex library (and only that) to parse the information.

The Instagram response format for a user's feed page, for example, follows the structure shown in this link.

I have honestly spent 3 hours trying to figure this out by myself and searching the internet, but most answered questions just point to using the JSON library.

Any tips or suggestions would come in handy.

Additionally, other than urllib2 (which may be considered external), I am not allowed to use any external (that is, third-party) library beyond the ones that ship with Python 2.7.

Thanks in advance.

2 solutions

#1

It's not really that complicated. When you make the GET request you get back a large block of text, of which you only need small parts. For example, to parse a user's news feed and get the images and their captions:

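import re
from urllib2 import urlopen  # Python 2.7 standard library

# profile_id and token are assumed to be defined earlier (user id and API access token).
query = "https://api.instagram.com/v1/users/"+profile_id+"/media/recent?access_token="+token
response = urlopen(query)
the_page = response.read()
feed = {}
feed['images'] = []
feed['captions'] = []
# One image URL per media item; the brace is escaped so it is read literally.
matchImage = re.findall(r'"standard_resolution":\{"url":"(.*?)"', the_page)
# Group 1 is 'null' when there is no caption; otherwise the code expects
# group 2 to hold the "text" field of the caption object.
matchCaption = re.findall(r'"caption":(.*?),(.*?),', the_page)
for x in xrange(0, min(len(matchImage), len(matchCaption))):
    # Strip the backslashes JSON uses to escape '/' in URLs.
    image = matchImage[x].replace('\\','')
    feed['images'].append(image)
    caption = re.search(r'"text":"(.*?)"', matchCaption[x][1])
    if matchCaption[x][0] == 'null' or caption is None:
        feed['captions'].append('No Caption')
    else:
        feed['captions'].append(caption.group(1).replace('\\',''))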
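
The comma-based caption pattern above stops working as soon as a caption contains a comma or an escaped quote. What follows is a minimal sketch of a stricter variant, not a tested drop-in replacement. It assumes the "text" field appears before any nested object inside "caption", and the sample body and the names STRING_BODY, unescape and extract_feed are made up for illustration (they are not part of the Instagram API). It runs offline against a string, so it can be tried without a token.

```python
import re

# Body of a JSON string: plain characters or backslash escape pairs, no bare quote.
STRING_BODY = r'((?:[^"\\]|\\.)*)'
IMAGE_RE = re.compile(r'"standard_resolution":\s*\{\s*"url":\s*"' + STRING_BODY + '"')
# Either a null caption, or a caption object whose "text" field comes
# before any nested object (an assumption about the response layout).
CAPTION_RE = re.compile(r'"caption":\s*(?:null|\{[^{}]*?"text":\s*"' + STRING_BODY + '")')


def unescape(raw):
    # Handles only \" \\ and \/ ; other escapes (\n, \uXXXX) are left untouched.
    return re.sub(r'\\(["\\/])', r'\1', raw)


def extract_feed(body):
    """Pair each image URL with its caption text, or 'No Caption' for null."""
    feed = {'images': [], 'captions': []}
    images = IMAGE_RE.finditer(body)
    captions = CAPTION_RE.finditer(body)
    for image, caption in zip(images, captions):
        feed['images'].append(unescape(image.group(1)))
        text = caption.group(1)  # None when the caption was null
        feed['captions'].append('No Caption' if text is None else unescape(text))
    return feed


# Hypothetical, heavily trimmed response body used only to exercise the patterns.
SAMPLE = (
    '{"data":['
    '{"images":{"standard_resolution":{"url":"http:\\/\\/example.com\\/a.jpg","width":612}},'
    '"caption":{"created_time":"1296710352","text":"Lunch, then \\"coffee\\"","id":"1"}},'
    '{"images":{"standard_resolution":{"url":"http:\\/\\/example.com\\/b.jpg","width":612}},'
    '"caption":null,"id":"2"}'
    ']}'
)

feed = extract_feed(SAMPLE)
```

Pairing images with captions by position still relies on every media item having exactly one of each, the same assumption the original loop makes.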

#2

How about using a functional parser library and a bit of regex?

def parse(seq):
    'Sequence(Token) -> object'
    ...
    n = lambda s: a(Token('Name', s)) >> tokval
    def make_array(n):
        if n is None:
            return []
        else:
            return [n[0]] + n[1]
    ...
    null = n('null') >> const(None)
    true = n('true') >> const(True)
    false = n('false') >> const(False)
    number = toktype('Number') >> make_number
    string = toktype('String') >> make_string
    value = forward_decl()
    member = string + op_(':') + value >> tuple
    object = (
        op_('{') +
        maybe(member + many(op_(',') + member)) +
        op_('}')
        >> make_object)
    array = (
        op_('[') +
        maybe(value + many(op_(',') + value)) +
        op_(']')
        >> make_array)
    value.define(
          null
        | true
        | false
        | object
        | array
        | number
        | string)
    json_text = object | array
    json_file = json_text + skip(finished)

    return json_file.parse(seq)

You will need the funcparserlib library for this.

Note: Doing this with pure regex alone is too hard. You need to write some kind of "parser", so you may as well use a parser library to help with the boring bits.
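
Since funcparserlib is itself a third-party library, the "parser" mentioned in the note can also be written by hand with nothing but re. The following is a minimal sketch under that assumption, not a complete or validated JSON implementation: one regex splits the text into tokens, and a few recursive functions turn the tokens into Python dicts and lists. All names (tokenize, parse_value, parse_json and so on) are illustrative, and \uXXXX escapes are deliberately left undecoded.

```python
import re

# One named alternative per token kind; whitespace is matched so it can be skipped.
TOKEN_RE = re.compile(r'''
    (?P<string>"(?:[^"\\]|\\.)*")
  | (?P<number>-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)
  | (?P<name>true|false|null)
  | (?P<op>[{}\[\],:])
  | (?P<space>\s+)
''', re.VERBOSE)
ESCAPES = {'"': '"', '\\': '\\', '/': '/', 'b': '\b',
           'f': '\f', 'n': '\n', 'r': '\r', 't': '\t'}


def tokenize(text):
    """Split JSON text into (kind, text) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError('unexpected character at offset %d' % pos)
        if m.lastgroup != 'space':
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def unescape(raw):
    # Strip the quotes and decode simple escapes; \uXXXX is left as-is in this sketch.
    return re.sub(r'\\(.)', lambda m: ESCAPES.get(m.group(1), m.group(0)), raw[1:-1])


def peek(tokens, i):
    return tokens[i] if i < len(tokens) else (None, None)


def expect(tokens, i, text):
    if peek(tokens, i) != ('op', text):
        raise ValueError('expected %r at token %d' % (text, i))
    return i + 1


def parse_value(tokens, i):
    """Return (value, next_index) for the JSON value starting at tokens[i]."""
    kind, text = peek(tokens, i)
    if kind == 'string':
        return unescape(text), i + 1
    if kind == 'number':
        return (float(text) if re.search(r'[.eE]', text) else int(text)), i + 1
    if kind == 'name':
        return {'true': True, 'false': False, 'null': None}[text], i + 1
    if (kind, text) == ('op', '{'):
        return parse_members(tokens, i + 1)
    if (kind, text) == ('op', '['):
        return parse_items(tokens, i + 1)
    raise ValueError('unexpected token %r at position %d' % (text, i))


def parse_items(tokens, i):
    items = []
    if peek(tokens, i) == ('op', ']'):
        return items, i + 1
    while True:
        value, i = parse_value(tokens, i)
        items.append(value)
        if peek(tokens, i) != ('op', ','):
            return items, expect(tokens, i, ']')
        i += 1


def parse_members(tokens, i):
    members = {}
    if peek(tokens, i) == ('op', '}'):
        return members, i + 1
    while True:
        if peek(tokens, i)[0] != 'string':
            raise ValueError('object key must be a string at token %d' % i)
        key, i = parse_value(tokens, i)
        i = expect(tokens, i, ':')
        members[key], i = parse_value(tokens, i)
        if peek(tokens, i) != ('op', ','):
            return members, expect(tokens, i, '}')
        i += 1


def parse_json(text):
    tokens = tokenize(text)
    value, i = parse_value(tokens, 0)
    if i != len(tokens):
        raise ValueError('trailing data after JSON value')
    return value
```

With this in place, parse_json('{"caption": null, "likes": {"count": 15}}') returns an ordinary dict, so the count is reachable as result['likes']['count']. The regex only recognises tokens; the nesting is handled by the recursion, which is the part a single regular expression cannot express.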
