Parsing HTML in Swift without using regular expressions

Time: 2022-06-21 00:34:30

Below is the HTML code that I want to parse through in Swift:

<td class="pinyin">
<a href="rsc/audio/voice_pinyin_pz/yi1.mp3">
<span class="mpt1">yī</span></a> 
<a href="rsc/audio/voice_pinyin_pz/yan3.mp3">
<span class="mpt3">yǎn</span>
</a>
</td>

I have read that regex is not a good way to parse HTML, but I have nevertheless written an expression that captures what I want (the letters between the span tags): yī and yǎn

Regex expression:

/pinyin.+<span.+>(.+)<\/.+<span.+>(.+)<\//Us

I was wondering how to implement it in Swift so that I can capture both yī and yǎn at the same time and save them into an array. I was also wondering whether there is another way to do this without regex.
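
For reference, a minimal sketch of the regex route using Foundation's NSRegularExpression. It does not use the expression above verbatim: NSRegularExpression has no ungreedy (/U) flag, so this sketch swaps in a simpler per-span pattern with a lazy quantifier and collects every match. The helper name spanContents is hypothetical, and the sample markup is the snippet from the question.

```swift
import Foundation

// Sample markup copied from the question.
let html = """
<td class="pinyin">
<a href="rsc/audio/voice_pinyin_pz/yi1.mp3">
<span class="mpt1">yī</span></a>
<a href="rsc/audio/voice_pinyin_pz/yan3.mp3">
<span class="mpt3">yǎn</span>
</a>
</td>
"""

// Hypothetical helper: returns the text captured inside every <span>...</span>.
func spanContents(in html: String) -> [String] {
    // `.+?` is lazy, standing in for PCRE's /U flag;
    // .dotMatchesLineSeparators stands in for the /s flag.
    let pattern = "<span[^>]*>(.+?)</span>"
    guard let regex = try? NSRegularExpression(pattern: pattern,
                                               options: [.dotMatchesLineSeparators]) else {
        return []
    }
    let range = NSRange(html.startIndex..., in: html)
    return regex.matches(in: html, options: [], range: range).compactMap { match in
        // Capture group 1 holds the text between the tags.
        Range(match.range(at: 1), in: html).map { String(html[$0]) }
    }
}

let syllables = spanContents(in: html)
print(syllables)
```

This only holds up while the spans are flat and well formed, which is exactly the limitation that makes a real HTML parser the safer choice.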

EDIT:

I ended up using TFHpple as suggested by Rob. It took me a long time to figure out how to import it into Swift, so I thought it would be helpful to post the steps here for convenience:

1. Open your project and drag the TFHpple files into it

2. At this point Xcode will probably prompt you to create a bridging header file if you haven't included any Obj-C code in your current project. In this bridging header file you should add:

#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"

3. Select the target and, under General, find Linked Frameworks and Libraries (just scroll down when you are in the General tab and you will see it). Add libxml2.2.dylib and libxml2.dylib

4. Under Build Settings, in Header Search Paths, add $(SDKROOT)/usr/include/libxml2. WARNING: be sure that it is Header Search Paths and not User Header Search Paths, as they are not the same setting

5. Under Build Settings, in Other Linker Flags, add -lxml2

Enjoy!

2 answers

#1


6  

You can use the typical iOS HTML parser, TFHpple:

let data = NSData(contentsOfFile: path)
let doc = TFHpple(HTMLData: data)
if let elements = doc.searchWithXPathQuery("//td[@class='pinyin']/a/span") as? [TFHppleElement] {
    for element in elements {
        println(element.content)
    }
}

Or you can use NDHpple:

let data = NSData(contentsOfFile: path)!
let html = NSString(data: data, encoding: NSUTF8StringEncoding)!
let doc = NDHpple(HTMLData: html)
if let elements = doc.searchWithXPathQuery("//td/a/span") {
    for element in elements {
        println(element.children?.first?.content)
    }
}

I have more miles with TFHpple, so I'm personally more comfortable with that. NDHpple seems like it theoretically could be an alternative, though I'm not as crazy about it personally (e.g. why does HTMLData parameter take string and not NSData? why do I have to navigate through children to get contents of //td/a/span results? the [@class='pinyin'] qualifier doesn't appear to work, etc.). But, try both and see which you prefer.

Both require a bridging header: TFHpple requires TFHpple.h in the bridging header, and NDHpple requires the libxml headers there. See the documentation for each for more information.

#2


1  

As you've said, you shouldn't use regex to parse HTML; it will go wrong (obligatory link). Just wrap yī within another <span> and you'll see why.
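
To make that concrete, here is a small sketch of what happens once a span is nested. The pattern is an assumed stand-in for a typical lazy span-matching expression, not the asker's exact one, and the nested markup is a made-up variation of the question's snippet.

```swift
import Foundation

// Assumed pattern: lazily capture whatever sits between <span ...> and </span>.
let pattern = "<span[^>]*>(.+?)</span>"

// Hypothetical input: the syllable wrapped in one extra <span>.
let nested = "<span class=\"mpt1\"><span>yī</span></span>"

let regex = try! NSRegularExpression(pattern: pattern,
                                     options: [.dotMatchesLineSeparators])
let fullRange = NSRange(nested.startIndex..., in: nested)

// Collect capture group 1 of every match.
let captured = regex.matches(in: nested, options: [], range: fullRange).compactMap { match in
    Range(match.range(at: 1), in: nested).map { String(nested[$0]) }
}
print(captured)
```

The match starts at the outer opening tag and stops at the first closing tag, so the capture comes back with a stray inner `<span>` in front of the syllable instead of the clean text.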

Instead, you should use a full-blown HTML parser. Make sure to check out How to Parse HTML on iOS for a detailed tutorial.
