[hyperscan] Example walkthrough: pcapscan

Example location: <hyperscan source>/examples/pcapscan.cc
Reference: http://01org.github.io/hyperscan/dev-reference/api_files.html

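1. Overview

This example implements a simple program for measuring packet-matching performance.

pcapscan uses libpcap to read packets from a pcap file, matches them against the regular expressions listed in a rule file, and prints the match results along with some statistics. It runs and compares two matching modes: BLOCK and STREAM. In BLOCK mode each packet is scanned on its own. In STREAM mode the packets are first split into flows by their five-tuple, and the data of each flow is scanned as one stream. STREAM mode can therefore find matches that span packet boundaries. For example, when matching abc, if a is at the end of one packet and bc is at the start of the next packet of the same flow, STREAM mode reports the match and BLOCK mode does not.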
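The difference between the two modes can be seen without Hyperscan at all. The sketch below is a plain C++ stand-in, not Hyperscan code: LiteralStreamMatcher, countPerPacket and countAsStream are hypothetical names, and the matcher only handles a single literal (using the KMP algorithm) rather than regular expressions. It only illustrates what "keeping state between packets" buys you.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a streaming matcher. It handles one literal
// pattern only and says nothing about how Hyperscan works internally.
// Its state (how many pattern characters are matched so far) survives
// between feed() calls, which is the essence of stream-style scanning.
class LiteralStreamMatcher {
public:
    explicit LiteralStreamMatcher(std::string pattern)
        : pattern_(std::move(pattern)), fail_(pattern_.size(), 0) {
        // Standard KMP failure table.
        for (std::size_t i = 1, k = 0; i < pattern_.size(); ++i) {
            while (k > 0 && pattern_[i] != pattern_[k]) k = fail_[k - 1];
            if (pattern_[i] == pattern_[k]) ++k;
            fail_[i] = k;
        }
    }

    // Feeds one chunk and returns the number of matches ending inside it.
    std::size_t feed(const std::string &chunk) {
        if (pattern_.empty()) return 0;  // nothing sensible to match
        std::size_t matches = 0;
        for (char c : chunk) {
            while (matched_ > 0 && c != pattern_[matched_])
                matched_ = fail_[matched_ - 1];
            if (c == pattern_[matched_]) ++matched_;
            if (matched_ == pattern_.size()) {
                ++matches;
                matched_ = fail_[matched_ - 1];
            }
        }
        return matches;
    }

private:
    std::string pattern_;
    std::vector<std::size_t> fail_;
    std::size_t matched_ = 0;
};

// Block-style: every packet gets a fresh matcher, so no state carries over.
std::size_t countPerPacket(const std::vector<std::string> &packets,
                           const std::string &pattern) {
    std::size_t matches = 0;
    for (const auto &pkt : packets) {
        LiteralStreamMatcher matcher(pattern);
        matches += matcher.feed(pkt);
    }
    return matches;
}

// Stream-style: one matcher sees all packets of the flow in order.
std::size_t countAsStream(const std::vector<std::string> &packets,
                          const std::string &pattern) {
    LiteralStreamMatcher matcher(pattern);
    std::size_t matches = 0;
    for (const auto &pkt : packets) matches += matcher.feed(pkt);
    return matches;
}
```

With the two packets "xxa" and "bcx" and the pattern abc, countPerPacket finds nothing while countAsStream finds one match, which is the behaviour described above.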

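This example demonstrates the following Hyperscan concepts:

Compiling multiple patterns
Unlike the simplegrep example, pcapscan reads and compiles several regular expressions from a rule file. At run time the compiled database matches all of the patterns in parallel, rather than one pattern per scan.
Streaming-mode matching
This covers setting up the per-stream state and using the match callback in streaming mode.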

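2. Source walkthrough

The pcapscan source is explained below, roughly in the order in which the code executes.

2.1 Compilation

The buildDatabase function compiles the regular expressions from the rule file. Its mode parameter selects BLOCK or STREAM mode.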

static hs_database_t *buildDatabase(const vector<const char *> &expressions,
                                    const vector<unsigned> flags,
                                    const vector<unsigned> ids,
                                    unsigned int mode) {
    hs_database_t *db;
    hs_compile_error_t *compileErr;
    hs_error_t err;

    Clock clock;
    clock.start();

    err = hs_compile_multi(expressions.data(), flags.data(), ids.data(),
                           expressions.size(), mode, nullptr, &db, &compileErr);

    clock.stop();

    if (err != HS_SUCCESS) {
        if (compileErr->expression < 0) {
            // The error does not refer to a particular expression.
            cerr << "ERROR: " << compileErr->message << endl;
        } else {
            cerr << "ERROR: Pattern '" << expressions[compileErr->expression]
                 << "' failed compilation with error: " << compileErr->message
                 << endl;
        }
        // As the compileErr pointer points to dynamically allocated memory, if
        // we get an error, we must be sure to release it. This is not
        // necessary when no error is detected.
        hs_free_compile_error(compileErr);
        exit(-1);
    }
    // ...
}

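The core of this function is the call to hs_compile_multi, which compiles multiple regular expressions. As the code shows, BLOCK and STREAM modes use the same API and differ only in the mode argument. Its prototype is: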

hs_error_t hs_compile_multi(const char *const * expressions,
                            const unsigned int * flags,
                            const unsigned int * ids,
                            unsigned int elements,
                            unsigned int mode,
                            const hs_platform_info_t * platform,
                            hs_database_t ** db,
                            hs_compile_error_t ** error)

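Here expressions is the array of regular expression strings, flags and ids are the arrays of flags and IDs that correspond to those expressions, and elements is the number of expressions. The remaining parameters have the same meaning as those of hs_compile, described in the previous example.

One thing to note is the ids parameter, the array of expression IDs. Each expression is normally given its own ID, so that when a match occurs the match callback receives that ID and can tell the caller which expression matched. If NULL is passed for ids, every expression gets ID 0.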
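To make the role of the ID concrete, here is a small sketch of a callback that counts matches per expression ID. The parameter list is the one Hyperscan's match_event_handler type uses; the MatchStats struct and the countById name are hypothetical and not part of pcapscan, whose own onMatch callback simply increments a single counter. The sketch deliberately avoids including hs.h so that it compiles on its own.

```cpp
#include <cstddef>
#include <map>

// Hypothetical context object handed to the scan call as its last argument.
struct MatchStats {
    std::map<unsigned int, std::size_t> hitsPerId;
};

// Same parameter list as Hyperscan's match_event_handler callback type.
// Returning 0 lets scanning continue; a non-zero value asks it to stop.
static int countById(unsigned int id, unsigned long long from,
                     unsigned long long to, unsigned int flags, void *ctx) {
    (void)from;   // start offset of the match (unused here)
    (void)to;     // end offset of the match (unused here)
    (void)flags;  // unused here
    auto *stats = static_cast<MatchStats *>(ctx);
    ++stats->hitsPerId[id];  // id is the value supplied in the ids array
    return 0;
}
```

Passing countById and a pointer to a MatchStats object where pcapscan passes onMatch and &matchCount would yield a per-pattern breakdown instead of a single total.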

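2.2 Preparing scratch space for matching

The Benchmark constructor allocates enough temporary working memory (scratch space) for the scans that follow. There is a trick here: 1) BLOCK and STREAM scans share a single scratch; 2) to make that scratch large enough for both, hs_alloc_scratch is called twice, once per database. On the second call Hyperscan enlarges the scratch if the existing space is not sufficient.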

public:
    Benchmark(const hs_database_t *streaming, const hs_database_t *block)
        : db_streaming(streaming), db_block(block), scratch(nullptr),
          matchCount(0) {
        // Allocate enough scratch space to handle either streaming or block
        // mode, so we only need the one scratch region.
        hs_error_t err = hs_alloc_scratch(db_streaming, &scratch);
        if (err != HS_SUCCESS) {
            cerr << "ERROR: could not allocate scratch space. Exiting." << endl;
            exit(-1);
        }
        // This second call will increase the scratch size if more is required
        // for block mode.
        err = hs_alloc_scratch(db_block, &scratch);
        if (err != HS_SUCCESS) {
            cerr << "ERROR: could not allocate scratch space. Exiting." << endl;
            exit(-1);
        }
    }

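2.3 Reading packets and splitting them into flows

Benchmark::readStreams reads all packets from the pcap file (the encapsulation has to be ethernet-ipv4-tcp/udp) and splits them into flows by five-tuple. The main code is: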

        while ((pktData = pcap_next(pcapHandle, &pktHeader)) != nullptr) {
            unsigned int offset = 0, length = 0;
            if (!payloadOffset(pktData, &offset, &length)) {
                continue;
            }

            // Valid TCP or UDP packet
            const struct ip *iphdr = (const struct ip *)(pktData
                    + sizeof(struct ether_header));
            const char *payload = (const char *)pktData + offset;

            size_t id = stream_map.insert(std::make_pair(FiveTuple(iphdr),
                                          stream_map.size())).first->second;

            packets.push_back(string(payload, length));
            stream_ids.push_back(id);
        }

Note that the stream_ids vector stores, for each packet, the ID of the stream it belongs to.
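The insert(...).first->second expression in the loop is compact enough to be worth spelling out. Below is a standalone sketch of the same idiom. FlowKey and FlowTable are hypothetical names, and std::map with an ordering operator is used here only for brevity; this is not a copy of the FiveTuple type or the container that pcapscan.cc defines.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>

// Hypothetical flow key; pcapscan fills its FiveTuple from the IP header.
// In this sketch the two directions of a connection are distinct keys.
struct FlowKey {
    std::uint8_t protocol;
    std::uint32_t srcAddr, dstAddr;
    std::uint16_t srcPort, dstPort;

    bool operator<(const FlowKey &o) const {
        return std::tie(protocol, srcAddr, srcPort, dstAddr, dstPort) <
               std::tie(o.protocol, o.srcAddr, o.srcPort, o.dstAddr, o.dstPort);
    }
};

class FlowTable {
public:
    // Returns the id already assigned to a known flow, or assigns the next
    // free id (0, 1, 2, ...) to a flow seen for the first time.
    std::size_t idFor(const FlowKey &key) {
        // insert() leaves the map untouched when the key is already present,
        // so .first->second is the flow's id in both cases. ids_.size() is
        // evaluated before the insertion happens, giving dense ids.
        return ids_.insert(std::make_pair(key, ids_.size())).first->second;
    }

    std::size_t flowCount() const { return ids_.size(); }

private:
    std::map<FlowKey, std::size_t> ids_;
};
```

Because the ids are dense and start at zero, they can index directly into a vector of per-flow objects, which is how the streams vector is used in the following sections.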

2.4 Opening the streams

Because STREAM mode is used, the streams must be opened before scanning; see Benchmark::openStreams:

    // Open a Hyperscan stream for each stream in stream_ids
    void openStreams() {
        streams.resize(stream_map.size());
        for (auto &stream : streams) {
            hs_error_t err = hs_open_stream(db_streaming, 0, &stream);
            if (err != HS_SUCCESS) {
                cerr << "ERROR: Unable to open stream. Exiting." << endl;
                exit(-1);
            }
        }
    }

Here streams has type vector<hs_stream_t *>.

2.5 Matching

2.5.1 STREAM mode

In Benchmark::scanStreams:

    // Scan each packet (in the ordering given in the PCAP file) through
    // Hyperscan using the streaming interface.
    void scanStreams() {
        for (size_t i = 0; i != packets.size(); ++i) {
            const std::string &pkt = packets[i];
            hs_error_t err = hs_scan_stream(streams[stream_ids[i]],
                                            pkt.c_str(), pkt.length(), 0,
                                            scratch, onMatch, &matchCount);
            if (err != HS_SUCCESS) {
                cerr << "ERROR: Unable to scan packet. Exiting." << endl;
                exit(-1);
            }
        }
    }

The prototype of hs_scan_stream:

hs_error_t hs_scan_stream(hs_stream_t * id,
                          const char * data,
                          unsigned int length,
                          unsigned int flags,
                          hs_scratch_t * scratch,
                          match_event_handler onEvent,
                          void * ctxt)

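Here id is the hs_stream_t pointer of the stream the data belongs to (calling it id does not strike me as a good name). The remaining parameters are the same as for hs_scan.

The streams[stream_ids[i]] entry used in this call was already initialized in the previous step, when the streams were opened.

2.5.2 BLOCK mode

BLOCK mode is much simpler than STREAM mode. In Benchmark::scanBlock: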

    // Scan each packet (in the ordering given in the PCAP file) through
    // Hyperscan using the block-mode interface.
    void scanBlock() {
        for (size_t i = 0; i != packets.size(); ++i) {
            const std::string &pkt = packets[i];
            hs_error_t err = hs_scan(db_block, pkt.c_str(), pkt.length(), 0,
                                     scratch, onMatch, &matchCount);
            if (err != HS_SUCCESS) {
                cerr << "ERROR: Unable to scan packet. Exiting." << endl;
                exit(-1);
            }
        }
    }

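hs_scan was already covered in the simplegrep walkthrough, so it is not repeated here.

2.6 Releasing resources

This includes closing the streams (hs_close_stream), freeing the database, and so on. Note that hs_close_stream can still report matches, for example for patterns anchored to the end of the data, which is why it also takes a scratch and a callback.

3. Summary of STREAM mode

STREAM mode is somewhat more involved to use than BLOCK mode. Here is a brief summary in pseudocode: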

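// N is the number of streams, decided in advance
hs_database_t *db;
hs_stream_t   *streams[N];
hs_scratch_t  *tmp = NULL;   // must be NULL before the first hs_alloc_scratch
uint8_t       *pkt;
unsigned int   pkt_len;

// 1) Compile multiple regular expressions
hs_compile_multi(expressions, flags, ids, count, HS_MODE_STREAM, NULL, &db, &compile_err);

// 2) Prepare the scratch
hs_alloc_scratch(db, &tmp);

// 3) Open the streams
for (i = 0; i < N; i++)
  hs_open_stream(db, 0, &streams[i]);

// 4) Receive a packet and assign it to a stream
stream_id = classify(pkt);

// 5) Scan in streaming mode
hs_scan_stream(streams[stream_id], (const char *)pkt, pkt_len, 0, tmp, callBack, ctx);

// 6) Release resources; note that hs_close_stream may still report matches
for (i = 0; i < N; i++)
  hs_close_stream(streams[i], tmp, callBack, ctx);
hs_free_scratch(tmp);
hs_free_database(db);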

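hs_database_size() and hs_stream_size() return the size of the database and the size of the stream state of each stream, respectively. The number and complexity of the regular expressions affect the stream state size, which may keep growing as either increases. On a system that handles millions of flows with a complex rule file, the memory consumed by stream state can be very large.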
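As a back-of-the-envelope check, the total is simply the number of concurrently open streams multiplied by the per-stream size. The helper below is a hypothetical illustration (streamStateBytes is not a Hyperscan function); in a real program bytesPerStream would be the value reported by hs_stream_size() for the compiled database.

```cpp
#include <cstddef>
#include <cstdint>

// Rough estimate of the memory held by stream state alone. It ignores the
// database itself, the scratch space and any allocator overhead.
std::uint64_t streamStateBytes(std::uint64_t streamCount,
                               std::size_t bytesPerStream) {
    return streamCount * static_cast<std::uint64_t>(bytesPerStream);
}
```

For instance, assuming a purely illustrative stream state of 1,024 bytes, one million open streams would hold 1,024,000,000 bytes, roughly 1 GB, in stream state alone.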

4. Building and running

Before running the example you need a pcap file and a rule file. The rule file looks like this:

：/weibo/

：/[f|F]ile/

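Each line holds one regular expression: the part before the colon is the expression's ID, and the part after it is a PCRE-style regular expression.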
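A line in this format can be split with a few string operations. The sketch below is a hypothetical parser, not the one in pcapscan.cc: the Rule struct and parseRule name are made up for illustration, the IDs in the usage note are invented, and keeping whatever follows the closing slash as raw flag text is an assumption of this sketch.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical holder for one parsed rule line.
struct Rule {
    unsigned id;
    std::string expression;
    std::string flags;  // raw text after the closing '/', possibly empty
};

// Parses one "<id>:/<regex>/<flags>" line. Returns false on malformed
// input and leaves `out` untouched in that case. Very large ids are not
// checked for overflow in this sketch.
bool parseRule(const std::string &line, Rule &out) {
    const std::size_t colon = line.find(':');
    if (colon == std::string::npos || colon == 0) return false;

    // The id must consist of decimal digits only.
    unsigned id = 0;
    for (std::size_t i = 0; i < colon; ++i) {
        const unsigned char c = static_cast<unsigned char>(line[i]);
        if (!std::isdigit(c)) return false;
        id = id * 10 + static_cast<unsigned>(c - '0');
    }

    // The expression sits between the '/' right after the colon and the
    // last '/' on the line, so slashes inside the regex are preserved.
    const std::size_t open = colon + 1;
    if (open >= line.size() || line[open] != '/') return false;
    const std::size_t close = line.rfind('/');
    if (close <= open) return false;  // closing slash is missing

    out.id = id;
    out.expression = line.substr(open + 1, close - open - 1);
    out.flags = line.substr(close + 1);
    return true;
}
```

Given the made-up line 2:/[f|F]ile/i, parseRule fills in id 2, expression [f|F]ile and flags i, while a line without a colon is rejected.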

Below is the output of building and running the example. I used a pcap of Weibo traffic and matched the keyword weibo in it:

zzq@ubuntu14:~/hs_demo$ g++ -o pcapscan pcapscan.cc -std=c++11 -lhs -lpcap
zzq@ubuntu14:~/hs_demo$ ./pcapscan ptn weibo.pcap
Pattern file: ptn
Compiling Hyperscan databases with  patterns.
Hyperscan streaming mode database compiled in 0.000236959 seconds.
Hyperscan block mode database compiled in 4.8277e-05 seconds.
PCAP input file: weibo.pcap
 packets in  streams, totalling  bytes.
Average packet length:  bytes.
Average stream length:  bytes.
Streaming mode Hyperscan database size    :  bytes.
Block mode Hyperscan database size        :  bytes.
Streaming mode Hyperscan stream state size:  bytes (per stream).
Streaming mode:
  Total matches:
  Match rate:    2.5312 matches/kilobyte
  Throughput (with stream overhead): 2576.33 megabits/sec
  Throughput (no stream overhead):   5444.49 megabits/sec
Block mode:
  Total matches:
  Match rate:    2.5312 matches/kilobyte
  Throughput:    16227.30 megabits/sec
WARNING: Input PCAP file is less than 2MB in size.
This test may have been too short to calculate accurate results.
