比特币源码情景分析之bloom filter精读

上一篇SPV钱包里utxos同步提到了bloom filter，这一章节我们将从源码分析角度来个深度解剖

Bloom filter基本原理

An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3.

下面的bitarray是一个m位的位数组。

filter的key集合是{x,y,z}, x添加到filter时，算法会将x进行三次不同的hash而生成3个值，将这个值当做bitarray的index,并将对应index的内容置位1, 蓝色的3个箭头代表x的3个index,w验证时同样经过3次hash, 最后会生成3个index,然后从bitarray中查找这3个index的内容，如果都为1，则证明存在，有一个不为1,说明不存在.

hash算法的特点是，相同输入产生固定的输出(index)，不同的输入可能会得到相同的输出(index), 所以bloom filter能完全确定不属于集合的Key值，但是可能会错误的将不属于集合的key值认为是属于集合的。

为了降低错误率，其实就是要降低不同key值再几次hash产生相同输出的概率。Bitmap的长度我们定为m,几次hash我们定义为k. m增大能降低一次hash不同输入产生相同输出的概率，k增大能降低所有hash都相同的概率。所以合适的m和k值对降低错误率很关键.具体怎么选取m, k值有相关的数学公式，大家可以参阅

https://en.wikipedia.org/wiki/Bloom_filter

Bitcoin bloom filter流程

1)load filter (net_processing.cpp)

    else if (strCommand == NetMsgType::FILTERLOAD)
    {
        CBloomFilter filter;
        vRecv >> filter;

        if (!filter.IsWithinSizeConstraints())
        {
            // There is no excuse for sending a too-large filter
            LOCK(cs_main);
            Misbehaving(pfrom->GetId(), 100);
        }
        else
        {
            LOCK(pfrom->cs_filter);
            pfrom->pfilter.reset(new CBloomFilter(filter));
            pfrom->pfilter->UpdateEmptyFull();
            pfrom->fRelayTxes = true;
        }
    }

filter的数据序列化

    template <typename Stream, typename Operation>
    inline void SerializationOp(Stream& s, Operation ser_action) {
        //vData是bloom filter的集合key
        READWRITE(vData);
        //几次hash函数
        READWRITE(nHashFuncs);
        READWRITE(nTweak);
        READWRITE(nFlags);
    }

2)新增filter

    else if (strCommand == NetMsgType::FILTERADD)
    {
        std::vector<unsigned char> vData;
        vRecv >> vData;

        // Nodes must NEVER send a data item > 520 bytes (the max size for a script data object,
        // and thus, the maximum size any matched object can have) in a filteradd message
        bool bad = false;
        if (vData.size() > MAX_SCRIPT_ELEMENT_SIZE) {
            bad = true;
        } else {
            LOCK(pfrom->cs_filter);
            if (pfrom->pfilter) {
                pfrom->pfilter->insert(vData);
            } else {
                bad = true;
            }
        }
        if (bad) {
            LOCK(cs_main);
            Misbehaving(pfrom->GetId(), 100);
        }
    }

其实就是按照bloom filter的算法对新增的key做几次hash然后修改bitArray

void CBloomFilter::insert(const std::vector<unsigned char>& vKey)
{
    if (isFull)
        return;
    //n次不同hash，不代表需要n个不同的hash函数，直接根据index更改hash seed即可实现
    for (unsigned int i = 0; i < nHashFuncs; i++)
    {
        unsigned int nIndex = Hash(i, vKey);
        // Sets bit nIndex of vData
        vData[nIndex >> 3] |= (1 << (7 & nIndex));
    }
    isEmpty = false;
}

上面的 vData[nIndex >> 3] |= (1 << (7 & nIndex)); 每一次key hash生成的结果对应到bitArray的1bit的index, 而vData是char对象，总共有4 bit，所以nIndex >> 3先找到对一个char的index, 1 << (7 & nIndex) 找到index对应4位中的哪一位

class CBloomFilter
{
private:
    std::vector<unsigned char> vData;
    unsigned int nHashFuncs;
    unsigned int nTweak;
}

nHashFuncs是int, 说好的不同的hash函数呢？

inline unsigned int CBloomFilter::Hash(unsigned int nHashNum, const std::vector<unsigned char>& vDataToHash) const
{
    // 0xFBA4C795 chosen as it guarantees a reasonable bit difference between nHashNum values.

    return MurmurHash3(nHashNum * 0xFBA4C795 + nTweak, vDataToHash) % (vData.size() * 8);
}

从这里可以看出，n个不同的hash函数，其实确实可以通过n个不同int即可实现，这里直接通过‘nHashNum * 0xFBA4C795 + nTweak’就达到了不同hash的效果

3)filter应用场景

我们以FILTERED_BLOCK消息为例，该消息的意思是获取指定blockhash中满足bloom filter的block 内容

        else if (inv.type == MSG_FILTERED_BLOCK)
        {
            bool sendMerkleBlock = false;
            CMerkleBlock merkleBlock;
            {
                LOCK(pfrom->cs_filter);
                if (pfrom->pfilter) {
                    sendMerkleBlock = true;
                    //merkleBlock只包含包头，符合条件的娥txhash及partial merklepath
                    //是一种被过滤掉的block content
                    merkleBlock = CMerkleBlock(*pblock, *pfrom->pfilter);
                }
            }
            if (sendMerkleBlock) {
                //返回merkleBlock
                connman->PushMessage(pfrom, msgMaker.Make(NetMsgType::MERKLEBLOCK, merkleBlock));
                // CMerkleBlock just contains hashes, so also push any transactions in the block the client did not see
                // This avoids hurting performance by pointlessly requiring a round-trip
                // Note that there is currently no way for a node to request any single transactions we didn't send here -
                // they must either disconnect and retry or request the full block.
                // Thus, the protocol spec specified allows for us to provide duplicate txn here,
                // however we MUST always provide at least what the remote peer needs
                typedef std::pair<unsigned int, uint256> PairType;
                for (PairType& pair : merkleBlock.vMatchedTxn)
                    //返回符合filter条件的transaction 数据
                    connman->PushMessage(pfrom, msgMaker.Make(SERIALIZE_TRANSACTION_NO_WITNESS, NetMsgType::TX, *pblock->vtx[pair.first]));
            }
            // else
                // no response
        }
}

filter具体过滤过程

CMerkleBlock::CMerkleBlock(const CBlock& block, CBloomFilter* filter, const std::set<uint256>* txids)
{
    header = block.GetBlockHeader();

    std::vector<bool> vMatch;
    std::vector<uint256> vHashes;

    vMatch.reserve(block.vtx.size());
    vHashes.reserve(block.vtx.size());

    for (unsigned int i = 0; i < block.vtx.size(); i++)
    {
        const uint256& hash = block.vtx[i]->GetHash();
        if (txids && txids->count(hash)) {
            vMatch.push_back(true);
        } else if (filter && filter->IsRelevantAndUpdate(*block.vtx[i])) {
            vMatch.push_back(true);
            vMatchedTxn.emplace_back(i, hash);
        } else {
            vMatch.push_back(false);
        }
        vHashes.push_back(hash);
    }

    txn = CPartialMerkleTree(vHashes, vMatch);
}


bool CBloomFilter::IsRelevantAndUpdate(const CTransaction& tx)
{
    bool fFound = false;
    // Match if the filter contains the hash of tx
    //  for finding tx when they appear in a block
    if (isFull)
        return true;
    if (isEmpty)
        return false;
    //获取txhash,看是否在bloom filter集合中

    const uint256& hash = tx.GetHash();
    if (contains(hash))
        fFound = true;

    for (unsigned int i = 0; i < tx.vout.size(); i++)
    {
        const CTxOut& txout = tx.vout[i];
        // Match if the filter contains any arbitrary script data element in any scriptPubKey in tx
        // If this matches, also add the specific output that was matched.
        // This means clients don't have to update the filter themselves when a new relevant tx 
        // is discovered in order to find spending transactions, which avoids round-tripping and race conditions.
        CScript::const_iterator pc = txout.scriptPubKey.begin();
        std::vector<unsigned char> data;
        while (pc < txout.scriptPubKey.end())
        {
            opcodetype opcode;
            //获取锁定脚本中的数据，以用于验证这些数据是否在bloom filter集合中
            if (!txout.scriptPubKey.GetOp(pc, opcode, data))
                break;

            //验证是否在在bloom filter集合中
            if (data.size() != 0 && contains(data))
            {
                fFound = true;
                break;
            }
        }
    }

    if (fFound)
        return true;

    for (const CTxIn& txin : tx.vin)
    {
        // Match if the filter contains an outpoint tx spends
        //txin.prevout是否在bloom filter集合中
        if (contains(txin.prevout))
            return true;

        // Match if the filter contains any arbitrary script data element in any scriptSig in tx
        CScript::const_iterator pc = txin.scriptSig.begin();
        std::vector<unsigned char> data;
        while (pc < txin.scriptSig.end())
        {
            opcodetype opcode;
            //获取解锁脚本以验证是否在在bloom filter集合中
            if (!txin.scriptSig.GetOp(pc, opcode, data))
                break;

            //验证是否在在bloom filter集合中
            if (data.size() != 0 && contains(data))
                return true;
        }
    }

    return false;
}


bool CBloomFilter::contains(const std::vector<unsigned char>& vKey) const
{
    for (unsigned int i = 0; i < nHashFuncs; i++)
    {
        unsigned int nIndex = Hash(i, vKey);
        // Checks bit nIndex of vData
        if (!(vData[nIndex >> 3] & (1 << (7 & nIndex))))
            return false;
    }
    return true;
}

总结，用来filter的数据可以是tx.hash,也可以是txout.scriptPubKey中的data，也可以是txin.scriptSig中的data

比如根据交易的publicKey来过滤交易,就可以在transaction的txin, txout的上做文章.想P2PK的解锁和锁定脚本中都有pubKey,可以用来filter.

/********************************

* 本文来自****博主"爱踢门"

* 转载请标明出处:http://blog.****.net/itleaks

******************************************/

秒客网

比特币源码情景分析之bloom filter精读

相关文章