Data Structures (16) ---- Implementing Autocomplete (1) ---- Trie

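1. Overview

A Trie is an efficient data structure for string lookup. It can be used for word-frequency counting in search engines, autocomplete, and similar tasks.

Inserting or looking up a word in a Trie takes O(len) time, where len is the length of the word.

With a balanced binary tree instead, a lookup takes O(lg N) comparisons, where N is the total number of words in the tree (and each comparison may itself inspect up to len characters).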
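To make the balanced-tree side of this comparison concrete, here is a minimal sketch that counts the string comparisons performed by one `std::set` lookup. It assumes `std::set` is backed by a balanced tree (typically a red-black tree); `CountingLess` and `countComparisons` are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical comparator: orders strings like operator<,
// but records how many times it is called.
struct CountingLess {
  std::size_t *calls;
  explicit CountingLess(std::size_t *c) : calls(c) {}
  bool operator()(const std::string &a, const std::string &b) const {
    ++*calls;
    return a < b;
  }
};

// Builds a set of n distinct words and returns the number of string
// comparisons needed for a single lookup.
std::size_t countComparisons(std::size_t n) {
  std::size_t calls = 0;
  CountingLess cmp(&calls);
  std::set<std::string, CountingLess> words(cmp);
  for (std::size_t i = 0; i < n; ++i)
    words.insert("word" + std::to_string(i));
  calls = 0;            // ignore the cost of building the set
  words.find("word0");  // one lookup
  return calls;
}
```

Calling `countComparisons(1000)` should return a count on the order of lg N rather than N. A Trie lookup, by contrast, follows one pointer per character and does not depend on N at all.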

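Tries are also particularly good at prefix search. Take autocomplete in an input method: type a prefix such as abs, and words such as abstract pop up immediately.

This lookup performance comes at the cost of space.

This article gives a simple Trie implementation and uses it to build an English dictionary of 7000+ words, in order to analyze how much space a Trie occupies.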

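2. Definition

A typical Trie is shown in the figure below:

Each node contains an array of 26 pointers, one for each of the 26 English letters.

Each node also carries a red mark that indicates whether the path from the root to that node spells a word.

Example: the leftmost path in the figure below represents the words abc and abcd.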

[Figure: a typical Trie, with red marks on the nodes where a word ends]
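The node layout described in this section can be sketched as follows. This is a minimal illustration rather than the full implementation from the source section: the names `Node`, `insertWord`, and `contains` are hypothetical, and the sketch assumes words made only of the lowercase letters a to z.

```cpp
#include <cassert>
#include <memory>
#include <string>

// One trie node: 26 child slots plus the "red mark" from the figure.
struct Node {
  bool isWord = false;              // true if root..here spells a word
  std::unique_ptr<Node> next[26];   // next[c - 'a'], empty if absent
};

// Walks down from the root, creating missing nodes, then marks the last one.
void insertWord(Node &root, const std::string &word) {
  Node *cur = &root;
  for (char c : word) {
    assert(c >= 'a' && c <= 'z');   // assumption: lowercase a-z only
    std::unique_ptr<Node> &slot = cur->next[c - 'a'];
    if (!slot) slot.reset(new Node());
    cur = slot.get();
  }
  cur->isWord = true;
}

// Returns true only if the path exists AND its last node is marked.
bool contains(const Node &root, const std::string &word) {
  const Node *cur = &root;
  for (char c : word) {
    if (c < 'a' || c > 'z') return false;
    cur = cur->next[c - 'a'].get();
    if (!cur) return false;
  }
  return cur->isWord;
}
```

After inserting abc and abcd into an empty root, `contains` succeeds for both words. It fails for ab, because that node exists on the path but carries no mark.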

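3. Performance

I ran a small test: building a dictionary of 7000+ words made the Trie allocate 22383 nodes, each occupying 27 * 4 bytes (presumably 26 four-byte pointers plus the flag padded to 4 bytes),

so the Trie consumed about 22383 * 27 * 4 bytes ≈ 2.4 MB.

Assuming the 7000 words average 8 letters each, the words themselves take only 7000 * 8 bytes = 56 KB.

That is a difference of roughly 43 times!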
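The estimate can be checked with a few lines of code: 7000 words of 8 bytes each come to 56,000 bytes, so the ratio is roughly 43. The sketch below assumes a 32-bit build in which each pointer is 4 bytes and the flag is padded to 4 bytes, and it takes 1 KB as 1000 bytes. The constant and function names are hypothetical.

```cpp
#include <cstddef>

// Figures reported in the text.
const std::size_t kNodes = 22383;   // nodes allocated by the trie
const std::size_t kWords = 7000;    // approximate dictionary size
const std::size_t kAvgLen = 8;      // assumed average word length

// Assumed 32-bit layout: 26 pointers + 1 padded flag, 4 bytes each.
const std::size_t kNodeBytes = 27 * 4;

// Total bytes used by the trie nodes: 2,417,364.
std::size_t trieBytes() { return kNodes * kNodeBytes; }

// Bytes needed for the raw letters alone: 56,000.
std::size_t rawBytes() { return kWords * kAvgLen; }

// How many times larger the trie is than the raw text.
double overhead() {
  return static_cast<double>(trieBytes()) / rawBytes();
}
```

On a typical 64-bit build, pointers are 8 bytes, so each node would be about twice as large and the gap would widen accordingly.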

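As this small test shows, a Trie takes a lot of space. This gets worse if upper and lower case are distinguished, or if the Trie is built for Chinese characters, because each node then needs even more pointers.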
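To see how the node size grows with the alphabet, the sketch below parameterizes the node on its number of child slots. `NodeOf` and `nodeBytes` are hypothetical names. The exact sizes depend on the platform's pointer width, so the code only relies on a lower bound.

```cpp
#include <cstddef>

// Hypothetical node type parameterized by the number of child slots.
template <std::size_t Branches>
struct NodeOf {
  bool isStr;
  NodeOf *next[Branches];
};

// Size of one node in bytes for a given number of child slots.
template <std::size_t Branches>
constexpr std::size_t nodeBytes() { return sizeof(NodeOf<Branches>); }

// 26 slots: lowercase only. 52 slots: upper and lower case.
static_assert(nodeBytes<52>() >= nodeBytes<26>() + 26 * sizeof(void *),
              "doubling the alphabet adds at least 26 more pointers");
```

For an alphabet of thousands of symbols, such as Chinese characters, a fixed array in every node quickly becomes impractical.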

In fact, it is easy to see at a glance that every node in a Trie holds a large number of null pointers, and this is what wastes so much space.
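The waste can be quantified from the node count reported in the performance test. In any tree, every node except the root is pointed to by exactly one slot of its parent, so the number of non-null pointers equals the number of non-root nodes. The sketch below assumes the reported 22383 nodes exclude the root, as the counter in the source section suggests. The names are hypothetical.

```cpp
#include <cstddef>

const std::size_t kBranches = 26;
const std::size_t kAllocated = 22383;            // count reported in the text
const std::size_t kTotalNodes = kAllocated + 1;  // plus the root (assumption)

// Every node owns 26 pointer slots: 581,984 in total.
std::size_t totalSlots() { return kTotalNodes * kBranches; }

// Each non-root node occupies exactly one slot of its parent.
std::size_t usedSlots() { return kAllocated; }

// Fraction of pointer slots that are null.
double nullFraction() {
  return 1.0 - static_cast<double>(usedSlots()) / totalSlots();
}
```

Under these assumptions roughly 96% of all pointer slots are null.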

A ternary search tree can be used to improve on the Trie; it will be discussed in the next article.

4. Source Code

// Last Update:2014-04-16 23:24:47
/**
 * @file trie.h
 * @brief Trie
 * @author shoulinjun@126.com
 * @version 0.1.00
 * @date 2014-04-16
 */

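#ifndef TRIE_H
#define TRIE_H

#include <cctype>    // tolower
#include <cstring>   // memset
#include <fstream>
#include <iostream>
#include <string>

using std::string;
using std::cout;
using std::endl;

// Number of children per node: one slot per lowercase letter 'a'..'z'.
const int branchNum = 26;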

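// A single Trie node.
struct TrieNode
{
  TrieNode(): isStr(false)
  {
    memset(next, 0, sizeof(next));  // all children start out as NULL
  }
  bool isStr;                 // true if the path from the root to this node is a word
  TrieNode* next[branchNum];  // next[c - 'a'] is the child for letter c
};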

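// Returns a lowercase copy of s.
string ToLower(const string &s)
{
  string str;
  str.reserve(s.size());
  for(string::const_iterator it = s.begin(); it != s.end(); ++it)
  {
    // tolower expects a value representable as unsigned char
    str += (char)tolower((unsigned char)*it);
  }
  return str;
}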

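/**
 * A simple data structure,
 * useful for AutoComplete.
 * Only words made of the letters a-z (case-insensitive) are supported.
 */
class Trie
{
public:
  Trie(): root(new TrieNode()) {}
  ~Trie()
  {
    cout << "# of nodes allocated: " << count << endl;
    destroy(root);
  }
  void Insert(const string &str);         // add a word
  bool Search(const string &str) const;   // exact-match lookup
  void AutoComplete(const string &str);   // print all words starting with str
  void Input(const string &file);         // insert every word read from file
private:
  Trie(const Trie &);              // not copyable: the Trie owns its nodes
  Trie& operator=(const Trie &);
  TrieNode* find(const string &str) const;
  void dfs(TrieNode *root, string &path);
  void destroy(TrieNode * &root);
  TrieNode *root;
  // nodes allocated by Insert (the root is not counted), shared by all Trie objects
  static size_t count;
};

// Defined in the header, so include trie.h from one source file only.
size_t Trie::count = 0;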

// Recursively frees the subtree rooted at root and resets the pointer.
void Trie::destroy(TrieNode * &root)
{
  for(int i=0; i<branchNum; ++i)
  {
    if(root->next[i])
      destroy(root->next[i]);
  }
  delete root;
  root = NULL;
}

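void Trie::Insert(const string &s)
{
  if(s.empty()) return;
  /* only lowercase letters are stored */
  string str = ToLower(s);
  // skip words containing anything other than a-z (digits, apostrophes, ...),
  // which would otherwise index outside next[]
  for(string::const_iterator c = str.begin(); c != str.end(); ++c)
  {
    if(*c < 'a' || *c > 'z') return;
  }
  string::const_iterator it = str.begin();
  TrieNode *location(root);
  // walk along the nodes that already exist
  while(it != str.end() && location->next[*it - 'a'] != NULL)
  {
    location = location->next[*it - 'a'];
    ++ it;
  }
  // create nodes for the remaining letters
  while(it != str.end())
  {
    location->next[*it - 'a'] = new TrieNode();
    ++ count;
    location = location->next[*it - 'a'];
    ++ it;
  }
  // mark the last node as the end of a word
  location->isStr = true;
}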

// Inserts every whitespace-separated word found in the given file.
void Trie::Input(const string &str)
{
  std::ifstream ifile(str.c_str());
  string word;
  while(ifile >> word)
  {
    Insert(word);
  }
  ifile.close();
}

// Returns true if s was inserted as a whole word (case-insensitive).
bool Trie::Search(const string &s) const
{
  string str = ToLower(s);
  TrieNode *location = find(str);
  return (location) && location->isStr;
}

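// Returns the node reached by following str from the root,
// or NULL if no such path exists.
TrieNode* Trie::find(const string &str) const
{
  TrieNode *location = root;
  string::const_iterator it = str.begin();
  while(it != str.end())
  {
    // characters outside a-z cannot be in the Trie
    if(*it < 'a' || *it > 'z') return NULL;
    if(location->next[*it - 'a'] == NULL) return NULL;
    location = location->next[*it - 'a'];
    ++ it;
  }
  return location;
}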

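// Prints every word in the subtree below root; path holds the letters
// from the Trie's root down to this node.
void Trie::dfs(TrieNode *root, string &path)
{
  if(root == NULL) return;
  if(root->isStr)
    cout << path << endl;
  for(char x='a'; x<='z'; ++x)
  {
    if(root->next[x-'a'] != NULL)
    {
      path += x;
      dfs(root->next[x-'a'], path);
      path.resize(path.size()-1);  // backtrack
    }
  }
}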

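// Prints every stored word that starts with the given prefix.
void Trie::AutoComplete(const string &str)
{
  // the Trie stores lowercase only, so normalize the prefix first
  string path = ToLower(str);
  TrieNode *location = find(path);
  dfs(location, path);
}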

#endif  /*TRIE_H*/