FP—Growth算法

FP_growth算法是韩家炜老师在2000年提出的关联分析算法，该算法和Apriori算法最大的不同有两点：

第一，不产生候选集，第二，只需要两次遍历数据库，大大提高了效率，用31646条测试记录，最小支持度是2%，

用Apriori算法要半个小时但是用FP_growth算法只要6分钟就可以了，效率非常明显。

它的核心是FP_tree，一种树型数据结构，特点是尽量把相同元素用一个节点表示，这样就大大减少了空间，和birch算法有类似的思想。还是以如下数据为例。

FP—Growth算法
每一行表示一条交易，共有9行，既9笔交易，左边表示交易ID，右边表示商品名称。最小支持度是22%，那么每件商品至少要出现9*22%=2次才算频繁。第一次扫描数据库，统计每件商品出现的次数，按次数对各个商品递减排序，有： FP—Growth算法。然后第二次扫描数据库，在每条交易中按此种顺序给商品排序，如果有某个商品出现的次数小于阈值2，则删除该商品，有：

FP—Growth算法
剩下的就是构造FP_tree了，这是核心，树的每个节点的结构体如下：

//FP-tree的存储结构
typedef struct CSNode{
//商品编号
int item;
//次数
int count;
//父节点，孩子节点，兄弟节点
CSNode *parent,*firstchild,*nextsibling;
//相同商品的前驱，后继节点，方便将相同商品的节点连接起来,根节点的直接孩子节点的这两个指针都是空
CSNode *pre,*next;
}*CSTree;
其中item,*firstchild,*nextsibling是树这个结构体常用的属性。count记录商品item出现的次数，*parent是为了方便从叶子节点逆向访问根节点而设置的。*pre,*next的注释已经很清楚了。构造树的原则是：将每条记录看做一个从根节点到叶子节点的路径，如果某个商品在节点中已经存在了，则对应count计数器加1，相当于所有的前缀都要加1,如果不存在则在该条记录的后面商品开辟一条新的路径。下面一条一条记录演示怎么构造FP_tree。

第三次访问数据库，构造FP_tree。第一条记录：I2,I1,I5，有：

FP—Growth算法
父节点没有表示出来，根节点是空节点。2:1表示商品2出现了1次，其他表示类推。左边的数组按照商品顺序递减排列，保存了各个商品的当前指针，目的是为了在后面找到相同的后缀，将相同的商品用单项箭头虚线连起来，实际是双向链表链接的，并且将此时的节点商品1和节点商品5保存为商品1和商品5的当前指针，而对于商品2，商品3，商品4的当前指针还在左边的数组中保存。注意根节点的直接孩子不用连起来，后面会讲理由。第二条记录：I2,I4，有：

FP—Growth算法
该记录和第一条记录共用前缀I2,所以商品2的次数要加1，而商品4则作为商品2的一个新孩子节点，这里没有把兄弟节点画出来。并且左边商品4要指向该节点，此时商品4的当前指针指向节点商品4。第三条记录：I2,I3,类似，结果是：

FP—Growth算法
第四条记录：I2,I1,I4有：

FP—Growth算法
当添加完商品4后，商品4的当前指针要指向新的节点商品4，此时两条红色的虚线就把以商品4为后缀的节点连起来了。第5条记录：I1,I3,有：

FP—Growth算法
商品1由于和根节点的所有直接子孩子（这里只有商品2这个子孩子）不同，因此要另外开辟一条路径。商品3的当前指针要指向新的节点商品3，如图中的黄色虚线所指，到这里体现了构造FP_tree的一般性了。再把剩下的记录都加进来，最终的FP_tree是：

FP—Growth算法
这颗FP_tree最大程度的把相同的商品放在用同一个节点保存，最大限度的节省了空间。剩下的工作就是挖掘这颗FP_tree了。

挖掘的目的是找出FP_tree的各个路径中相同的集合，有两中方式，方式一，从根节点朝叶子节点顺着遍历树，方式二，从叶子节点朝根节点逆着遍历树。想想方式一挺麻烦的，幸亏我们设置了*parent指针，通过它就可以很方便的用方式二。我们从商品出现次数由少到多的顺序开始遍历树，先从商品5开始，由于有*pre,*next指针分方便将所有以商品5做为叶子节点的路径全找出来，然后再根据*parent指针找到父节点，根节点是空不用找。以I5做元素的条件模式基是：{(I2 I1:1),(I2 I1 I3:1)}。后面的1表示出现商品I2,I1,I5同时出现的次数。现在解释为什么：根节点的直接孩子不用*pre,*next指针连起来，因为假如连起来的话，那么以它为后缀时，将没有前缀，也就是说它的频繁项集是1，这在大多数情况下没意义。由它构造出条件FP_tree，注意由于开始按照商品名称排序了，那么条件模式基中的每一项也会按照这种方式排序。如果条件模式基中某项A是另外一项B的子集那么在算B时，要将A出现的次数加上，实现这个功能最简单明了的方法就是一一匹配，假如条件模式基共有N项，则时间复杂度是N的平方，若先按照条件模式基的长度递增排序得到：{(I2 I1:1),(I2 I1 I3:1)}，排序的时间复杂度是N*log(N),那么只有可能是长度短的项是长度长的项的子集，此时总匹配次数是：N-1 + N-2 + ,,, + 1 = N*(N-1)/2,和前面的排序时间加起来是：N*log(N) + N*(N-1)/2当N大于时4时，该值小于N的平方。在实际中N一般会大于4。最终我们得到以I5作为后缀的频繁项集是：{I2 I5:2},{I1 I5:2},{I2 I1 I5:2}他们出现的次数都大于等于最小支持度。类似可以得到其它后缀的频繁项集。

FP_growth算法不产生候选序列，并且只需要3次遍历数据库，对比Apriori算法而言有了很大的改进。其实想想这也符合历史发展的规律，Apriori在1993年才提出来的，那是数据挖掘才刚起步，而到2000年时，已经有了一定的发展，FP_growth是站在Apriori的肩膀上发明的，这种现象具有普遍性。

FP—growth代码实现部分

主程序部分

 package DataMining_FPTree;

 /**

  * FPTree频繁模式树算法

  * 一个使用的这个算法的用例是输入一个单词或者单词的一部分，搜索引擎就会自动 补全查询词项，通过查看互联网上的用词来找出经常在一块出现的词对（使用Aporior算法也是找出经常出现的词对，这两种方法都是无监督学习），这需要一种发现频繁集的方法

  * @author clj

  *

  */

 public class Client {

     public static void main(String[] args){

         //这里使用的是输入文件的绝对路径

         String filePath="E:\\code\\data mining\\DataMining_FPTree\\src\\DataMining_FPTree\\testInput.txt";

         //最小支持度阈值

         int minSupportCount = 2;

         //构造函数

         FPTreeTool tool = new FPTreeTool(filePath, minSupportCount);

         //调用构建树

         tool.startBuildingTree();

     }

 }

树节点的数据结构

 package DataMining_FPTree;

 import java.util.ArrayList;

 /**

  * FP树节点

  * 这里使用Comparable的原因是因为每个项集要进行排序

  * 按照节点的count来排序

  * @author clj

  *

  */

 public class TreeNode implements Comparable<TreeNode>, Cloneable{

     // 节点类别名称

     private String name;

     // 计数数量

     private Integer count;

     // 父亲节点，这个节点的用法是根据给定叶子节点上溯到整棵树，这时就需要指向父节点

     private TreeNode parentNode;

     // 孩子节点，可以为多个

     private ArrayList<TreeNode> childNodes;

     public TreeNode(String name, int count){

         this.name = name;

         this.count = count;

     }

     public String getName() {

         return name;

     }

     public void setName(String name) {

         this.name = name;

     }

     public Integer getCount() {

         return count;

     }

     public void setCount(Integer count) {

         this.count = count;

     }

     public TreeNode getParentNode() {

         return parentNode;

     }

     public void setParentNode(TreeNode parentNode) {

         this.parentNode = parentNode;

     }

     public ArrayList<TreeNode> getChildNodes() {//孩子节点可能不止一个，所以需要用list来保存

         return childNodes;

     }

     public void setChildNodes(ArrayList<TreeNode> childNodes) {

         this.childNodes = childNodes;

     }

     @Override

     public int compareTo(TreeNode o) {

         // TODO Auto-generated method stub

         return o.getCount().compareTo(this.getCount());

     }

     @Override

     protected Object clone() throws CloneNotSupportedException {//如果想重写父类的方法，比如toString()方法的话，在方法前面加上@Override  系统可以帮你检查方法的正确性，

         // TODO Auto-generated method stub

         //因为对象内部有引用，需要采用深拷贝，这里就相当于是一个深度优先搜索,这里的clone相当于没有用

         //System.out.println("The name="+this.getName());

         TreeNode node = (TreeNode)super.clone();

         if(this.getParentNode() != null){

             node.setParentNode((TreeNode) this.getParentNode().clone());

         }

         if(this.getChildNodes() != null){

             node.setChildNodes((ArrayList<TreeNode>) this.getChildNodes().clone());

         }

         return node;

     }

 }

程序的主要部分FPTreeTool

 package DataMining_FPTree;

 import java.io.BufferedReader;

 import java.io.File;

 import java.io.FileReader;

 import java.io.IOException;

 import java.util.ArrayList;

 import java.util.Collections;

 import java.util.HashMap;

 import java.util.Iterator;

 import java.util.Map;

 import java.util.Map.Entry;

 /**

  * FPTree算法工具类

  * 与Apriori算法不同的是FP树需要将非频繁项移除并且重排序

  * @author clj

  *

  */

 public class FPTreeTool {

     // 输入数据文件位置

     private String filePath;

     // 最小支持度阈值

     private int minSupportCount;

     // 所有事物ID记录

     private ArrayList<String[]> totalGoodsID;

     // 各个ID的统计数目映射表项，计数用于排序使用，用于项集

     private HashMap<String, Integer> itemCountMap;

     //后面的成员方法中并没有重新定义成员变量，所以成员函数中可以改变的成员变量的值

     public FPTreeTool(String filePath, int minSupportCount) {

         this.filePath = filePath;

         this.minSupportCount = minSupportCount;

         readDataFile();

     }

     /**

      * 从文件中读取数据,至此还没有对数据进行排序

      */

     private void readDataFile() {

         File file = new File(filePath);

         ArrayList<String[]> dataArray = new ArrayList<String[]>();

         try {

             BufferedReader in = new BufferedReader(new FileReader(file));//这一句话相当于新建了两个对象

             String str;

             String[] tempArray;

             while ((str = in.readLine()) != null) {

                 tempArray = str.split(" ");

                 dataArray.add(tempArray);

             }

             in.close();

         } catch (IOException e) {

             e.getStackTrace();

         }

         String[] temp;

         int count = 0;

         itemCountMap = new HashMap<>();//之所以使用会使用hashMap的形式是因为后面会更改key所对应的value的值，时间复杂度小

         totalGoodsID = new ArrayList<>();//totalGoodsId只需要将其保存在矩阵中

         for (String[] a : dataArray) {

             temp = new String[a.length - 1];

             System.arraycopy(a, 1, temp, 0, a.length - 1);//和Apriori算法一样第一个保存的是第几笔记录

             totalGoodsID.add(temp);

             for (String s : temp) {

                 if (!itemCountMap.containsKey(s)) {

                     count = 1;

                 } else {

                     count = ((int) itemCountMap.get(s));

                     // 支持度计数加1

                     count++;

                 }

                 // 更新表项，如果有key s，则直接更新，否则创建

                 itemCountMap.put(s, count);

             }

         }

         System.out.println("name="+itemCountMap.keySet()+" count="+itemCountMap.values());

     }

     /**

      * 根据事务记录构造FP树

      * 当suffixPattern不为空的时候，建立的就是条件FP树

      * surffixPatter是后缀模式

      */

     private void buildFPTree(ArrayList<String> suffixPattern,

             ArrayList<ArrayList<TreeNode>> transctionList) {

         // 设置一个空根节点

         TreeNode rootNode = new TreeNode(null, 0);

         int count = 0;

         // 节点是否存在

         boolean isExist = false;

         ArrayList<TreeNode> childNodes;

         ArrayList<TreeNode> pathList;

         // 相同类型节点链表，用于构造的新的FP树

         HashMap<String, ArrayList<TreeNode>> linkedNode = new HashMap<>();//每个节点的LinkNode

         HashMap<String, Integer> countNode = new HashMap<>();

         // 根据事务记录，一步步构建FP树，逐个读入事务记录，并把每个事务映射到FP树中的一条路径中

         for (ArrayList<TreeNode> array : transctionList) {

             TreeNode searchedNode;//TreeNode节点中每个项集中应该是只有一个元素

             pathList = new ArrayList<>();//在构建的时候，将读入的每个项集添加到一条已经存在的路径中

             /*

             System.out.print("array=");

             for(int i=0;i<array.size();i++)

                 System.out.print("\t"+array.get(i).getName());

             System.out.println();

             */

             for (TreeNode node : array) {//array保存的是FP中的一条路径

                 pathList.add(node);//pathList开始为空，在事务中读到一个节点就把它放到pathList中

                 //System.out.println("正在处理的节点node="+node.getName()+" count="+node.getCount());

                 //System.out.println("before keySets="+countNode.keySet()+"count="+countNode.values());

                 nodeCounted(node, countNode);//countNode是一个HashMap类型，初始时为一个空的HashMap，在读事务过程中依次进行修改

                 //System.out.println("after keySets="+countNode.keySet()+"count="+countNode.values());

                 /*System.out.print("pathList=");

                 for(int i=0;i<pathList.size();i++)

                     System.out.print("\t"+pathList.get(i).getName());

                 System.out.println();

                 */

                 searchedNode = searchNode(rootNode, pathList);//这里只是查找，不会影响count的变化

                 childNodes = searchedNode.getChildNodes();

                 if (childNodes == null) {//如果正好找到路径中的结尾，则直接加入到结尾

                     //System.out.println("找到了对应的叶节点，在叶节点下存储");

                     childNodes = new ArrayList<>();

                     childNodes.add(node);

                     searchedNode.setChildNodes(childNodes);

                     node.setParentNode(searchedNode);

                     nodeAddToLinkedList(node, linkedNode);

                 } else {

                     isExist = false;

                     for (TreeNode node2 : childNodes) {

                         // 如果找到名称相同，则更新支持度计数

                         //System.out.println("##############");

                         if (node.getName().equals(node2.getName())) {

                             //System.out.println("在父节点下找到了对应的节点");

                             count = node2.getCount() + node.getCount();

                             node2.setCount(count);

                             // 标识已找到节点位置

                             isExist = true;

                             break;

                         }

                     }

                     if (!isExist) {

                         // 如果没有找到，需添加子节点

                         //System.out.println("&没有在父节点下找到了对应的节点");

                         childNodes.add(node);

                         node.setParentNode(searchedNode);

                         nodeAddToLinkedList(node, linkedNode);

                     }

                 }

                 //System.out.println("countNode.key="+countNode.keySet()+"   value="+countNode.values());

                 /*Iterator<Entry<String, ArrayList<TreeNode>>> it = linkedNode.entrySet().iterator();

                 while( it.hasNext())

                 {

                     Map.Entry<String, ArrayList<TreeNode>> entry = it.next();

                     String key = entry.getKey();

                     ArrayList<TreeNode> values=(ArrayList<TreeNode>)entry.getValue();

                     for(TreeNode value:values)

                     {

                         System.out.print(" linkedNode.name="+value.getName()+"\tLinkedNode.count="+value.getCount());

                     }

                     System.out.println();

                     //TreeNode tempNode= entry.getValue().get(i);

                 }*/

             }

         }

         // 如果FP树已经是单条路径，则输出此时的频繁模式

         if(suffixPattern!=null)

         {

             System.out.println("suffixPattern.size="+suffixPattern.size());

             for(int i=0;i<suffixPattern.size();i++)

                 System.out.print(suffixPattern.get(i)+"\t ");

             System.out.println();

         }

         else

             System.out.println("suffixPattern.size=0");

         if (isSinglePath(rootNode)) {

             System.out.println("issinglePath-------");

             printFrequentPattern(suffixPattern, rootNode);

         } else {

             ArrayList<ArrayList<TreeNode>> tList;

             ArrayList<String> sPattern;

             if (suffixPattern == null) {

                 sPattern = new ArrayList<>();

             } else {

                 // 进行一个拷贝，避免互相引用的影响

                 sPattern = (ArrayList<String>) suffixPattern.clone();

             }

             // 利用节点链表构造新的事务

             for (Map.Entry entry : countNode.entrySet()) {

                 // 添加到后缀模式中

                 sPattern.add((String) entry.getKey());

                 System.out.println("entry.key="+entry.getKey()+"\tentry.value="+entry.getValue());

                 //获取到了条件模式机，作为新的事务

                 tList = getTransactionList((String) entry.getKey(), linkedNode);

                 System.out.print("[后缀模式]：{");

                 for(String s: sPattern){

                     System.out.print(s + ", ");

                 }

                 System.out.print("}, 此时的条件模式基：");

                 for(ArrayList<TreeNode> tnList: tList){

                     System.out.print("{");

                     for(TreeNode n: tnList){

                         System.out.print(n.getName() + ", ");

                     }

                     System.out.print("}, ");

                 }

                 System.out.println();

                 // 递归构造FP树

                 buildFPTree(sPattern, tList);

                 // 再次移除此项，构造不同的后缀模式，防止对后面造成干扰

                 sPattern.remove((String) entry.getKey());

             }

         }

     }

     /**

      * 将节点加入到同类型节点的链表中

      *

      * @param node

      *            待加入节点

      * @param linkedList

      *            链表图

      */

     private void nodeAddToLinkedList(TreeNode node,

             HashMap<String, ArrayList<TreeNode>> linkedList) {

         String name = node.getName();

         ArrayList<TreeNode> list;

         if (linkedList.containsKey(name)) {

             list = linkedList.get(name);

             // 将node添加到此队列末尾

             list.add(node);

         } else {

             list = new ArrayList<>();

             list.add(node);

             linkedList.put(name, list);

         }

     }

     /**

      * 根据链表构造出新的事务,根据name，得到以name为尾的各记录

      *

      * @param name

      *            节点名称

      * @param linkedList

      *            链表

      * @return

      */

     private ArrayList<ArrayList<TreeNode>> getTransactionList(String name,

             HashMap<String, ArrayList<TreeNode>> linkedList) {

         ArrayList<ArrayList<TreeNode>> tList = new ArrayList<>();

         ArrayList<TreeNode> targetNode = linkedList.get(name);

         ArrayList<TreeNode> singleTansaction;

         TreeNode temp;

         System.out.println("#getTransaction中name="+name);

         for (TreeNode node : targetNode) {

             singleTansaction = new ArrayList<>();

             temp = node;

             while (temp.getParentNode().getName() != null) {

                 System.out.println("temp.name="+temp.getName()+"\tcount="+temp.getCount());

                 temp = temp.getParentNode();

                 singleTansaction.add(new TreeNode(temp.getName(), 1));

             }

             System.out.println("temp.name="+temp.getName()+"\tcount="+temp.getCount());

             System.out.println("singleTansaction=");

             for(int i=0;i<singleTansaction.size();i++)

             {

                 System.out.println("("+singleTansaction.get(i).getName()+","+singleTansaction.get(i).getCount()+")");

             }

             System.out.println();

             // 按照支持度计数得反转一下

             Collections.reverse(singleTansaction);

             for (TreeNode node2 : singleTansaction) {

                 // 支持度计数调成与模式后缀一样

                 node2.setCount(node.getCount());

             }

             System.out.println("##singleTansaction=");

             for(int i=0;i<singleTansaction.size();i++)

             {

                 System.out.println("("+singleTansaction.get(i).getName()+","+singleTansaction.get(i).getCount()+")");

             }

             System.out.println();

             if (singleTansaction.size() > 0) {

                 tList.add(singleTansaction);

             }

         }

         return tList;

     }

     /**

      * 节点计数

      *

      * @param node

      *            待加入节点

      * @param nodeCount

      *            计数映射图

      */

     private void nodeCounted(TreeNode node, HashMap<String, Integer> nodeCount) {

         int count = 0;

         String name = node.getName();

         if (nodeCount.containsKey(name)) {

             count = nodeCount.get(name);

             count++;

         } else {

             count = 1;

         }

         nodeCount.put(name, count);

     }

     /**

      * 显示决策树

      *

      * @param node

      *            待显示的节点

      * @param blankNum

      *            行空格符，用于显示树型结构

      */

     private void showFPTree(TreeNode node, int blankNum) {

         System.out.println("￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥￥显示FPTree");

         for (int i = 0; i < blankNum; i++) {

             System.out.print("\t");

         }

         System.out.print("--");

         System.out.print("--");

         if (node.getChildNodes() == null) {//叶子节点

             System.out.print("[");

             System.out.print("I" + node.getName() + ":" + node.getCount());

             System.out.print("]");

         } else {

             // 递归显示子节点

              System.out.print("【" + node.getName() + "】");

             for (TreeNode childNode : node.getChildNodes()) {

                 showFPTree(childNode, 2 * blankNum);

             }

         }

     }

     /**

      * 待插入节点的抵达位置节点，从根节点开始向下寻找待插入节点的位置，返回待插入节点的父节点

      *

      * @param root

      * @param list

      * @return

      */

     private TreeNode searchNode(TreeNode node, ArrayList<TreeNode> list) {

         ArrayList<TreeNode> pathList = new ArrayList<>();

         TreeNode tempNode = null;

         TreeNode firstNode = list.get(0);

         boolean isExist = false;

         // 重新转一遍，避免出现同一引用

         for (TreeNode node2 : list) {

             pathList.add(node2);

         }

         //System.out.println("待插入的节点：name="+node.getName()+" count="+node.getCount());

         /*for(int i=0;i<list.size();i++)

             System.out.print("\t("+list.get(i).getName()+","+list.get(i).getCount()+")");

         System.out.println();*/

         // 如果没有孩子节点，则直接返回，在此节点下添加子节点，查找已构建树中的叶子节点

         if (node.getChildNodes() == null) {

             //System.out.println("此节点为叶子节点，为返回的节点，node.name="+node.getName()+" count="+node.getCount());

             return node;

         }

         for (TreeNode n : node.getChildNodes()) {

             if (n.getName().equals(firstNode.getName()) && list.size() == 1) {//list中只有一个元素，即路径中的第一个元素

                 tempNode = node;

                 isExist = true;

                 //System.out.println("第一个元素恰好为要查找的节点，且节点长度为1");

                 break;

             } else if (n.getName().equals(firstNode.getName())) {

                 // 还没有找到最后的位置，继续找，在查找的过程中时是正好匹配，从路径中消除

                 //System.out.println("#第一个元素恰好为要查找的节点，且节点长度不为1");

                 pathList.remove(firstNode);

                 tempNode = searchNode(n, pathList);//使用递归的形式去查询子节点

                 //System.out.println("￥￥￥返回节点：tempNode.name="+tempNode.getName()+" count="+tempNode.getCount());

                 return tempNode;

             }

         }

         // 如果没有找到，则新添加到孩子节点中

         if (!isExist) {

             //System.out.println("没有找到");

             tempNode = node;

         }

         //System.out.println("@@@@返回节点：node.name="+tempNode.getName()+" count="+tempNode.getCount());

         return tempNode;

     }

     /**

      * 判断目前构造的FP树是否是单条路径的

      *

      * @param rootNode

      *            当前FP树的根节点

      * @return

      */

     private boolean isSinglePath(TreeNode rootNode) {

         // 默认是单条路径

         boolean isSinglePath = true;

         ArrayList<TreeNode> childList;

         TreeNode node;

         node = rootNode;

         //是使用循环而不是递归判断是否是单条路径

         while (node.getChildNodes() != null) {

             childList = node.getChildNodes();

             if (childList.size() == 1) {

                 node = childList.get(0);

             } else {

                 isSinglePath = false;

                 break;

             }

         }

         return isSinglePath;

     }

     /**

      * 开始构建FP树

      */

     public void startBuildingTree() {

         ArrayList<TreeNode> singleTransaction;//单条事务

         ArrayList<ArrayList<TreeNode>> transactionList = new ArrayList<>();//事务总链

         TreeNode tempNode;

         int count = 0;

         for (String[] idArray : totalGoodsID) {

             singleTransaction = new ArrayList<>();

             for (String id : idArray) {

                 count = itemCountMap.get(id);

                 tempNode = new TreeNode(id, count);

                 singleTransaction.add(tempNode);

             }

             // 根据支持度数的多少进行排序

             Collections.sort(singleTransaction);

             /*System.out.println("singleTansaction as following:");

             for(int i=0;i<singleTransaction.size();i++)

                 System.out.print("("+singleTransaction.get(i).getName()+","+singleTransaction.get(i).getCount()+")");

             System.out.println();*/

             for (TreeNode node : singleTransaction) {

                 // 支持度计数重新归为1,将事务路径节点的count设置为1

                 node.setCount(1);

             }

             /*System.out.println("singleTansaction");

             for(int i=0;i<singleTransaction.size();i++)

                 System.out.print("***("+singleTransaction.get(i).getName()+","+singleTransaction.get(i).getCount()+")");

             System.out.println();*/

             transactionList.add(singleTransaction);

         }

         for(int i=0;i<transactionList.size();i++)

         {

             ArrayList<TreeNode> singleTransaction1=new ArrayList<>();

             singleTransaction1=transactionList.get(i);

             for(int j=0;j<singleTransaction1.size();j++)

             {

                 System.out.print("("+singleTransaction1.get(j).getName()+","+singleTransaction1.get(j).getCount()+")");

             }

             System.out.println();

         }

         buildFPTree(null, transactionList);

     }

     /**

      * 输出此单条路径下的频繁模式

      *

      * @param suffixPattern

      *            后缀模式

      * @param rootNode

      *            单条路径FP树根节点

      */

     private void printFrequentPattern(ArrayList<String> suffixPattern,

             TreeNode rootNode) {

         ArrayList<String> idArray = new ArrayList<>();

         TreeNode temp;

         temp = rootNode;

         // 用于输出组合模式

         int length = 0;

         int num = 0;

         int[] binaryArray;

         while (temp.getChildNodes() != null) {

             temp = temp.getChildNodes().get(0);

             // 筛选支持度系数大于最小阈值的值，P(A)>P(AB),若P(A)<阈值，则删除这个节点即不添加到里面

             if (temp.getCount() >= minSupportCount) {

                 idArray.add(temp.getName());

             }

         }

         length = idArray.size();

         num = (int) Math.pow(2, length);

         for (int i = 0; i < num; i++) {

             binaryArray = new int[length];

             numToBinaryArray(binaryArray, i);

             // 如果后缀模式只有1个，不能输出自身

             if (suffixPattern.size() == 1 && i == 0) {

                 continue;

             }

             System.out.print("频繁模式：{【后缀模式：");

             // 先输出固有的后缀模式

             if (suffixPattern.size() > 1

                     || (suffixPattern.size() == 1 && idArray.size() > 0)) {

                 for (String s : suffixPattern) {

                     System.out.print(s + ", ");

                 }

             }

             System.out.print("】");

             // 输出路径上的组合模式

             for (int j = 0; j < length; j++) {

                 if (binaryArray[j] == 1) {

                     System.out.print(idArray.get(j) + ", ");

                 }

             }

             System.out.println("}");

         }

     }

     /**

      * 数字转为二进制形式

      *

      * @param binaryArray

      *            转化后的二进制数组形式

      * @param num

      *            待转化数字

      */

     private void numToBinaryArray(int[] binaryArray, int num) {

         int index = 0;

         while (num != 0) {

             binaryArray[index] = num % 2;

             index++;

             num /= 2;

         }

     }

 }

readDataFile从文件中读取数据，

buildFPTree(ArrayList<String> suffixPattern,ArrayList<ArrayList<TreeNode>> transctionList)构建FP树（包括FP条件树），当suffixpatter不为空的时候构建的就是FP条件树，

nodeAddToLinkedList(TreeNode node,HashMap<String, ArrayList<TreeNode>> linkedList)，和邻接表类似，某个Node在树中出现的位置保存在linkedList中

private ArrayList<ArrayList<TreeNode>> getTransactionList(String name,HashMap<String, ArrayList<TreeNode>> linkedList）得到还有name节点的交易记录

nodeCounted(TreeNode node, HashMap<String, Integer> nodeCount)因为最后交易记录是以节点的计数多少进行排序的，这一个记录node在所有记录中出现的次数，同一条事务其实是没有先后顺序的，为了把树尽可能的减小才这样进行排序的

showFPTree(TreeNode node, int blankNum) 展示树

private TreeNode searchNode(TreeNode node, ArrayList<TreeNode> list) 要插入的节点在树中应该插入到哪个节点的下面呢，这里返回的是待插入节点的父节点

printFrequentPattern(ArrayList<String> suffixPattern,TreeNode rootNode)输出单条路径下的频繁模式，

常见的频繁项集挖掘算法有两类，一类是Apriori算法，另一类是FPGrowth。Apriori通过不断的构造候选集、筛选候选集挖掘出频繁项集，需要多次扫描原始数据，当原始数据较大时，磁盘I/O次数太多，效率比较低下。FPGrowth算法则只需扫描原始数据两遍，通过FP-tree数据结构对原始数据进行压缩，效率较高。

也许有人会问?如果这个数据库足够大,以至于构造的FP树大到无法完全保存在内存中,这该如何是好.这的确是个问题. Han Jiawei在论文中也给出了一种思路,就是通过将原来的大的数据库分区成几个小的数据库(这种小的数据库称之为投射数据库),对这几个小的数据库分别进行FP Growth算法.
还是拿上面的例子来说事,我们把包含p的所有数据库记录都单独存成一个数据库,我们称之为p-投射数据库,类似的m,b,a,c,f我们都可以生成相应的投射数据库,这些投射数据库构成的FP树相对而言大小就小得多,完全可以放在内存里.
在现代数据挖掘任务中,数据量越来越大,因此并行化的需求越来越大,上面提出的问题也越来越迫切.下一篇博客,博主将分析一下,FP Growth如何在MapReduce的框架下并行化.
[1]Mining Frequent Patterns without Candidate Gen

相关文章