17 Great Machine Learning Libraries

08 October 2013

After wonderful feedback on my previous post on Scikit-learn from the guys at /r/MachineLearning, I decided to collect the list of machine learning libraries into this separate note. Let me know if there’s a library that should be included here.

Update (15 May 2014): thanks to Djalel Benbouzid and Dwayne Campbell for additional suggestions. Sorry it’s taken me so long to add them…


  • Scikit-learn: comprehensive and easy to use; I wrote a whole article on why I like this library.
  • PyBrain: neural networks are one thing missing from Scikit-learn, but this module makes up for it.
  • nltk: really useful if you’re doing anything NLP or text mining related.
  • Theano: efficient computation of mathematical expressions using a GPU. Excellent for deep learning.
  • Pylearn2: machine learning toolbox built on top of Theano - in very early stages of development.
  • MDP (Modular toolkit for Data Processing): a framework that is useful when setting up workflows.


  • Spark: Apache’s new upstart, supposedly up to a hundred times faster than Hadoop. It now includes MLlib, which contains a good selection of machine learning algorithms, including classification, clustering and recommendation generation. It is currently undergoing rapid development, and you can develop in Python as well as JVM languages.
  • Mahout: Apache’s machine learning framework built on top of Hadoop. It looks promising, but comes with all the baggage and overhead of Hadoop.
  • Weka: this is a Java based library with a graphical user interface that allows you to run experiments on small datasets. This is great if you restrict yourself to playing around to get a feel for what is possible with machine learning. However, I would avoid using this in production code at all costs: the API is very poorly designed, the algorithms are not optimised for production use and the documentation is often lacking.
  • Mallet: another Java based library with an emphasis on document classification. I’m not so familiar with this one, but if you have to use Java this is bound to be better than Weka.
  • JSAT: stands for “Java Statistical Analysis Tool”. Created by Edward Raff, it was born out of his frustration with Weka (I know the feeling). Looks pretty cool.


  • Accord.NET: this seems to be pretty comprehensive, and comes recommended by primaryobjects on Reddit. There is perhaps a slight slant towards image processing and computer vision, as it builds on the popular library AForge.NET for this purpose.
  • Another option is to use one of the Java libraries compiled to .NET using IKVM - I have used this approach with success in production.


  • Vowpal Wabbit: designed for very fast learning and released under a BSD license, this comes recommended by terath on Reddit.
  • MultiBoost: a fast C++ framework implementing some boosting algorithms as well as some cascades (like the Viola-Jones cascades). It’s mainly focused on AdaBoost.MH so it is multi-class/multi-label.
  • Shogun: large machine learning library with a focus on kernel methods and support vector machines. Bindings to Matlab, R, Octave and Python.


  • LibSVM and LibLinear: these are C libraries for support vector machines; there are also bindings or implementations for many other languages. These are the libraries used for support vector machine learning in Scikit-learn.


This article is a work in progress, so please send me your comments or criticisms!


1. Shark, C++ based
2. scikit, Python based
3. weka, Java based
4. opencv-ml, C++ based; used mostly in image processing, and I have worked with it before

Environment: Win32, VS10


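[...] Seriously wrong: the SVN checkout is the development version, which is sometimes missing files, so the VS build fails and the library ends up unusable. When I installed from the SVN checkout, the LinAlg files were missing and nothing worked at all. I strongly advise against this route.
The second article is wrong about http://shark-project.sourceforge.net/: the files cannot be found there and the address has long been dead. The installation and usage described later in that article are passable.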


Shark is built with CMake and requires the C++ Boost libraries. Details follow.


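The home page of the Shark Machine Learning Library is http://shark-project.sourceforge.net/. Shark was developed at Ruhr University Bochum in Germany and won a gold prize at a 2011 world open source competition. Shark is built on C++ generic programming and makes heavy use of templates, so its encapsulation and inheritance structure are very good. Because it is written in C++, its functions are also reasonably efficient.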


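  1. ReClaM     Regression and classification module; covers linear methods, neural networks, SVMs, kernels, etc.
  2. EALib      Evolutionary computation module
  3. MOO-EAlib  Multi-objective evolutionary computation
  4. Fuzzy      Fuzzy computation module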

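OK, let's begin with the installation. The Shark library can be installed on Microsoft Windows, Linux and Mac operating systems; this article covers installation on Microsoft Windows. Note that in the downloaded Shark package, under the path Shark/doc/TutorialsOld/ [...]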

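Step 1: prepare the software and generate the build files. You need the cross-platform build tool CMake v2.8 and Microsoft Visual Studio 2005 or later. My Shark package is in D:/shark; in CMake, set the source directory to D:/shark and the build directory to D:/build_shark.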
Click the Configure button, select the compiler we need (VS2005), then click Generate.

Now look in D:/build_shark: CMake has generated the build files that VS2005 needs.

Step 2: compile and link with VS2005 to obtain the static library we need, shark.lib.

Double-click shark.sln in the build_shark folder to load the project into VS2005.

Here you can see shark [...]


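OK, once the build finishes, the build_shark folder contains several new folders. The examples folder holds all the example programs; they have not been built yet, so build whichever ones you need yourself. The one that matters is the debug folder, where we finally find what we need: shark.lib.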


In the next post I will explain how to bring the shark.lib we built into your own project and run an example.


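In the previous post we ended up with shark.lib, the static library for the Shark Machine Learning Library. This post continues from there: using that library, we run one of Shark's bundled examples in VS2005. The example is called “TSP_GA”; as the name suggests, it uses a genetic algorithm to solve the TSP (travelling salesman problem).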
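For readers unfamiliar with the idea behind TSP_GA, here is a minimal sketch of a genetic algorithm for the TSP. It is not Shark's TSP_GA.cpp and makes no claim about how that example is implemented; it is written in plain Python (standard library only) for brevity, and every name in it (`tour_length`, `order_crossover`, `swap_mutation`, `solve_tsp_ga`) and every parameter value is an assumption chosen for illustration. It uses tournament selection, order crossover, swap mutation and one-individual elitism, which is one common combination among many.

```python
import math
import random


def tour_length(tour, cities):
    """Length of the closed tour that visits cities in the given order."""
    n = len(tour)
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % n]]) for i in range(n)
    )


def order_crossover(p1, p2, rng):
    """Order crossover: keep a slice of p1, fill the rest in p2's order."""
    n = len(p1)
    a, b = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[a:b + 1] = p1[a:b + 1]
    kept = set(p1[a:b + 1])
    fill = iter(c for c in p2 if c not in kept)
    for i in range(n):
        if child[i] is None:
            child[i] = next(fill)
    return child


def swap_mutation(tour, rng, rate):
    """With probability `rate`, swap two randomly chosen cities."""
    tour = tour[:]
    if rng.random() < rate:
        i, j = rng.sample(range(len(tour)), 2)
        tour[i], tour[j] = tour[j], tour[i]
    return tour


def tournament(pop, cities, rng, k=3):
    """Pick the shortest tour among k randomly drawn individuals."""
    return min(rng.sample(pop, k), key=lambda t: tour_length(t, cities))


def solve_tsp_ga(cities, pop_size=60, generations=200, mutation_rate=0.3, seed=0):
    """Return (best_tour, best_length) found by a simple generational GA."""
    rng = random.Random(seed)
    n = len(cities)
    # Each individual is a random permutation of the city indices.
    pop = [rng.sample(range(n), n) for _ in range(pop_size)]
    best = min(pop, key=lambda t: tour_length(t, cities))
    for _ in range(generations):
        next_pop = [best]  # elitism: the best tour always survives
        while len(next_pop) < pop_size:
            p1 = tournament(pop, cities, rng)
            p2 = tournament(pop, cities, rng)
            child = order_crossover(p1, p2, rng)
            next_pop.append(swap_mutation(child, rng, mutation_rate))
        pop = next_pop
        best = min(pop, key=lambda t: tour_length(t, cities))
    return best, tour_length(best, cities)


if __name__ == "__main__":
    point_rng = random.Random(42)
    demo_cities = [(point_rng.random(), point_rng.random()) for _ in range(12)]
    tour, length = solve_tsp_ga(demo_cities)
    print("tour:", tour)
    print("length:", round(length, 3))
```

Because the best tour is carried over unchanged each generation, the best length never gets worse from one generation to the next; whether it reaches the true optimum depends on the population size, the number of generations and the random seed.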


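Step 1: go to Shark\examples\EALib and find the source file used in this post, TSP_GA.cpp. Create a new project, and inside the project directory create two folders, one named include and one named lib, to hold Shark's header files and the link library respectively.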
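If you prefer to script the folder setup rather than copy files by hand, a small helper along these lines may do. It is only a sketch: the function name `stage_shark` is hypothetical, and the location of Shark's headers and of the built shark.lib is not assumed here, so both are passed in as arguments that you must fill in for your own machine.

```python
import shutil
from pathlib import Path


def stage_shark(header_dir, lib_file, project_dir):
    """Copy headers into <project_dir>/include and the library into <project_dir>/lib.

    header_dir and lib_file are whatever paths hold Shark's headers and the
    built shark.lib on your machine (assumptions supplied by the caller).
    """
    project = Path(project_dir)
    include_dst = project / "include"
    lib_dst = project / "lib"
    include_dst.mkdir(parents=True, exist_ok=True)
    lib_dst.mkdir(parents=True, exist_ok=True)
    # Copy the whole header tree, merging into include/ if it already exists.
    shutil.copytree(header_dir, include_dst, dirs_exist_ok=True)
    # Copy the static library itself into lib/.
    copied = shutil.copy2(lib_file, lib_dst)
    return include_dst, Path(copied)


# Example call with placeholder paths (replace them with your own):
# stage_shark("path/to/shark/headers", "path/to/shark.lib", "path/to/my_project")
```

The project's include and library search paths in Visual Studio still have to point at these two folders; the script only puts the files in place.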




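Step 3: run the TSP_GA project. Success! Congratulations, you have successfully installed the Shark library!

A note: because this is a console application, the window may flash and disappear as soon as the program finishes. A small trick is to add getchar(); at the end of the program, so that it exits only after you press Enter.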


