How to interpret the results of Spark OneHotEncoder

I read the OneHotEncoder (OHE) entry in the Spark docs:

One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.

But sadly the docs do not fully explain the OHE result, so I ran the given code:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = sqlContext.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

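# Note: in Spark 2.x and earlier OneHotEncoder is a Transformer, as used here.
# In Spark 3.x it is an Estimator, so call encoder.fit(indexed) before transform.
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)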
encoded.show()

And got the results:

   +---+--------+-------------+-------------+
   | id|category|categoryIndex|  categoryVec|
   +---+--------+-------------+-------------+
   |  0|       a|          0.0|(2,[0],[1.0])|
   |  1|       b|          2.0|    (2,[],[])|
   |  2|       c|          1.0|(2,[1],[1.0])|
   |  3|       a|          0.0|(2,[0],[1.0])|
   |  4|       a|          0.0|(2,[0],[1.0])|
   |  5|       c|          1.0|(2,[1],[1.0])|
   +---+--------+-------------+-------------+

How should I interpret the OHE results (the last column)?

1 Solution

#1

One-hot encoding transforms the values in categoryIndex into binary vectors in which at most one value is 1. Since there are three values, the vector has length 2 and the mapping is as follows:

0  -> 10
1  -> 01
2  -> 00

(Why is the mapping like this? See this question about the one-hot encoder dropping the last category.)
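
The mapping above can be reproduced in plain Python. This is only a sketch of the idea, not Spark's implementation; the function name one_hot_drop_last is made up for illustration and assumes the last category is the one that gets dropped.

```python
def one_hot_drop_last(index, num_categories):
    """Encode a category index as a dense 0/1 list, dropping the last category.

    Hypothetical helper for illustration only; it is not part of the Spark API.
    The vector has num_categories - 1 slots, so the last index maps to all zeros.
    """
    size = num_categories - 1
    return [1.0 if i == int(index) else 0.0 for i in range(size)]


# Three categories (indices 0, 1, 2) give vectors of length 2.
for idx in (0.0, 1.0, 2.0):
    print(idx, "->", one_hot_drop_last(idx, 3))
```

With three categories, index 2 has no slot of its own, which is why it comes out as the all-zero vector.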

The values in column categoryVec are exactly these, but represented in sparse format, which does not print the zeros of a vector. The first value (2) is the length of the vector, the second value is an array that lists the zero or more indices where nonzero entries are found, and the third value is another array that gives the numbers found at these indices. So (2,[0],[1.0]) means a vector of length 2 with 1.0 at position 0 and 0 elsewhere, and (2,[],[]) is a vector of length 2 with no nonzero entries, which is how index 2 is encoded.
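
To make the sparse notation concrete, here is a small plain-Python sketch that expands a (size, indices, values) triple into a dense list. The helper name sparse_to_dense is an assumption for illustration and is not a Spark function; on a real PySpark SparseVector, calling toArray() should give the equivalent dense array.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list.

    Hypothetical helper for illustration only; it is not part of the Spark API.
    """
    dense = [0.0] * size          # start with all zeros
    for i, v in zip(indices, values):
        dense[i] = v              # fill in only the listed nonzero entries
    return dense


# The three distinct values that appear in the categoryVec column:
print(sparse_to_dense(2, [0], [1.0]))  # [1.0, 0.0]  category a (index 0)
print(sparse_to_dense(2, [1], [1.0]))  # [0.0, 1.0]  category c (index 1)
print(sparse_to_dense(2, [], []))      # [0.0, 0.0]  category b (index 2)
```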

See: https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector
