Python pandas: changing some column types to category

Date: 2023-01-12 00:06:50

I have fed the following CSV file into iPython Notebook:

public = pd.read_csv("categories.csv")
public

I've also imported pandas as pd, numpy as np and matplotlib.pyplot as plt. The following data types are present (this is a summary; there are about 100 columns):

In [36]:   public.dtypes
Out[37]:   parks          object
           playgrounds    object
           sports         object
           roading        object               
           resident       int64
           children       int64

I want to change 'parks', 'playgrounds', 'sports' and 'roading' to categories, leaving the remainder as int64. These columns hold Likert scale responses, though each column uses a different scale (e.g. one has "strongly agree", "agree" etc., another has "very important", "important" etc.).

I was able to create a separate dataframe - public1 - and change one of the columns to a category type using the following code:

public1 = {'parks': public.parks}
public1 = public1['parks'].astype('category')

However, when I tried to change several columns at once using this code, I was unsuccessful:

public1 = {'parks': public.parks,
           'playgrounds': public.parks}
public1 = public1['parks', 'playgrounds'].astype('category')

Notwithstanding this, I don't want to create a separate dataframe with just the categories columns. I would like them changed in the original dataframe.

I tried numerous ways to achieve this, then tried the code here: Pandas: change data type of columns...

public[['parks', 'playgrounds', 'sports', 'roading']] = public[['parks', 'playgrounds', 'sports', 'roading']].astype('category')

and got the following error:

 NotImplementedError: > 1 ndim Categorical are not supported at this time

Is there a way to change 'parks', 'playgrounds', 'sports' and 'roading' to categories (so the Likert scale responses can then be analysed), leaving 'resident' and 'children' (and the 94 other columns, which are strings, ints and floats) untouched? Or is there a better way to do this? If anyone has any suggestions or feedback I would be most grateful... I am slowly going bald ripping my hair out!

Many thanks in advance.

edited to add - I am using Python 2.7.

3 solutions

#1


39  

Sometimes, you just have to use a for-loop:

for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
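
The loop above can be tried on a small stand-in frame. The sketch below is not part of the original answer: the toy data and the `scales` mapping are made up for illustration, and it swaps the plain 'category' string for an ordered CategoricalDtype per column, on the assumption that each Likert column should keep its own ordering.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy stand-in for the asker's `public` frame (values are made up).
public = pd.DataFrame({
    'parks': ['agree', 'strongly agree', 'disagree'],
    'sports': ['important', 'very important', 'not important'],
    'resident': [1, 2, 3],
})

# Hypothetical mapping: each Likert column gets its own scale, lowest first.
scales = {
    'parks': ['disagree', 'agree', 'strongly agree'],
    'sports': ['not important', 'important', 'very important'],
}

# Same for-loop idea as above, but with an explicit ordered dtype per column.
for col, levels in scales.items():
    dtype = CategoricalDtype(categories=levels, ordered=True)
    public[col] = public[col].astype(dtype)

print(public.dtypes)  # 'resident' is left as an integer column

# Ordered categoricals can be compared against one of their own levels.
at_least_agree = (public['parks'] >= 'agree').tolist()
print(at_least_agree)  # [True, True, False]
```

If the ordering does not matter, `public[col].astype('category')` inside the loop, as in the answer, is enough.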

#2


16  

You can use the pandas.DataFrame.apply method along with a lambda expression to solve this. In your example you could use

df[['parks', 'playgrounds', 'sports']].apply(lambda x: x.astype('category'))

I don't know of a way to execute this inplace, so typically I'll end up with something like this:

df[df.select_dtypes(['object']).columns] = df.select_dtypes(['object']).apply(lambda x: x.astype('category'))

Obviously you can replace .select_dtypes with explicit column names if you don't want to select all of a certain datatype (although in your example it seems like you wanted all object types).
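
As a rough check of the select_dtypes approach, here is a minimal sketch on a made-up frame. The column values are invented, and the string columns are built with an explicit object dtype, since this is an assumption the selection depends on (newer pandas versions may infer a dedicated string dtype instead).

```python
import pandas as pd

# Made-up data; the text columns are forced to object dtype on purpose.
df = pd.DataFrame({
    'parks': pd.Series(['agree', 'disagree', 'agree'], dtype=object),
    'roading': pd.Series(['important', 'important', 'very important'], dtype=object),
    'children': [0, 2, 1],
})

# Pick out the object columns once, then assign the converted columns back.
obj_cols = df.select_dtypes(['object']).columns
df[obj_cols] = df[obj_cols].apply(lambda s: s.astype('category'))

print(df.dtypes)  # only the former object columns should now be category
```

Assigning back to `df[obj_cols]` is what makes the change stick in the original frame; the `apply` call on its own only returns a converted copy.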

#3


7  

As of pandas 0.19.0, the What's New notes describe that read_csv supports parsing Categorical columns directly. This answer applies only if you're starting from read_csv; otherwise, I think unutbu's answer is still best. Example on 10,000 records:

import pandas as pd
import numpy as np

# Generate random data, four category-like columns, two int columns
N=10000
categories = pd.DataFrame({
            'parks' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
            'playgrounds' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
            'sports' : np.random.choice(['important', 'very important', 'not important'], size=N),
            'roading' : np.random.choice(['important', 'very important', 'not important'], size=N),
            'resident' : np.random.choice([1, 2, 3], size=N),
            'children' : np.random.choice([0, 1, 2, 3], size=N)
                       })
categories.to_csv('categories_large.csv', index=False)

<0.19.0 (or >=0.19.0 without specifying dtype)

pd.read_csv('categories_large.csv').dtypes # inspect default dtypes

children        int64
parks          object
playgrounds    object
resident        int64
roading        object
sports         object
dtype: object

>=0.19.0

When the file has mixed dtypes, selected columns can be parsed as Categorical by passing a dictionary dtype={'colname' : 'category', ...} to read_csv.

pd.read_csv('categories_large.csv', dtype={'parks': 'category',
                                           'playgrounds': 'category',
                                           'sports': 'category',
                                           'roading': 'category'}).dtypes
children          int64
parks          category
playgrounds    category
resident          int64
roading        category
sports         category
dtype: object
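
The same idea can be tried without writing a file. This is only a sketch: the in-memory CSV text and the `likert_cols` list are made up, and io.StringIO stands in for the real categories.csv.

```python
import io
import pandas as pd

# Small made-up CSV held in memory instead of a file on disk.
csv_text = """parks,sports,resident
agree,important,1
disagree,very important,2
agree,not important,3
"""

# Hypothetical list of the columns that should be parsed as categories.
likert_cols = ['parks', 'sports']
public = pd.read_csv(io.StringIO(csv_text),
                     dtype={col: 'category' for col in likert_cols})

print(public.dtypes)
print(public['parks'].cat.categories.tolist())
```

Columns not named in the dtype dictionary, such as 'resident' here, are still inferred as usual.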

Performance

A slight speed-up (local jupyter notebook), as mentioned in the release notes.

%%timeit
# unutbu's answer
public = pd.read_csv('categories_large.csv')
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
10 loops, best of 3: 20.1 ms per loop

%%timeit
# parsed during read_csv
category_cols = {item: 'category' for item in ['parks', 'playgrounds', 'sports', 'roading']}
public = pd.read_csv('categories_large.csv', dtype=category_cols)
100 loops, best of 3: 14.3 ms per loop
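
Outside a notebook, a similar comparison could be sketched with the standard timeit module. This is an approximation of the cells above, not the original benchmark: the data is regenerated in memory, the function names are made up, and the timings will differ by machine and pandas version.

```python
import io
import timeit

import numpy as np
import pandas as pd

# Regenerate comparable random data in memory (seeded for repeatability).
rng = np.random.RandomState(0)
N = 10000
cols = ['parks', 'playgrounds', 'sports', 'roading']
frame = pd.DataFrame({c: rng.choice(['agree', 'disagree', 'strongly agree'], size=N)
                      for c in cols})
frame['resident'] = rng.choice([1, 2, 3], size=N)
csv_text = frame.to_csv(index=False)


def convert_after_read():
    # Read with default dtypes, then convert column by column.
    public = pd.read_csv(io.StringIO(csv_text))
    for col in cols:
        public[col] = public[col].astype('category')
    return public


def convert_during_read():
    # Ask read_csv to parse the columns as categories directly.
    return pd.read_csv(io.StringIO(csv_text),
                       dtype={c: 'category' for c in cols})


for fn in (convert_after_read, convert_during_read):
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print('%s: %.2f ms per call' % (fn.__name__, best * 1e3))
```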
