Pandas Quick Start for Data Analysis, Part 3: Type Conversion

pythontesting 2024-07-01 08:09:01

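Series Type Conversion

3.1 Automatic Conversion

pandas 1.0 introduced a new conversion method, .convert_dtypes. It tries to convert a Series to a type that supports pd.NA. For the city_mpg series, it changes the type from int64 to Int64: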

>>> city_mpg.convert_dtypes()
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: Int64

>>> city_mpg.astype('Int16')
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: Int16

>>> city_mpg.astype('Int8')
Traceback (most recent call last):
  ...

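To set the type of a series explicitly, use the .astype method. Our city mileage values fit in a 16-bit integer but not in an 8-bit one, because the maximum of that signed type is 127 and some cars have a value of 150. A narrower type reduces memory use, which leaves more memory for processing more data. You can use NumPy to check the limits of integer and floating-point types: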

>>> np.iinfo('int64')
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

>>> np.iinfo('uint8')
iinfo(min=0, max=255, dtype=uint8)

>>> np.finfo('float16')
finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)

>>> np.finfo('float64')
finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
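
The range check described above can be automated. The following is a minimal sketch, not part of the original article: the helper name narrowest_int_dtype and the small sample series are made up for illustration, and the sample only stands in for city_mpg.

```python
import numpy as np
import pandas as pd


def narrowest_int_dtype(series):
    """Return the name of the narrowest signed integer type that holds every value.

    Hypothetical helper for illustration; it is not a pandas or NumPy function.
    """
    lo, hi = series.min(), series.max()
    for name in ("int8", "int16", "int32", "int64"):
        info = np.iinfo(name)
        if info.min <= lo and hi <= info.max:
            return name
    raise ValueError("values do not fit in a 64-bit signed integer")


# Made-up sample; 150 matches the largest mileage mentioned in the text.
mpg = pd.Series([19, 9, 23, 10, 150], name="city08")

dtype_name = narrowest_int_dtype(mpg)   # int8 tops out at 127, so int16 is chosen
small = mpg.astype(dtype_name)

print(dtype_name, mpg.nbytes, small.nbytes)
```

On this sample the int64 data takes 8 bytes per value and the int16 version takes 2, which is the saving the text refers to.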

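3.2 Memory Usage

To measure the memory used by a series, use the .nbytes attribute or the .memory_usage method. The latter is especially useful for object types, because passing deep=True also counts the memory of the Python objects stored in the series.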

>>> city_mpg.nbytes
329152

>>> city_mpg.astype('Int16').nbytes
123432

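The make column of the autos data contains strings and is stored as objects. To get the amount of memory that includes the strings, use the .memory_usage method: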

>>> make = df.make

>>> make.nbytes
329152

>>> make.memory_usage()
329280

>>> make.memory_usage(deep=True)
2606395

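.nbytes reports only the memory used by the data itself, not the supporting parts of the series. .memory_usage also includes the memory of the index and, with deep=True, the memory of the Python objects held in an object-typed series.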
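
To make the difference between the three measurements concrete, here is a small sketch on a made-up object series that stands in for the make column. Exact byte counts depend on the platform and pandas version, so the code compares the numbers instead of hard-coding them.

```python
import pandas as pd

# Made-up stand-in for the make column, stored explicitly as Python objects.
make = pd.Series(["Ford", "Toyota", "Ford", "Subaru"], dtype=object, name="make")

data_only = make.nbytes                          # just the array of object pointers
without_index = make.memory_usage(index=False)   # same data, index excluded
with_index = make.memory_usage()                 # data plus the index
deep = make.memory_usage(deep=True)              # also counts the string objects

print(data_only, without_index, with_index, deep)
```

The shallow numbers only count the pointers to the strings, which is why deep=True is needed to see what the strings themselves cost.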

3.3 Strings and Categories

Passing str to the .astype method also converts a numeric series to strings:

>>> city_mpg.astype(str)
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: object

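Categorical series are very useful for string data and can save a lot of memory, because pandas otherwise stores string data as Python strings.

When you convert such data to a categorical type, pandas no longer keeps a Python string for every value; it stores each distinct value once, so repeated values are not duplicated. You can still use everything the .str attribute offers, but you may save a lot of memory (if you have many repeated values) and gain performance, because far fewer string operations are needed.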
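
As a rough illustration of the saving, the sketch below converts a made-up, highly repetitive string series to the category type and compares deep memory usage. The data is invented for the example; real savings depend on how many values repeat.

```python
import pandas as pd

# Made-up data: three distinct strings repeated many times.
make = pd.Series(["Ford", "Toyota", "Subaru"] * 1000, dtype=object, name="make")

make_cat = make.astype("category")

obj_bytes = make.memory_usage(deep=True)      # one Python string per row
cat_bytes = make_cat.memory_usage(deep=True)  # small integer codes plus 3 categories

# The .str accessor still works on the categorical series.
upper = make_cat.str.upper()

print(obj_bytes, cat_bytes)
print(list(make_cat.cat.categories))
```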

3.4 Ordered Categories

To create an ordered categorical, define your own CategoricalDtype:

>>> values = pd.Series(sorted(set(city_mpg)))
>>> city_type = pd.CategoricalDtype(categories=values,
...     ordered=True)
>>> city_mpg.astype(city_type)
0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6 < 7 < 8 < 9 ... 137 < 138 < 140 < 150]
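
What the ordering buys you is comparisons and min/max on the categorical values. The sketch below uses a made-up five-value sample in place of city_mpg; the variable names are chosen for the example only.

```python
import pandas as pd

# Made-up sample standing in for city_mpg.
mpg = pd.Series([19, 9, 23, 10, 17], name="city08")

mpg_type = pd.CategoricalDtype(categories=sorted(set(mpg)), ordered=True)
mpg_cat = mpg.astype(mpg_type)

# With ordered=True, comparisons against a value that is one of the
# categories are allowed; an unordered categorical raises TypeError here.
above_17 = mpg_cat > 17

# min/max also require an ordered categorical.
highest = mpg_cat.max()

print(above_17.tolist(), highest)
```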

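The table below lists the types that can be passed to .astype.

3.5 Other Types

The .to_numpy method (or the .values attribute) returns a NumPy array, and .to_list returns a Python list. In general, avoid these methods. Working with NumPy directly is sometimes faster, but you give up the index and the other conveniences of a series. Working with Python lists makes the code much slower.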
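
For reference, this is what the two conversions return on a made-up sample series. The comparison of means is only there to show that both routes give the same answer; the pandas call stays vectorized while the list version runs in pure Python.

```python
import pandas as pd

# Made-up sample standing in for city_mpg.
mpg = pd.Series([19, 9, 23, 10, 17], name="city08")

arr = mpg.to_numpy()   # NumPy array, index is dropped
lst = mpg.to_list()    # plain Python list of Python ints

vectorized_mean = mpg.mean()          # computed by pandas
list_mean = sum(lst) / len(lst)       # computed element by element in Python

print(type(arr).__name__, type(lst).__name__, vectorized_mean, list_mean)
```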

If you just want a data frame with a single column, use the .to_frame method:

>>> city_mpg.to_frame()
       city08
0          19
1           9
2          23
3          10
4          17
...       ...
41139      19
41140      20
41141      18
41142      18
41143      16

[41144 rows x 1 columns]

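There are also many conversion methods that export data to other formats, including CSV, Excel, HDF5, SQL, and JSON. These methods exist on data frames as well and are not used much on series.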
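
As a brief sketch of two of these exporters on a series, the example below writes a made-up sample to CSV and JSON text in memory (no file path is given, so both methods return strings) and reads the CSV back.

```python
import io
import json

import pandas as pd

# Made-up sample standing in for city_mpg.
mpg = pd.Series([19, 9, 23], name="city08")

csv_text = mpg.to_csv()     # no path given, so the CSV is returned as a string
json_text = mpg.to_json()   # default layout maps index labels to values

# Read the CSV back to confirm the values survive the round trip.
round_trip = pd.read_csv(io.StringIO(csv_text), index_col=0)["city08"]

print(csv_text)
print(json.loads(json_text))
```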

To convert to a datetime, use the to_datetime function in pandas. Adding time zone information takes a few more steps, which the chapter on dates discusses.
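
A minimal sketch of the conversion, using made-up date strings. The time zone step shown here (.dt.tz_localize with "UTC") is one possible extra step and an assumption of this example; the article leaves the details to the chapter on dates.

```python
import pandas as pd

# Made-up ISO-formatted date strings.
raw = pd.Series(["2024-07-01", "2024-07-02", "2024-07-03"])

dates = pd.to_datetime(raw)              # strings become datetime values
days = dates.dt.day.tolist()             # datetime accessors are now available

# One possible extra step to attach a time zone to naive datetimes.
localized = dates.dt.tz_localize("UTC")

print(dates.dtype, localized.dtype)
```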
