A Small Trick for Binning Identical Data in Python

2022-08-28 12:08 | 小澜ovo | Python

This article walks through a small trick for binning identical data in Python, with worked code examples for lists of numbers and lists of tuples.

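Preface

I recently needed a data-binning step at work: comparing elements only within each group of identical data instead of comparing across the whole dataset, which greatly reduces the number of IO operations and speeds up the program. I searched many blog posts without finding a solution, so here is my own approach.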

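What is binning?

Simply put, binning means placing different items into designated containers according to a specific condition. With fruit, for example, the green ones go into one basket and the red ones into another. The basket is the bin, the fruit is the data, and the color is the condition.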
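The fruit analogy can be sketched in a few lines of Python. This is only an illustration: the fruits list and the baskets name are invented for the example and do not come from the original project.

```python
from collections import defaultdict

# Hypothetical data: (fruit, color) pairs invented for this illustration
fruits = [("apple", "red"), ("kiwi", "green"), ("cherry", "red"), ("lime", "green")]

# Each color (the condition) gets its own basket (the bin)
baskets = defaultdict(list)
for name, color in fruits:
    baskets[color].append(name)

print(dict(baskets))  # {'red': ['apple', 'cherry'], 'green': ['kiwi', 'lime']}
```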

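What kind of data should be binned?

Data falls mainly into continuous variables and categorical variables. Binning is mostly applied to continuous variables.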
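As a rough illustration of binning a continuous variable, the sketch below maps each value to an interval with the standard-library bisect module. The ages, the cut points in edges, and the labels are arbitrary values chosen for the example.

```python
from bisect import bisect_right

# Hypothetical continuous data and cut points, chosen only for illustration
ages = [3.5, 17.0, 18.0, 42.3, 65.0, 80.1]
edges = [18, 65]                        # boundaries between bins
labels = ["minor", "adult", "senior"]   # one label per interval

# bisect_right counts how many edges are <= the value, which is the bin index
binned = [labels[bisect_right(edges, age)] for age in ages]
print(binned)  # ['minor', 'minor', 'adult', 'adult', 'senior', 'senior']
```

With bisect_right, a value equal to a cut point falls into the bin that starts at that cut point, so 18.0 counts as adult here.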

Why bin data?

Binning can improve stability, reduce time complexity, make the data easier to read, improve accuracy, and so on.

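Approach

Set last to the first element of the list and store that element in a temp list. Then compare every later element, starting from the second, with last; if it is equal, append it to temp.

When an element differs, set last to that new value, start a new temp list containing it, and append the new temp list to the result list. Because only adjacent elements are compared, equal values must already sit next to each other (for example, in a sorted list).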
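The steps above can be wrapped in a small reusable function. The sketch below is one possible packaging, not the code from the post: the name group_consecutive and its key parameter are hypothetical, and unlike the inline versions it also accepts an empty list.

```python
def group_consecutive(items, key=lambda x: x):
    """Split items into lists of adjacent elements whose key is equal."""
    groups = []
    last = None
    for item in items:
        k = key(item)
        if groups and k == last:
            groups[-1].append(item)   # same key as the previous item: same bin
        else:
            groups.append([item])     # first item or key changed: open a new bin
            last = k
    return groups

print(group_consecutive([1, 1, 2, 2, 2, 3]))  # [[1, 1], [2, 2, 2], [3]]
print(group_consecutive([]))                  # []
print(group_consecutive([1, 2, 1]))           # [[1], [2], [1]]
```

The last call shows the limitation of comparing neighbours only: the two 1s end up in separate bins because they are not adjacent.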

Type 1: numbers

Expected result

    [1,1,1,2,2,2,3,3,4,4,5,5,5,5,5]
    # becomes
    [[1, 1, 1], [2, 2, 2], [3, 3], [4, 4], [5, 5, 5, 5, 5]]

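Implementation

    # The list must be non-empty, with equal values adjacent
    box = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5, 5]
    last = box[0]          # value of the current bin
    temp = [box[0]]        # the current bin
    box_list = [temp]      # all bins
    for a in box[1:]:
        if a == last:
            temp.append(a)           # same value: stays in the current bin
        else:
            last = a                 # new value: start a new bin
            temp = [a]
            box_list.append(temp)
    print(box_list)  # [[1, 1, 1], [2, 2, 2], [3, 3], [4, 4], [5, 5, 5, 5, 5]]
    # Iterate bin by bin (instead of over the whole flat list)
    for boxs in box_list:
        for i in boxs:
            print(i)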
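For comparison, the standard library's itertools.groupby also groups runs of equal adjacent elements, so the same result can be produced in one line. This is an alternative shown for reference, not the method the post uses.

```python
from itertools import groupby

box = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5, 5]

# groupby yields (key, iterator over the run of equal adjacent items)
box_list = [list(run) for _, run in groupby(box)]
print(box_list)  # [[1, 1, 1], [2, 2, 2], [3, 3], [4, 4], [5, 5, 5, 5, 5]]
```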

Type 2: tuples

Expected result

    box = [('小黑','20','四川'),('小黑','21','北京'),('张三','18','上海'),('张三','22','上海'),('张三','30','北京'),('李四','10','广州')]
    # Put the tuples that share the same name into one list
    [[('小黑', '20', '四川'), ('小黑', '21', '北京')], [('张三', '18', '上海'), ('张三', '22', '上海'), ('张三', '30', '北京')], [('李四', '10', '广州')]]

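Implementation

    # Each tuple is (name, age, address); tuples with the same name must be adjacent
    box = [('小黑','20','四川'),('小黑','21','北京'),('张三','18','上海'),('张三','22','上海'),('张三','30','北京'),('李四','10','广州')]
    last = box[0][0]       # name of the current bin
    temp = [box[0]]        # the current bin
    box_list = [temp]      # all bins
    for a in box[1:]:
        if a[0] == last:
            temp.append(a)           # same name: stays in the current bin
        else:
            last = a[0]              # new name: start a new bin
            temp = [a]
            box_list.append(temp)
    print(box_list)
    # Iterate bin by bin (instead of over the whole flat list)
    for boxs in box_list:
        for i in boxs:
            print(i[0])  # index 0 is the name, 1 the age, 2 the address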
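The loop above relies on records with the same name being adjacent. If the records may arrive in any order, one option is to sort by name first and then group. A minimal sketch of that route, using itertools.groupby on a deliberately shuffled copy of the sample data (the by_name helper is a name introduced here for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Same records as above, but deliberately out of order
box = [('张三', '18', '上海'), ('小黑', '20', '四川'), ('李四', '10', '广州'),
       ('张三', '22', '上海'), ('小黑', '21', '北京'), ('张三', '30', '北京')]

by_name = itemgetter(0)
# sorted() is stable, so records keep their relative order within each name
box_list = [list(run) for _, run in groupby(sorted(box, key=by_name), key=by_name)]

for group in box_list:
    print(group[0][0], len(group))  # name and number of records in the bin
```

Sorting costs O(n log n), so this is only worthwhile when the input is not already ordered by the grouping key.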

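Appendix: the cut method in pandas can also be used to bin data.

    import random
    import string
    import numpy as np
    import pandas as pd
    # Randomly generate 20 integer scores in [25, 100); the upper bound is exclusive
    score_list = np.random.randint(25, 100, size=20)
    # Bin edges, giving the intervals (0, 59], (59, 70], (70, 80], (80, 100]
    bins = [0, 59, 70, 80, 100]
    # Bin the scores
    score_cat = pd.cut(score_list, bins)
    # Count the scores in each interval
    print(score_cat.value_counts())
    # Start from an empty DataFrame
    df = pd.DataFrame()
    df['Score'] = score_list
    # Generate 20 random 5-letter names with the standard library
    df['Name'] = [''.join(random.choices(string.ascii_letters, k=5)) for _ in range(20)]
    # labels correspond to the intervals defined by bins
    df['Categories'] = pd.cut(df['Score'], bins, labels=['Fail', 'Average', 'Good', 'Excellent'])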
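Because the scores above are random, the result differs on every run. The following sketch uses a small fixed list of scores (made up for the example, as are the English labels) to show how pd.cut assigns values to right-closed intervals and how the labels line up with them.

```python
import pandas as pd

# Hypothetical fixed scores, so the result is reproducible
scores = pd.Series([45, 59, 60, 70, 75, 80, 95, 100])
bins = [0, 59, 70, 80, 100]
labels = ['fail', 'average', 'good', 'excellent']  # one label per interval

# By default intervals are closed on the right: (0, 59], (59, 70], (70, 80], (80, 100]
categories = pd.cut(scores, bins, labels=labels)
print(categories.tolist())
# ['fail', 'fail', 'average', 'average', 'good', 'good', 'excellent', 'excellent']

# Count how many scores fall into each bin, in bin order
print(categories.value_counts(sort=False).tolist())  # [2, 2, 2, 2]
```

Note that a score of exactly 59 lands in the first bin and 60 in the second, because each interval includes its right edge but not its left one.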