Dataset Splitting Methods

CSDN 2024-08-18 12:01:05

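Splitting a dataset is an important step in machine learning and data science. Its main purpose is to make sure a model's performance is measured reliably, on data the model did not see during training.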

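Hold-out method (simple cross-validation)

Split the dataset into two mutually exclusive subsets: a training set and a test set.

Training set: used to train the model.

Test set: used to evaluate the trained model's performance on data it has not seen.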
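
To make "mutually exclusive" concrete, here is a minimal sketch of a hold-out split written with the standard library only. The helper name `holdout_split` and its parameters are hypothetical and chosen for illustration; in practice `train_test_split` from scikit-learn does this job.

```python
import random

def holdout_split(n_samples, test_ratio=0.2, seed=6):
    """Return disjoint (train_indices, test_indices) for a hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)     # shuffle the sample indices
    n_test = round(n_samples * test_ratio)   # number of samples held out for testing
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(150)
print(len(train_idx), len(test_idx))     # 120 30
print(set(train_idx) & set(test_idx))    # set(): the two subsets share no sample
```

Because both subsets are slices of one shuffled index list, no sample can end up in both of them.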

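Cross-validation

Split the dataset into k mutually exclusive folds. In each round one fold serves as the validation set and the remaining folds form the training set, so every sample is used for validation exactly once. A separate test set is usually held out beforehand.

Training set: used to train the model.

Validation set: used to tune the model's hyperparameters and to select the best model.

Test set: used for the final evaluation of the model's performance.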
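
The three roles described above can be sketched as follows. This is an illustration under assumptions, not the article's own code: the k-nearest-neighbours classifier, the candidate values, and the 5-fold setting are all chosen here for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold the test set out first; it is not touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=6
)

# Score each candidate by 5-fold cross-validation on the training data:
# every fold takes one turn as the validation set.
candidates = [1, 3, 5, 7]  # assumed hyperparameter grid
mean_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5).mean()
    for k in candidates
}
best_k = max(mean_scores, key=mean_scores.get)

# Refit on the whole training set, then evaluate once on the test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print("best n_neighbors:", best_k)
print("test accuracy:", round(test_accuracy, 3))
```

The test set is used only once, after the hyperparameter has been fixed, so the reported accuracy is not biased by the selection step.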

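Leave-one-out (leave-P-out)

Leave-one-out (leave-P-out) is a special case of cross-validation: in each round one sample (P samples) is taken from the dataset as the test set, and all remaining samples form the training set.

Test set: a single sample (P samples) in each round.

Training set: all remaining samples.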
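
A practical consequence is the number of rounds. Leave-one-out needs one round per sample, while leave-P-out needs one round per P-element subset, which grows very quickly. The short sketch below counts the rounds for a dataset of 150 samples (the size of the iris dataset) with P = 3, and lists the first few test sets in lexicographic order. The variable names are illustrative.

```python
from itertools import combinations, islice
from math import comb

n_samples = 150  # size of the iris dataset
p = 3

print("leave-one-out rounds:", n_samples)         # 150
print("leave-p-out rounds:", comb(n_samples, p))  # 551300

# The first three leave-p-out test sets, enumerated lexicographically.
first_test_sets = list(islice(combinations(range(n_samples), p), 3))
print(first_test_sets)  # [(0, 1, 2), (0, 1, 3), (0, 1, 4)]
```

With more than half a million rounds for P = 3, leave-P-out is usually only feasible on small datasets.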

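Bootstrap method

The bootstrap method builds the training set and the test set by sampling with replacement.

Training set: samples drawn at random from the dataset with replacement, so the same sample may be drawn more than once. The example in this article draws as many samples as the dataset contains.

Test set: the samples that were never selected during sampling.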
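
How large is the resulting test set? When n samples are drawn with replacement from a dataset of n samples, any given sample is missed with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below checks this with a small simulation; the helper name `out_of_bag_fraction` is hypothetical.

```python
import math
import random

def out_of_bag_fraction(n_samples, seed=6):
    """Draw n_samples indices with replacement; return the fraction never drawn."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples

n = 10_000
simulated = out_of_bag_fraction(n)
expected = (1 - 1 / n) ** n
print(f"simulated: {simulated:.3f}, (1 - 1/n)^n: {expected:.3f}, 1/e: {1 / math.e:.3f}")
```

So roughly a third of the data is left over for testing, without setting aside a fixed test set in advance.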

<code># load_iris loads the iris dataset
from sklearn.datasets import load_iris
# Counter counts how many samples of each class a split contains
from collections import Counter
# Splitters and cross-validation iterators from sklearn.model_selection
from sklearn.model_selection import train_test_split  # one random train/test split
from sklearn.model_selection import ShuffleSplit  # repeated random train/test splits
from sklearn.model_selection import StratifiedShuffleSplit  # repeated random splits that preserve class proportions
from sklearn.model_selection import KFold  # K-fold cross-validation
from sklearn.model_selection import StratifiedKFold  # K-fold cross-validation that preserves class proportions
from sklearn.model_selection import LeaveOneOut  # leave-one-out cross-validation
from sklearn.model_selection import LeavePOut  # leave-p-out cross-validation
# pandas is used for data handling in the bootstrap example
import pandas as pd

# Features and labels used by the functions below
X, y = load_iris(return_X_y=True)

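Hold-out method

# Hold-out method (simple cross-validation)
def split_data():
    # Hold-out (random split)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=6)
    print('Test-set class distribution after random split:', Counter(y_test))
    # Hold-out (stratified split)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=6)
    print('Test-set class distribution after stratified split:', Counter(y_test))

# Generate several train/test splits
def multiple_splits():
    # Hold-out (random split)
    shuffle_splitter = ShuffleSplit(n_splits=2, test_size=0.2, random_state=6)
    for train_indices, test_indices in shuffle_splitter.split(X, y):
        print('Test-set class distribution after random split:', Counter(y[test_indices]))
    # Hold-out (stratified split)
    stratified_splitter = StratifiedShuffleSplit(n_splits=2, test_size=0.2, random_state=6)
    for train_indices, test_indices in stratified_splitter.split(X, y):
        print('Test-set class distribution after stratified split:', Counter(y[test_indices]))

if __name__ == '__main__':
    print("One random split and one stratified split:")
    split_data()
    print("Multiple random splits and multiple stratified splits:")
    multiple_splits()

Output:

One random split and one stratified split:
Test-set class distribution after random split: Counter({2: 11, 1: 10, 0: 9})
Test-set class distribution after stratified split: Counter({0: 10, 1: 10, 2: 10})
Multiple random splits and multiple stratified splits:
Test-set class distribution after random split: Counter({2: 11, 1: 10, 0: 9})
Test-set class distribution after random split: Counter({2: 14, 0: 9, 1: 7})
Test-set class distribution after stratified split: Counter({0: 10, 1: 10, 2: 10})
Test-set class distribution after stratified split: Counter({1: 10, 0: 10, 2: 10})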

Cross-validation

def cross_validation():
    # Random K-fold cross-validation
    kf_splitter = KFold(n_splits=2, shuffle=True, random_state=6)
    for train_indices, test_indices in kf_splitter.split(X, y):
        print('Test-set class distribution, random cross-validation:', Counter(y[test_indices]))
    # Stratified K-fold cross-validation
    stratified_kf_splitter = StratifiedKFold(n_splits=2, shuffle=True, random_state=6)
    for train_indices, test_indices in stratified_kf_splitter.split(X, y):
        print('Test-set class distribution, stratified cross-validation:', Counter(y[test_indices]))

if __name__ == '__main__':
    cross_validation()

Output:

Test-set class distribution, random cross-validation: Counter({0: 25, 1: 25, 2: 25})
Test-set class distribution, random cross-validation: Counter({0: 25, 1: 25, 2: 25})
Test-set class distribution, stratified cross-validation: Counter({0: 25, 1: 25, 2: 25})
Test-set class distribution, stratified cross-validation: Counter({0: 25, 1: 25, 2: 25})
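
A property worth checking is that the test folds of a K-fold split partition the dataset: each sample lands in exactly one test fold. The following sketch verifies this for 5 folds over 150 samples. The choice of 5 folds and the placeholder feature array are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 150
placeholder_X = np.zeros((n_samples, 1))  # KFold only needs the number of rows

kf = KFold(n_splits=5, shuffle=True, random_state=6)
times_tested = np.zeros(n_samples, dtype=int)
for train_indices, test_indices in kf.split(placeholder_X):
    # Within one round the training and test indices never overlap.
    assert np.intersect1d(train_indices, test_indices).size == 0
    times_tested[test_indices] += 1

# Every sample was used for testing exactly once across the 5 rounds.
print(np.unique(times_tested))  # [1]
```

This is what distinguishes K-fold cross-validation from repeated random hold-out splits such as `ShuffleSplit`, where a sample may be tested several times or not at all.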

Leave-one-out (leave-P-out)

def leave_one_and_p_out():
    # Leave-one-out
    loo_splitter = LeaveOneOut()
    print('Leave-one-out:')
    count1 = 0
    for train_indices, test_indices in loo_splitter.split(X, y):
        if count1 < 5:  # print only the first five splits
            print(f'Training samples: {len(train_indices)}, test samples: {len(test_indices)}, test indices: {test_indices}')
            count1 += 1
        else:
            break
    # Leave-p-out
    lpo_splitter = LeavePOut(p=3)
    print('Leave-p-out:')
    count2 = 0
    for train_indices, test_indices in lpo_splitter.split(X, y):
        if count2 < 5:  # print only the first five splits
            print(f'Training samples: {len(train_indices)}, test samples: {len(test_indices)}, test indices: {test_indices}')
            count2 += 1
        else:
            break

if __name__ == '__main__':
    leave_one_and_p_out()

Output:

Leave-one-out:
Training samples: 149, test samples: 1, test indices: [0]
Training samples: 149, test samples: 1, test indices: [1]
Training samples: 149, test samples: 1, test indices: [2]
Training samples: 149, test samples: 1, test indices: [3]
Training samples: 149, test samples: 1, test indices: [4]
Leave-p-out:
Training samples: 147, test samples: 3, test indices: [0 1 2]
Training samples: 147, test samples: 3, test indices: [0 1 3]
Training samples: 147, test samples: 3, test indices: [0 1 4]
Training samples: 147, test samples: 3, test indices: [0 1 5]
Training samples: 147, test samples: 3, test indices: [0 1 6]

Bootstrap method

iris = load_iris()
# Use only the first five samples so the result is easy to inspect
data = pd.DataFrame(data=iris.data[:5], columns=iris.feature_names)
data['target'] = iris.target[:5]
print('Dataset:\n', data)
print('=' * 80)
# Build the training set: sample with replacement, same size as the dataset
train = data.sample(frac=1, replace=True, random_state=6)
print('Training set:\n', train)
print('=' * 80)
# Build the test set: the rows whose index was never drawn
test = data.loc[data.index.difference(train.index)]
print('Test set:\n', test)

Output:

Dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
================================================================================
Training set:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
2                4.7               3.2                1.3               0.2       0
1                4.9               3.0                1.4               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
================================================================================
Test set:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
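
The pandas version above identifies unselected rows by index label, which assumes the index labels are unique. As an alternative sketch, the same split can be built from row positions with NumPy. The helper name `bootstrap_indices` is hypothetical, and the sampled rows will differ from the pandas output because a different random generator is used.

```python
import numpy as np

def bootstrap_indices(n_samples, seed=6):
    """Return (train_positions, test_positions) for one bootstrap round."""
    rng = np.random.default_rng(seed)
    # Draw n_samples row positions with replacement; duplicates are expected.
    train_idx = rng.integers(0, n_samples, size=n_samples)
    # The test set is every position that was never drawn.
    test_idx = np.setdiff1d(np.arange(n_samples), train_idx)
    return train_idx, test_idx

train_idx, test_idx = bootstrap_indices(150)
print("training samples:", len(train_idx))
print("distinct training samples:", len(np.unique(train_idx)))
print("test samples:", len(test_idx))
```

The distinct training samples and the test samples always add up to the size of the dataset, since every row is either drawn at least once or not at all.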
