How to Use Cross-Validation in Deep Learning

梦里梦见梦不见的497 · 2024-09-20 09:31:01

In short, you split the data into n parts, train on n-1 of them, and validate on the remaining one; that is n-fold cross-validation.

Suppose the training set has 100 samples and we use 5-fold cross-validation: the 100 samples are split into 5 parts, with 4 used for training and 1 for validation. Each of the 5 parts serves as the validation set exactly once, so the model is trained and validated 5 times in total.

Here is an example:

import numpy as np
from sklearn.model_selection import KFold

# K-fold cross-validation on the training set
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, val_index in kf.split(np.arange(100)):
    print(train_index)
    print("Length:", len(train_index))
    print("-----------------------")
    print(val_index)
    print("Length:", len(val_index))
    print("**********************************************************************")

[ 1 2 3 5 6 7 8 9 11 13 14 15 16 17 19 20 21 23 24 25 26 27 28 29

32 34 35 36 37 38 40 41 42 43 46 47 48 49 50 51 52 54 55 56 57 58 59 60

61 62 63 64 65 66 67 68 69 71 72 74 75 78 79 81 82 84 85 86 87 88 89 91

92 93 94 95 96 97 98 99]

Length: 80

-----------------------

[ 0 4 10 12 18 22 30 31 33 39 44 45 53 70 73 76 77 80 83 90]

Length: 20

**********************************************************************

[ 0 1 2 3 4 6 7 8 10 12 13 14 17 18 19 20 21 22 23 24 25 27 29 30

31 32 33 34 36 37 38 39 41 43 44 45 46 48 49 50 51 52 53 54 56 57 58 59

60 61 62 63 64 67 68 70 71 73 74 75 76 77 78 79 80 81 82 83 84 86 87 89

90 91 92 94 95 97 98 99]

Length: 80

-----------------------

[ 5 9 11 15 16 26 28 35 40 42 47 55 65 66 69 72 85 88 93 96]

Length: 20

**********************************************************************

[ 0 1 2 4 5 9 10 11 12 14 15 16 18 20 21 22 23 26 28 29 30 31 32 33

35 37 39 40 41 42 43 44 45 46 47 48 50 51 52 53 54 55 56 57 58 59 60 61

63 65 66 67 68 69 70 71 72 73 74 75 76 77 79 80 82 83 84 85 86 87 88 90

91 92 93 94 96 97 98 99]

Length: 80

-----------------------

[ 3 6 7 8 13 17 19 24 25 27 34 36 38 49 62 64 78 81 89 95]

Length: 20

**********************************************************************

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

24 25 26 27 28 29 30 31 33 34 35 36 37 38 39 40 42 44 45 47 49 51 52 53

55 60 62 63 64 65 66 69 70 71 72 73 74 76 77 78 80 81 82 83 84 85 86 87

88 89 90 91 92 93 95 96]

Length: 80

-----------------------

[32 41 43 46 48 50 54 56 57 58 59 61 67 68 75 79 94 97 98 99]

Length: 20

**********************************************************************

[ 0 3 4 5 6 7 8 9 10 11 12 13 15 16 17 18 19 22 24 25 26 27 28 30

31 32 33 34 35 36 38 39 40 41 42 43 44 45 46 47 48 49 50 53 54 55 56 57

58 59 61 62 64 65 66 67 68 69 70 72 73 75 76 77 78 79 80 81 83 85 88 89

90 93 94 95 96 97 98 99]

Length: 80

-----------------------

[ 1 2 14 20 21 23 29 37 51 52 60 63 71 74 82 84 86 87 91 92]

Length: 20

**********************************************************************

So in deep learning, using cross-validation means partitioning the data and training and evaluating on each partition in turn. Usually the data is first split into training, validation and test sets. Cross-validating the training set means further splitting the training set into training and validation folds; cross-validating the test set, as in the example below, means splitting the test set into folds and evaluating on each fold in turn.

Here is a deep-learning example:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Subset, random_split
from torchvision import datasets, transforms
from sklearn.model_selection import KFold

# A simple fully connected network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Load the dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
full_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Split the data 7:1:2 into training, validation and test sets
num_train = int(len(full_dataset) * 0.7)
num_val = int(len(full_dataset) * 0.1)
num_test = len(full_dataset) - num_train - num_val
train_dataset, val_dataset, _ = random_split(full_dataset, [num_train, num_val, num_test])

# KFold objects
kf_train = KFold(n_splits=5, shuffle=True, random_state=42)
kf_test = KFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validation on the training set
val_accuracies = []
train_losses = []
val_losses = []

for train_index, val_index in kf_train.split(range(num_train)):
    train_kfold_subset = Subset(train_dataset, train_index)
    val_kfold_subset = Subset(train_dataset, val_index)
    train_loader = DataLoader(train_kfold_subset, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_kfold_subset, batch_size=32, shuffle=False)

    model = SimpleNN()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Train the model on this fold's training subset
    for epoch in range(10):
        model.train()
        epoch_train_loss = 0
        for data, target in train_loader:
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            epoch_train_loss += loss.item() * data.size(0)
        avg_train_loss = epoch_train_loss / len(train_loader.dataset)

    # Evaluate on this fold's validation subset
    model.eval()
    correct = 0
    total = 0
    epoch_val_loss = 0
    with torch.no_grad():
        for data, target in val_loader:
            output = model(data)
            loss = criterion(output, target)
            _, predicted = torch.max(output, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()
            epoch_val_loss += loss.item() * data.size(0)
    avg_val_loss = epoch_val_loss / len(val_loader.dataset)
    val_accuracy = correct / total

    val_accuracies.append(val_accuracy)
    train_losses.append(avg_train_loss)
    val_losses.append(avg_val_loss)
    print(f"Fold - Training Loss: {avg_train_loss:.4f}, Validation Loss: {avg_val_loss:.4f}, Validation Accuracy: {val_accuracy:.4f}")

# Average validation accuracy and losses across folds
average_val_accuracy = sum(val_accuracies) / len(val_accuracies)
average_train_loss = sum(train_losses) / len(train_losses)
average_val_loss = sum(val_losses) / len(val_losses)
print(f"Average Validation Accuracy: {average_val_accuracy:.4f}")
print(f"Average Training Loss: {average_train_loss:.4f}")
print(f"Average Validation Loss: {average_val_loss:.4f}")

# Cross-validation on the test set
test_accuracies = []
for train_index, test_index in kf_test.split(range(len(test_dataset))):
    test_kfold_subset = Subset(test_dataset, test_index)
    test_loader = DataLoader(test_kfold_subset, batch_size=32, shuffle=False)

    # Train a final model on the whole training set
    model = SimpleNN()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    for epoch in range(10):
        model.train()
        for data, target in train_loader:
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

    # Evaluate on this fold's test subset
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for data, target in test_loader:
            output = model(data)
            _, predicted = torch.max(output, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()
    test_accuracy = correct / total
    test_accuracies.append(test_accuracy)
    print(f"Test Accuracy: {test_accuracy:.4f}")

# Average test accuracy across folds
average_test_accuracy = sum(test_accuracies) / len(test_accuracies)
print(f"Average Test Accuracy: {average_test_accuracy:.4f}")

Explanation

Data preparation

Load the MNIST dataset, then split it 7:1:2 into training, validation and test sets.
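One detail worth noting: random_split draws a fresh random permutation on every run, so the 7:1:2 split above will differ between runs unless you pass a seeded generator. A minimal sketch (the list of integers is just a stand-in dataset for illustration):

```python
import torch
from torch.utils.data import random_split

# a stand-in dataset of 100 items, just for illustration
dataset = list(range(100))

# fix the seed so the 70/10/20 split is identical on every run
g = torch.Generator().manual_seed(42)
train_set, val_set, test_set = random_split(dataset, [70, 10, 20], generator=g)
print(len(train_set), len(val_set), len(test_set))  # 70 10 20
```

Reusing the same seed reproduces exactly the same index assignment, which matters when you want to compare experiments on identical splits.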

K-fold cross-validation

kf_train performs cross-validation on the training set; kf_test performs cross-validation on the test set.

Cross-validation on the training set

Run 5-fold cross-validation on the training set. For each fold, build a training subset from train_index and a validation subset from val_index, train the model on the training subset, and evaluate it on the validation subset. Record each fold's validation accuracy and average the results across folds.
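One thing the example glosses over: plain KFold ignores the labels, so on an imbalanced classification dataset a fold can end up with a skewed class mix. A common alternative (not used in the code above) is StratifiedKFold, which preserves class proportions in every fold; a minimal sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y))  # placeholder features; only the labels matter here

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    # each 20-sample validation fold keeps the 80/20 ratio: 16 zeros, 4 ones
    print(np.bincount(y[val_idx]))
```

For the MNIST example above this matters less, since its classes are roughly balanced, but on skewed datasets stratification gives each fold a representative validation set.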

Cross-validation on the test set

Run 5-fold cross-validation on the test set. For each fold, build a test subset from test_index, train a final model on the full training set, and evaluate it on the test subset. Record each fold's test accuracy and average the results across folds.

Notes

In practice, cross-validation is usually applied only to the training and validation data, and the test set is reserved for evaluating the final model. Cross-validating the test set adds computational cost, but it can give a more robust estimate of performance. Make sure the data in each fold is split correctly to avoid data leakage and overfitting.
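On the data-leakage point: a common pitfall is fitting preprocessing statistics (for example, feature normalization) on the full dataset before splitting, which lets validation information leak into training. A minimal sketch of the leakage-safe pattern, using sklearn's StandardScaler on a made-up feature matrix:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# hypothetical feature matrix: 100 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(X):
    # fit the scaler on the training fold only, then apply it to the
    # validation fold; fitting on all 100 samples would leak validation
    # statistics into training
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X[train_idx])
    X_val = scaler.transform(X[val_idx])
```

The same rule applies to the MNIST pipeline above: any per-fold statistics should be computed from that fold's training subset alone.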


