How to Use Cross-Validation in Deep Learning
梦里梦见梦不见的497 · 2024-09-20 09:31:01
In short, n-fold cross-validation splits the data into n parts: n−1 parts are used for training and the remaining 1 part for validation.
Say my training set has 100 samples and I use 5-fold cross-validation: the 100 samples are split into 5 parts, 4 of which train the model while 1 validates it. Each of the 5 parts serves as the validation set exactly once, so the data is "crossed" 5 times.
Here is an example:
import numpy as np
from sklearn.model_selection import KFold

# K-fold cross-validation on the training set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, val_index in kf.split(range(100)):
    print(train_index)
    print("Length:", len(train_index))
    print("-----------------------")
    print(val_index)
    print("Length:", len(val_index))
    print("**********************************************************************")
[ 1 2 3 5 6 7 8 9 11 13 14 15 16 17 19 20 21 23 24 25 26 27 28 29
32 34 35 36 37 38 40 41 42 43 46 47 48 49 50 51 52 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 71 72 74 75 78 79 81 82 84 85 86 87 88 89 91
92 93 94 95 96 97 98 99]
Length: 80
-----------------------
[ 0 4 10 12 18 22 30 31 33 39 44 45 53 70 73 76 77 80 83 90]
Length: 20
**********************************************************************
[ 0 1 2 3 4 6 7 8 10 12 13 14 17 18 19 20 21 22 23 24 25 27 29 30
31 32 33 34 36 37 38 39 41 43 44 45 46 48 49 50 51 52 53 54 56 57 58 59
60 61 62 63 64 67 68 70 71 73 74 75 76 77 78 79 80 81 82 83 84 86 87 89
90 91 92 94 95 97 98 99]
Length: 80
-----------------------
[ 5 9 11 15 16 26 28 35 40 42 47 55 65 66 69 72 85 88 93 96]
Length: 20
**********************************************************************
[ 0 1 2 4 5 9 10 11 12 14 15 16 18 20 21 22 23 26 28 29 30 31 32 33
35 37 39 40 41 42 43 44 45 46 47 48 50 51 52 53 54 55 56 57 58 59 60 61
63 65 66 67 68 69 70 71 72 73 74 75 76 77 79 80 82 83 84 85 86 87 88 90
91 92 93 94 96 97 98 99]
Length: 80
-----------------------
[ 3 6 7 8 13 17 19 24 25 27 34 36 38 49 62 64 78 81 89 95]
Length: 20
**********************************************************************
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 33 34 35 36 37 38 39 40 42 44 45 47 49 51 52 53
55 60 62 63 64 65 66 69 70 71 72 73 74 76 77 78 80 81 82 83 84 85 86 87
88 89 90 91 92 93 95 96]
Length: 80
-----------------------
[32 41 43 46 48 50 54 56 57 58 59 61 67 68 75 79 94 97 98 99]
Length: 20
**********************************************************************
[ 0 3 4 5 6 7 8 9 10 11 12 13 15 16 17 18 19 22 24 25 26 27 28 30
31 32 33 34 35 36 38 39 40 41 42 43 44 45 46 47 48 49 50 53 54 55 56 57
58 59 61 62 64 65 66 67 68 69 70 72 73 75 76 77 78 79 80 81 83 85 88 89
90 93 94 95 96 97 98 99]
Length: 80
-----------------------
[ 1 2 14 20 21 23 29 37 51 52 60 63 71 74 82 84 86 87 91 92]
Length: 20
**********************************************************************
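For scikit-learn estimators, this fold-by-fold loop can also be expressed in a single call with `cross_val_score`, which fits on each 80-sample split and scores on the held-out 20. This is only a sketch: the `LogisticRegression` model and the synthetic data from `make_classification` are placeholders, not part of the example above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for a real 100-sample training set
X, y = make_classification(n_samples=100, n_features=20, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
# One accuracy score per fold: fit on 80 samples, score on the held-out 20
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

This gives the same five train/validation splits as the manual loop, just with the fitting and scoring handled internally.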
So using cross-validation in deep learning follows the same idea: partition the data and train/evaluate on each partition in turn. Usually we split the data into a training set, a validation set, and a test set. Cross-validation on the training set means further splitting the training set into training and validation folds. "Cross-validation on the test set", as used below, means splitting the test set into folds and evaluating the model on each test fold in turn.
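One detail worth knowing when the task is classification: plain `KFold` ignores the labels, so with imbalanced classes some folds can end up with very different class ratios. `StratifiedKFold` keeps the class proportions roughly equal in every fold. A minimal sketch on made-up labels (90 samples of class 0, 10 of class 1):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Heavily imbalanced made-up labels: 90 of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = []
for train_idx, val_idx in skf.split(X, y):
    # Each 20-sample validation fold keeps the 9:1 ratio (18 vs 2)
    counts = np.bincount(y[val_idx])
    fold_counts.append(counts)
    print(counts)
```

Without stratification, a rare class could be entirely absent from some validation folds, making the fold scores misleading.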
Here is a deep-learning example:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Subset, random_split
from torchvision import datasets, transforms
from sklearn.model_selection import KFold

# A simple fully connected network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Load the dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
full_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Split the data 7:1:2 into training / validation / test sets
num_train = int(len(full_dataset) * 0.7)
num_val = int(len(full_dataset) * 0.1)
num_test = len(full_dataset) - num_train - num_val
train_dataset, val_dataset, _ = random_split(full_dataset, [num_train, num_val, num_test])

# KFold objects
kf_train = KFold(n_splits=5, shuffle=True, random_state=42)
kf_test = KFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validation on the training set
val_accuracies = []
train_losses = []
val_losses = []
for train_index, val_index in kf_train.split(range(num_train)):
    train_kfold_subset = Subset(train_dataset, train_index)
    val_kfold_subset = Subset(train_dataset, val_index)
    train_loader = DataLoader(train_kfold_subset, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_kfold_subset, batch_size=32, shuffle=False)

    model = SimpleNN()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Train the model
    for epoch in range(10):
        model.train()
        epoch_train_loss = 0
        for data, target in train_loader:
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            epoch_train_loss += loss.item() * data.size(0)
        avg_train_loss = epoch_train_loss / len(train_loader.dataset)

    # Validate the model
    model.eval()
    correct = 0
    total = 0
    epoch_val_loss = 0
    with torch.no_grad():
        for data, target in val_loader:
            output = model(data)
            loss = criterion(output, target)
            _, predicted = torch.max(output, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()
            epoch_val_loss += loss.item() * data.size(0)
    avg_val_loss = epoch_val_loss / len(val_loader.dataset)
    val_accuracy = correct / total
    val_accuracies.append(val_accuracy)
    train_losses.append(avg_train_loss)
    val_losses.append(avg_val_loss)
    print(f"Fold - Training Loss: {avg_train_loss:.4f}, Validation Loss: {avg_val_loss:.4f}, Validation Accuracy: {val_accuracy:.4f}")

# Average validation accuracy and losses across folds
average_val_accuracy = sum(val_accuracies) / len(val_accuracies)
average_train_loss = sum(train_losses) / len(train_losses)
average_val_loss = sum(val_losses) / len(val_losses)
print(f"Average Validation Accuracy: {average_val_accuracy:.4f}")
print(f"Average Training Loss: {average_train_loss:.4f}")
print(f"Average Validation Loss: {average_val_loss:.4f}")

# Cross-validation on the test set
test_accuracies = []
for _, test_index in kf_test.split(range(len(test_dataset))):
    test_kfold_subset = Subset(test_dataset, test_index)
    test_loader = DataLoader(test_kfold_subset, batch_size=32, shuffle=False)

    # Train a final model on the whole training set
    model = SimpleNN()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    for epoch in range(10):
        model.train()
        for data, target in train_loader:
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

    # Evaluate on the test fold
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for data, target in test_loader:
            output = model(data)
            _, predicted = torch.max(output, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()
    test_accuracy = correct / total
    test_accuracies.append(test_accuracy)
    print(f"Test Accuracy: {test_accuracy:.4f}")

# Average test accuracy across folds
average_test_accuracy = sum(test_accuracies) / len(test_accuracies)
print(f"Average Test Accuracy: {average_test_accuracy:.4f}")
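When reporting cross-validation results, it is common to give the standard deviation across folds alongside the mean, since a single average hides how much the folds disagree. A small helper sketch (the per-fold accuracies here are made-up numbers, not output from the code above):

```python
import numpy as np

def summarize_folds(scores):
    """Return (mean, std) of per-fold scores, for reporting as mean ± std."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std()

# Hypothetical per-fold test accuracies
test_accuracies = [0.971, 0.968, 0.974, 0.969, 0.972]
mean, std = summarize_folds(test_accuracies)
print(f"Test Accuracy: {mean:.4f} ± {std:.4f}")
```

A large standard deviation relative to the mean is itself a signal: the model's performance depends heavily on which samples land in which fold.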
Explanation
Data preparation:
Load the MNIST dataset and split it 7:1:2 into training, validation, and test sets.
K-fold cross-validation:
kf_train handles cross-validation on the training set; kf_test handles cross-validation on the test set.
Cross-validation on the training set:
Run 5-fold cross-validation on the training set. For each fold, build a training subset from train_index and a validation subset from val_index, train the model on the training subset, evaluate it on the validation subset, and record that fold's validation accuracy; finally, average the accuracies across folds.
Cross-validation on the test set:
Run 5-fold cross-validation on the test set. For each fold, build a test subset from test_index, train a final model on the whole training set, evaluate it on the test subset, and record that fold's test accuracy; finally, average the accuracies across folds.
Notes
In practice, cross-validation is mainly applied to the training and validation data, while the test set is reserved for evaluating the final model. Cross-validating the test set adds computational cost, but it can give a more robust performance estimate. Make sure the data in each fold is split correctly to avoid data leakage and overfitting.
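One common source of the data leakage mentioned above is fitting preprocessing (e.g. feature scaling) on the full dataset before splitting, so validation-fold statistics leak into training. Wrapping the scaler and model in a scikit-learn `Pipeline` refits the scaler inside each fold on that fold's training portion only. A sketch with synthetic data (the `LogisticRegression` model and `make_classification` data are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler is fit only on each fold's training portion,
# never on the held-out validation portion
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("fold accuracies:", scores)
```

The same principle applies in the PyTorch example: any normalization statistics computed from the data should come from the training folds only.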