Why do we need to call zero_grad() in PyTorch? - Q&A

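In PyTorch, we need to set the gradients to zero before starting backpropagation, because PyTorch accumulates gradients on subsequent backward passes.

In other words, the default behavior is to accumulate (i.e. sum) the gradients on every call to loss.backward().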
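The accumulation is easy to observe on a single parameter. The following is a minimal sketch, not part of the original answer; the tensor w and the toy loss 3 * w are made up for illustration, chosen so that the gradient of each backward pass is always 3.

```python
import torch

# A single parameter; the loss is 3 * w, so d(loss)/dw is always 3.
w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
grad_after_first = w.grad.item()

# A second backward pass without zeroing adds to the stored gradient.
(3 * w).backward()
grad_after_second = w.grad.item()

# Zeroing the stored gradient in place gives the next pass a clean slate.
w.grad.zero_()
(3 * w).backward()
grad_after_reset = w.grad.item()

print(grad_after_first, grad_after_second, grad_after_reset)  # 3.0 6.0 3.0
```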

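So ideally, at the start of each iteration of the training loop, you should zero out the gradients so that the parameter update is computed correctly. Otherwise, the gradient is a mix of the old gradient, which has already been used to update the parameters, and the newly computed one. It then points in some direction other than the intended one (toward the minimum, or toward the maximum for maximization objectives).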
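To see the "wrong direction" effect concretely, here is a small sketch under assumed settings (none of it is from the original answer): the helper run_sgd, the objective f(w) = w**2, the starting point 1.0 and the learning rate 0.4 are all hypothetical choices. It records, at each step, the analytic gradient 2w next to the gradient that the optimizer actually uses.

```python
import torch

def run_sgd(zero_each_step, steps=3, lr=0.4):
    """Minimize f(w) = w**2 from w = 1.0.

    Returns the final w and a list of (true_grad, used_grad) pairs,
    one per step. run_sgd is a hypothetical helper for illustration.
    """
    w = torch.tensor(1.0, requires_grad=True)
    optimizer = torch.optim.SGD([w], lr=lr)
    history = []
    for _ in range(steps):
        if zero_each_step:
            optimizer.zero_grad()
        loss = w ** 2
        loss.backward()
        true_grad = 2 * w.item()  # analytic gradient at the current w
        history.append((true_grad, w.grad.item()))
        optimizer.step()
    return w.item(), history

w_clean, hist_clean = run_sgd(zero_each_step=True)
w_stale, hist_stale = run_sgd(zero_each_step=False)
print("with zero_grad:   ", w_clean, hist_clean)
print("without zero_grad:", w_stale, hist_stale)
```

With zeroing, w shrinks toward the minimum at 0 (about 0.008 after three steps). Without it, the stored gradient at the third step is positive while the true gradient is negative, so the update pushes w further away from the minimum (to about -1.112).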

Here is a simple example:

import torch
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...
# torch.autograd.Variable is deprecated; plain tensors track gradients directly
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all tensors
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    # sum to a scalar, since backward() without arguments needs a scalar loss
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()

Or, if you are doing vanilla gradient descent:

learning_rate = 0.01  # illustrative value; not specified in the original
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of W and b;
    # .grad is None until the first backward pass
    if W.grad is not None:
        W.grad.zero_()
        b.grad.zero_()
    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()
    # update in place without recording the step in the autograd graph
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad

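Notes:

The accumulation (i.e. summation) of gradients happens when .backward() is called on the loss tensor.

As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True), instead of filling them with a tensor of zeros. This reduces memory usage and slightly improves performance, but it can be error-prone if not handled carefully, because any code that reads .grad must then cope with None.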
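The difference between the two reset modes can be checked directly. This is a minimal sketch, not from the original answer; the tensor w and the helper grad_norm are hypothetical. The flag is passed explicitly in both calls because its default differs between PyTorch versions (releases from 2.0 onward default to set_to_none=True).

```python
import torch

w = torch.randn(3, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# Mode 1: keep the gradient tensor and fill it with zeros.
w.sum().backward()
optimizer.zero_grad(set_to_none=False)
grad_is_zero_tensor = w.grad is not None and bool((w.grad == 0).all())

# Mode 2: drop the gradient tensor entirely.
w.sum().backward()
optimizer.zero_grad(set_to_none=True)
grad_is_none = w.grad is None

def grad_norm(param):
    """Hypothetical helper: read .grad safely when it may be None."""
    return 0.0 if param.grad is None else param.grad.norm().item()

print(grad_is_zero_tensor, grad_is_none, grad_norm(w))
```

The None check in grad_norm is the kind of careful handling the note above refers to: with set_to_none=True, an expression such as w.grad.norm() raises an AttributeError until the next backward pass repopulates .grad.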

Via: https://stackoverflow.com/a/48009142/14964791

Updated 2021-01-30 11:14

karry • 4540
