Feature Selection
1. Feature Selection
An ANN can learn weights for its input features on its own, which amounts to performing feature extraction automatically. But what about features that carry no information at all: can it filter those out by itself, thereby also performing feature selection?
Intuitively, feature selection should work the same way as feature extraction: both come down to learning appropriate weights.
1.1. Automatic Feature Selection with Larger Samples
Take the moons dataset as an example. There are 12 features in total; the last 10 are manually added features consisting of normally distributed random numbers. Only the first two features are genuinely useful.
import numpy as np
from sklearn.datasets import make_moons
from torch.utils.data import Dataset, DataLoader
import torch

NOISES = 10
N_SAMPLES = 2000
EPOCH = 1000

model = torch.nn.Sequential(
    torch.nn.Linear(2 + NOISES, 10),
    torch.nn.BatchNorm1d(10),
    torch.nn.Tanh(),
    torch.nn.Linear(10, 1),
    torch.nn.Sigmoid())


class MoonDataset(Dataset):
    def __init__(self):
        X, Y = make_moons(n_samples=N_SAMPLES, noise=0.2)
        # append NOISES columns of standard-normal noise features
        r = np.random.randn(N_SAMPLES, NOISES)
        X = np.c_[(X, r)]
        self.X = torch.from_numpy(X).float()
        self.Y = torch.from_numpy(Y).float().view(-1, 1)

    def __getitem__(self, index):
        return self.X[index], self.Y[index]

    def __len__(self):
        return len(self.X)


training_set = MoonDataset()
training_loader = DataLoader(training_set, batch_size=100)
test_set = MoonDataset()
test_loader = DataLoader(test_set, batch_size=N_SAMPLES)

criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.001)


def train():
    model.train()
    for i in range(EPOCH):
        for x, y in training_loader:
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # if i % 20 == 0: print("loss:", loss.item())
    print("loss:", loss.item())


def test():
    model.eval()
    for x, y in training_loader:
        y_hat = model(x)
        y_hat = y_hat > 0.5
        accu = torch.sum(y_hat == y.byte()).item() / 100  # first batch only (batch_size=100)
        print("train:", accu)
        break
    for x, y in test_loader:
        y_hat = model(x)
        y_hat = y_hat > 0.5
        accu = torch.sum(y_hat == y.byte()).item() / N_SAMPLES
        print("test:", accu)


train()
test()
print(torch.abs(model[0].weight).mean(dim=0))
loss: 0.0661696270108223
train: 0.98
test: 0.951
tensor([ 0.2313,  0.1568,  0.0413,  0.0546,  0.0506,  0.0507,  0.0411,  0.0503,
         0.0455,  0.0518,  0.0528,  0.0311])
As you can see, even though most of the features are noise, the result is still correct. Among the weights the ANN has learned, the ones corresponding to the noise features are clearly much smaller.
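This also suggests a simple selection rule on top of the trained network: keep only the features whose mean absolute first-layer weight exceeds a threshold. A minimal sketch of the idea (my addition; the threshold 0.1 is an arbitrary illustrative choice based on the output above):

import torch

# one importance score per input feature, from the trained model above
scores = torch.abs(model[0].weight).mean(dim=0)
selected = torch.nonzero(scores > 0.1).flatten()
print("selected features:", selected.tolist())  # expect [0, 1] here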
1.2. Overfitting with Smaller Samples
import numpy as np
from sklearn.datasets import make_moons
from torch.utils.data import Dataset, DataLoader
import torch

NOISES = 10
N_SAMPLES = 200
EPOCH = 2000

model = torch.nn.Sequential(
    torch.nn.Linear(2 + NOISES, 10),
    torch.nn.BatchNorm1d(10),
    torch.nn.Tanh(),
    torch.nn.Linear(10, 1),
    torch.nn.Sigmoid())


class MoonDataset(Dataset):
    def __init__(self):
        X, Y = make_moons(n_samples=N_SAMPLES, noise=0.2)
        r = np.random.randn(N_SAMPLES, NOISES)
        X = np.c_[(X, r)]
        self.X = torch.from_numpy(X).float()
        self.Y = torch.from_numpy(Y).float().view(-1, 1)

    def __getitem__(self, index):
        return self.X[index], self.Y[index]

    def __len__(self):
        return len(self.X)


training_set = MoonDataset()
training_loader = DataLoader(training_set, batch_size=100)
test_set = MoonDataset()
test_loader = DataLoader(test_set, batch_size=N_SAMPLES)

criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.001)

# train() and test() are the functions defined in the previous section
train()
test()
print(torch.abs(model[0].weight).mean(dim=0))
loss: 0.028859596699476242
train: 1.0
test: 0.85
tensor([ 0.0490,  0.1316,  0.0637,  0.0618,  0.0308,  0.0546,  0.0256,  0.0412,
         0.0591,  0.0451,  0.0409,  0.0325])
- With the smaller sample, the ANN clearly fails to single out the useless features: the weight for feature 0 (0.0490) is no larger than those of several noise features.
- Training accuracy is 1.0 while test accuracy is only 0.85, a clear sign of overfitting.
Evidently, with enough samples an ANN can select features on its own; but when samples are insufficient, it easily learns spurious correlations with irrelevant features (mistaking special cases for general rules), which leads to overfitting.
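A common remedy in this regime (my addition, not part of the original experiments) is to add an L1 penalty on the first layer's weights, which pushes the weights of irrelevant features toward exactly zero instead of merely keeping them small. A sketch of such a training loop, reusing the definitions above; L1_LAMBDA is an assumed penalty strength that would need tuning:

L1_LAMBDA = 0.01  # assumed value, would need tuning


def train_l1():
    model.train()
    for i in range(EPOCH):
        for x, y in training_loader:
            # BCE loss plus an L1 penalty on the first Linear layer
            loss = criterion(model(x), y) \
                + L1_LAMBDA * model[0].weight.abs().sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    print("loss:", loss.item())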
1.3. Noise Features
With larger samples, an ANN can discard noise features automatically. So what exactly makes a feature a noise feature?
Intuitively, random features should be noise, e.g., features generated by the randn and rand functions.
The following example has two features (x1, x2), and the ground truth is 200*x1; that is, x2 is a useless feature. Let's see whether the ANN can discover this. If it can, w1 should end up close to 200 and w2 close to 0.
1.3.1. Random x1 and x2
import torch
from itertools import count

torch.manual_seed(1000)

w1, w2 = torch.randn(1), torch.randn(1)
w1.requires_grad = True
w2.requires_grad = True

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD([w1, w2], lr=1e-3)

N_SAMPLES = 2


def train():
    for i in count():
        y_hat = w1 * x1 + w2 * x2
        y = 200 * x1
        loss = criterion(y_hat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < 0.02:
            break
    print("loss:", loss.item(), " epoch: ", i)


x1 = torch.rand(N_SAMPLES, 1)
x2 = torch.rand(N_SAMPLES, 1)
train()
print(w1, w2)
loss: 0.019999535754323006  epoch:  118101
tensor([ 199.3793]) tensor([ 0.7613])
1.3.2. More Samples
import torch
from itertools import count

torch.manual_seed(1000)

w1, w2 = torch.randn(1), torch.randn(1)
w1.requires_grad = True
w2.requires_grad = True

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD([w1, w2], lr=1e-3)

N_SAMPLES = 200


def train():
    for i in count():
        y_hat = w1 * x1 + w2 * x2
        y = 200 * x1
        loss = criterion(y_hat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < 0.02:
            break
    print("loss:", loss.item(), " epoch: ", i)


x1 = torch.rand(N_SAMPLES, 1)
x2 = torch.rand(N_SAMPLES, 1)
train()
print(w1, w2)
loss: 0.01999880000948906  epoch:  30926
tensor([ 199.6702]) tensor([ 0.3272])
1.3.3. Correlated x1 and x2
import torch
from itertools import count

torch.manual_seed(1000)

w1, w2 = torch.randn(1), torch.randn(1)
w1.requires_grad = True
w2.requires_grad = True

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD([w1, w2], lr=1e-3)

N_SAMPLES = 200


def train():
    for i in count():
        y_hat = w1 * x1 + w2 * x2
        y = 200 * x1
        loss = criterion(y_hat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < 0.02:
            break
    print("loss:", loss.item(), " epoch: ", i)


x1 = torch.rand(N_SAMPLES, 1)
x2 = x1 * 2
train()
print(w1, w2)
loss: 0.019913002848625183  epoch:  1968
tensor([ 39.1714]) tensor([ 80.2940])
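This is not a failure to converge. Since x2 = 2*x1, the model computes w1*x1 + w2*x2 = (w1 + 2*w2)*x1, so every weight pair with w1 + 2*w2 = 200 is a global minimum; the result above indeed satisfies 39.1714 + 2 * 80.2940 ≈ 199.76. Gradient descent simply stopped at one of infinitely many minima, and the redundant feature was not singled out.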
1.3.4. Analysis
Assume N_SAMPLES = 2, and write \(x_i^j\) for the value of feature j on sample i. Up to MSELoss's constant averaging factor, which does not affect where the minimum lies, \(loss=MSE(\hat{y},y)=\big((w_1-200)x_1^1+w_2x_1^2\big)^2+\big((w_1-200)x_2^1+w_2x_2^2\big)^2\\ =(w_1-200)^2\big((x_1^1)^2+(x_2^1)^2\big)+w_2^2\big((x_1^2)^2+(x_2^2)^2\big)+2(w_1-200)w_2\big(x_1^1x_1^2+x_2^1x_2^2\big)\)
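As a quick sanity check on this expansion (my addition, not in the original), both sides can be compared numerically for random values:

import torch

torch.manual_seed(0)
x11, x21, x12, x22 = torch.rand(4)  # x11 = sample 1, feature 1, etc.
w1, w2 = torch.rand(2)

lhs = ((w1 - 200) * x11 + w2 * x12) ** 2 + ((w1 - 200) * x21 + w2 * x22) ** 2
rhs = ((w1 - 200) ** 2 * (x11 ** 2 + x21 ** 2)
       + w2 ** 2 * (x12 ** 2 + x22 ** 2)
       + 2 * (w1 - 200) * w2 * (x11 * x12 + x21 * x22))
print(torch.allclose(lhs, rhs))  # True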
Consider \(f(x,y)=x^2+y^2+kxy\). By taking derivatives and inspecting the function's surface (e.g., with the academo plotter), one finds:
- when k = 2 (or k = -2), the minimum value 0 is attained along the entire line \(y=-x\) (respectively \(y=x\)), so the minimizer is not unique
- when -2 < k < 2, the function has a unique minimum at (0,0)
Substituting \(u=(w_1-200)\sqrt{(x_1^1)^2+(x_2^1)^2}\) and \(v=w_2\sqrt{(x_1^2)^2+(x_2^2)^2}\) turns the loss above into exactly this form, with k equal to twice the ratio below. So when \(-1 < \frac{x_1^1x_1^2+x_2^1x_2^2}{\sqrt{(x_1^1)^2+(x_2^1)^2}\sqrt{(x_1^2)^2+(x_2^2)^2}} < 1\), the loss has a unique minimum at \((w_1,w_2)=(200,0)\).
And this ratio is precisely the cosine similarity of the two feature vectors \((x_1^1,x_2^1)\) and \((x_1^2,x_2^2)\).
- If, across all the samples, a noise feature's vector does not point in exactly the same (or opposite) direction as the other feature's, the ANN can discard that noise feature.
- The more samples there are, the easier the noise is to detect: more samples make the vectors in the cosine-similarity computation longer, and their similarity is then less likely to be 1 (a numerical illustration follows this list).
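A quick numerical check of this point (my addition, not in the original): for two independent rand vectors, the cosine similarity stays clearly below 1 as the number of samples grows. Because rand draws from the non-negative range [0, 1), it concentrates around 0.75 rather than 0 (zero-mean randn features would concentrate around 0); what matters for the analysis above is only that it stays away from 1. For x2 = 2 * x1 it is exactly 1:

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# independent uniform features: similarity bounded away from 1
for n in (2, 20, 200, 2000):
    x1 = torch.rand(n)
    x2 = torch.rand(n)
    print(n, F.cosine_similarity(x1, x2, dim=0).item())

# perfectly correlated features (section 1.3.3): similarity is exactly 1,
# so the loss minimizer is no longer unique
x1 = torch.rand(200)
print("x2 = 2*x1:", F.cosine_similarity(x1, 2 * x1, dim=0).item())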
1.4. Conclusion
As long as there are enough samples, an ANN can carry out feature extraction and feature selection automatically. In practice, manual feature extraction and selection is likely to be needed only when samples are insufficient or when there are too many features.