Feature Selection

Table of Contents

1. Feature Selection

An ANN learns weights for different features on its own, which effectively amounts to automatic feature extraction. For features that are completely meaningless, can it also filter them out by itself, i.e. perform feature selection?

Intuitively, feature selection should work the same way as feature extraction: both can be accomplished by learning weights.

1.1. Automatic Feature Selection with Larger Samples

Take the moons dataset as an example. There are 12 features in total; the last 10 are manually added features consisting of normally distributed random numbers, and only the first two are actually useful.

import numpy as np
from sklearn.datasets import make_moons
from torch.utils.data import Dataset, DataLoader
import torch

NOISES = 10       # number of irrelevant features appended to the data
N_SAMPLES = 2000
EPOCH = 1000

# 12 inputs (2 real + 10 noise) -> 10 hidden units -> 1 output
model = torch.nn.Sequential(
    torch.nn.Linear(2 + NOISES, 10), torch.nn.BatchNorm1d(10), torch.nn.Tanh(),
    torch.nn.Linear(10, 1), torch.nn.Sigmoid())


class MoonDataset(Dataset):
    def __init__(self):
        X, Y = make_moons(n_samples=N_SAMPLES, noise=0.2)
        r = np.random.randn(N_SAMPLES, NOISES)   # pure-noise features
        X = np.c_[X, r]                          # append them to the real features
        self.X = torch.from_numpy(X).float()
        self.Y = torch.from_numpy(Y).float().view(-1, 1)

    def __getitem__(self, index):
        return self.X[index], self.Y[index]

    def __len__(self):
        return len(self.X)


training_set = MoonDataset()
training_loader = DataLoader(training_set, batch_size=100)

# an independently generated dataset serves as the test set
test_set = MoonDataset()
test_loader = DataLoader(test_set, batch_size=N_SAMPLES)

criterion = torch.nn.BCELoss()

optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.001)  # weight_decay adds L2 regularization


def train():
    model.train()
    for i in range(EPOCH):
        for x, y in training_loader:
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    print("loss:", loss.item())


def test():
    model.eval()
    with torch.no_grad():
        # accuracy on one training batch
        for x, y in training_loader:
            y_hat = model(x) > 0.5
            accu = torch.sum(y_hat == y.bool()).item() / len(y)
            print("train:", accu)
            break

        # accuracy on the full test set
        for x, y in test_loader:
            y_hat = model(x) > 0.5
            accu = torch.sum(y_hat == y.bool()).item() / len(y)
            print("test:", accu)


train()
test()
# mean absolute first-layer weight for each of the 12 input features
print(torch.abs(model[0].weight).mean(dim=0))

loss: 0.0661696270108223
train: 0.98
test: 0.951
tensor([ 0.2313,  0.1568,  0.0413,  0.0546,  0.0506,  0.0507,  0.0411,  0.0503,
         0.0455,  0.0518,  0.0528,  0.0311])

As we can see, even though many of the features are noise, the result is still correct; among the weights the ANN has learned, those corresponding to the noise features are clearly much smaller.
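
As a minimal sketch (reusing the model trained above; the 0.1 threshold is a hypothetical value read off the printed tensor), these weights can be turned into an explicit feature-selection step:

# rank features by the mean absolute first-layer weight and keep the
# clearly dominant ones; the 0.1 threshold is hypothetical, read off
# the tensor printed above
importance = torch.abs(model[0].weight).mean(dim=0)
selected = [i for i, v in enumerate(importance.tolist()) if v > 0.1]
print("selected features:", selected)   # expected: [0, 1]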

1.2. Overfitting with Smaller Samples

The setup is identical except for the smaller N_SAMPLES; train() and test() from the previous section are reused.

import numpy as np
from sklearn.datasets import make_moons
from torch.utils.data import Dataset, DataLoader
import torch

NOISES = 10
N_SAMPLES = 200   # 10x fewer samples than in the previous section
EPOCH = 2000

model = torch.nn.Sequential(
    torch.nn.Linear(2 + NOISES, 10), torch.nn.BatchNorm1d(10), torch.nn.Tanh(),
    torch.nn.Linear(10, 1), torch.nn.Sigmoid())


class MoonDataset(Dataset):
    def __init__(self):
        X, Y = make_moons(n_samples=N_SAMPLES, noise=0.2)
        r = np.random.randn(N_SAMPLES, NOISES)
        X = np.c_[X, r]
        self.X = torch.from_numpy(X).float()
        self.Y = torch.from_numpy(Y).float().view(-1, 1)

    def __getitem__(self, index):
        return self.X[index], self.Y[index]

    def __len__(self):
        return len(self.X)


training_set = MoonDataset()
training_loader = DataLoader(training_set, batch_size=100)

test_set = MoonDataset()
test_loader = DataLoader(test_set, batch_size=N_SAMPLES)

criterion = torch.nn.BCELoss()

optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.001)

train()
test()
# mean absolute first-layer weight for each of the 12 input features
print(torch.abs(model[0].weight).mean(dim=0))

loss: 0.028859596699476242
train: 1.0
test: 0.85
tensor([ 0.0490,  0.1316,  0.0637,  0.0618,  0.0308,  0.0546,  0.0256,  0.0412,
         0.0591,  0.0451,  0.0409,  0.0325])

  1. With the smaller sample, the ANN clearly fails to single out the useless features.
  2. Training accuracy is 1.0 while test accuracy is only 0.85, a clear sign of overfitting.

So an ANN with enough samples can select features by itself, but when samples are insufficient it easily learns spurious correlations with irrelevant features (mistaking special cases for general rules), which leads to overfitting. The sketch below illustrates why smaller samples make such spurious correlations more likely.
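
A minimal sketch (independent of the experiment above, with hypothetical sample sizes): the empirical correlation between a pure-noise feature and the labels shrinks roughly like \(1/\sqrt{N}\), so with fewer samples the noise looks more informative than it is.

import numpy as np

rng = np.random.default_rng(0)
for n in (200, 2000, 20000):
    noise = rng.standard_normal((n, 10))   # 10 irrelevant features
    y = rng.integers(0, 2, n)              # labels independent of the noise
    corr = [abs(np.corrcoef(noise[:, j], y)[0, 1]) for j in range(10)]
    print(n, "max |corr| between noise and label:", round(max(corr), 3))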

1.3. Noise Features

With a large enough sample, an ANN can automatically discard noise features. But what kind of feature counts as noise?

Intuitively, a random feature should be noise, e.g. the features generated by the randn and rand functions.

The following example has two features (x1, x2), and the ground truth is 200*x1, i.e. x2 is a useless feature; let's see whether the ANN can discover this. If it does, w1 should approach 200 and w2 should approach 0.

1.3.1. Random x1, x2

import torch
from itertools import count

torch.manual_seed(1000)
w1, w2 = torch.randn(1), torch.randn(1)
w1.requires_grad = True
w2.requires_grad = True

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD([w1, w2], lr=1e-3)

N_SAMPLES = 2


def train():
    for i in count():
        y_hat = w1 * x1 + w2 * x2
        y = 200 * x1            # ground truth: x2 contributes nothing
        loss = criterion(y_hat, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < 0.02:
            break
    print("loss:", loss.item(), " epoch: ", i)


x1 = torch.rand(N_SAMPLES, 1)
x2 = torch.rand(N_SAMPLES, 1)
train()
print(w1, w2)

loss: 0.019999535754323006  epoch:  118101
tensor([ 199.3793]) tensor([ 0.7613])

1.3.2. More Samples

import torch
from itertools import count

torch.manual_seed(1000)
w1, w2 = torch.randn(1), torch.randn(1)
w1.requires_grad = True
w2.requires_grad = True

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD([w1, w2], lr=1e-3)

N_SAMPLES = 200


def train():
    for i in count():
        y_hat = w1 * x1 + w2 * x2
        y = 200 * x1
        loss = criterion(y_hat, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < 0.02:
            break
    print("loss:", loss.item(), " epoch: ", i)


x1 = torch.rand(N_SAMPLES, 1)
x2 = torch.rand(N_SAMPLES, 1)
train()
print(w1, w2)

loss: 0.01999880000948906  epoch:  30926
tensor([ 199.6702]) tensor([ 0.3272])

1.3.3. Correlated x1, x2

import torch
from itertools import count

torch.manual_seed(1000)
w1, w2 = torch.randn(1), torch.randn(1)
w1.requires_grad = True
w2.requires_grad = True

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD([w1, w2], lr=1e-3)

N_SAMPLES = 200


def train():
    for i in count():
        y_hat = w1 * x1 + w2 * x2
        y = 200 * x1
        loss = criterion(y_hat, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < 0.02:
            break
    print("loss:", loss.item(), " epoch: ", i)


x1 = torch.rand(N_SAMPLES, 1)
x2 = x1 * 2   # x2 is perfectly correlated with x1
train()
print(w1, w2)

loss: 0.019913002848625183  epoch:  1968
tensor([ 39.1714]) tensor([ 80.2940])

1.3.4. Analysis

Assume N_SAMPLES=2 and write \(x_i^j\) for feature j of sample i. Dropping the MSE's constant factor 1/2:

\(loss=MSE(\hat{y},y)=\big((w_1-200)x_1^1+w_2x_1^2\big)^2+\big((w_1-200)x_2^1+w_2x_2^2\big)^2\\ =(w_1-200)^2\big((x_1^1)^2+(x_2^1)^2\big)+w_2^2\big((x_1^2)^2+(x_2^2)^2\big)+2(w_1-200)w_2\big(x_1^1x_1^2+x_2^1x_2^2\big)\)

Consider \(f(x,y)=x^2+y^2+kxy\). By taking derivatives, or by inspecting the function's graph (e.g. with academo), one finds:

  1. When k=2, \(f=(x+y)^2\) attains its minimum 0 along the entire line \(y=-x\); when k=-2, \(f=(x-y)^2\) does so along \(y=x\)
  2. When -2<k<2, the function has a unique minimum at (0,0)
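
Both facts follow from completing the square:

\(f(x,y)=x^2+y^2+kxy=\big(x+\frac{k}{2}y\big)^2+\big(1-\frac{k^2}{4}\big)y^2\)

For \(|k|<2\) the second coefficient is positive, so \(f>0\) everywhere except at \((0,0)\); for \(k=\pm2\) the second term vanishes and \(f\) is 0 along the entire line \(x=-\frac{k}{2}y\).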

So (rescaling \(w_1-200\) and \(w_2\) by the two square roots to reduce the loss to the form of \(f\) above) the loss has a unique minimum at \((w_1,w_2)=(200,0)\) exactly when \(-1 < \frac{x_1^1x_1^2+x_2^1x_2^2}{\sqrt{(x_1^1)^2+(x_2^1)^2}\sqrt{(x_1^2)^2+(x_2^2)^2}} < 1\)

And this expression is exactly the cosine similarity of the two vectors \((x_1^1,x_2^1)\) and \((x_1^2,x_2^2)\), i.e. of the two feature columns viewed as vectors over the samples.

  1. If, over all samples, a noise feature does not point in exactly the same (or exactly opposite) direction as another feature, the ANN can drive that feature's weight to zero and discard it.
  2. The more samples there are, the easier the noise is to detect: with more samples the vectors entering the cosine similarity are longer, and their similarity is less likely to be (close to) 1, as the sketch below illustrates.
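
A minimal sketch checking this argument on the three setups above; cos_sim is an assumed helper, equivalent to torch.nn.functional.cosine_similarity on flattened vectors:

import torch

torch.manual_seed(1000)


def cos_sim(a, b):
    # cosine similarity of two feature columns, flattened over samples
    return torch.dot(a.flatten(), b.flatten()) / (a.norm() * b.norm())


x1, x2 = torch.rand(2, 1), torch.rand(2, 1)
print("N=2,   random x2:", cos_sim(x1, x2).item())       # often close to 1

x1, x2 = torch.rand(200, 1), torch.rand(200, 1)
print("N=200, random x2:", cos_sim(x1, x2).item())       # bounded away from 1

print("N=200, x2 = 2*x1:", cos_sim(x1, 2 * x1).item())   # exactly 1.0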

1.4. Conclusion

Given enough samples, an ANN can carry out feature extraction and feature selection on its own. In practice, manual feature extraction and selection may only be needed when samples are scarce or when there are too many features.

Author: [email protected]
Date: 2018-08-17 Fri 00:00
Last updated: 2022-01-03 Mon 10:55

Creative Commons License