Feature Engineering

  • Feature engineering: combining the given features to derive new ones in order to improve the model's accuracy

Data preparation

  • Features: the perch's length, height, and thickness
  • Target: the perch's weight
import pandas as pd
df = pd.read_csv("https://bit.ly/perch_csv_data")
perch_full = df.to_numpy()
perch_full
array([[ 8.4 ,  2.11,  1.41],
       [13.7 ,  3.53,  2.  ],
       [15.  ,  3.82,  2.43],
       [16.2 ,  4.59,  2.63],
       [17.4 ,  4.59,  2.94],
       [18.  ,  5.22,  3.32],
       [18.7 ,  5.2 ,  3.12],
       [19.  ,  5.64,  3.05],
       [19.6 ,  5.14,  3.04],
       [20.  ,  5.08,  2.77],
       [21.  ,  5.69,  3.56],
       [21.  ,  5.92,  3.31],
       [21.  ,  5.69,  3.67],
       [21.3 ,  6.38,  3.53],
       [22.  ,  6.11,  3.41],
       [22.  ,  5.64,  3.52],
       [22.  ,  6.11,  3.52],
       [22.  ,  5.88,  3.52],
       [22.  ,  5.52,  4.  ],
       [22.5 ,  5.86,  3.62],
       [22.5 ,  6.79,  3.62],
       [22.7 ,  5.95,  3.63],
       [23.  ,  5.22,  3.63],
       [23.5 ,  6.28,  3.72],
       [24.  ,  7.29,  3.72],
       [24.  ,  6.38,  3.82],
       [24.6 ,  6.73,  4.17],
       [25.  ,  6.44,  3.68],
       [25.6 ,  6.56,  4.24],
       [26.5 ,  7.17,  4.14],
       [27.3 ,  8.32,  5.14],
       [27.5 ,  7.17,  4.34],
       [27.5 ,  7.05,  4.34],
       [27.5 ,  7.28,  4.57],
       [28.  ,  7.82,  4.2 ],
       [28.7 ,  7.59,  4.64],
       [30.  ,  7.62,  4.77],
       [32.8 , 10.03,  6.02],
       [34.5 , 10.26,  6.39],
       [35.  , 11.49,  7.8 ],
       [36.5 , 10.88,  6.86],
       [36.  , 10.61,  6.74],
       [37.  , 10.84,  6.26],
       [37.  , 10.57,  6.37],
       [39.  , 11.14,  7.49],
       [39.  , 11.14,  6.  ],
       [39.  , 12.43,  7.35],
       [40.  , 11.93,  7.11],
       [40.  , 11.73,  7.22],
       [40.  , 12.38,  7.46],
       [40.  , 11.14,  6.63],
       [42.  , 12.8 ,  6.87],
       [43.  , 11.93,  7.28],
       [43.  , 12.51,  7.42],
       [43.5 , 12.6 ,  8.14],
       [44.  , 12.49,  7.6 ]])
import numpy as np
perch_weight = np.array([5.9, 32.0, 40.0, 51.5, 70.0, 100.0, 78.0, 80.0, 85.0, 85.0, 110.0,
       115.0, 125.0, 130.0, 120.0, 120.0, 130.0, 135.0, 110.0, 130.0,
       150.0, 145.0, 150.0, 170.0, 225.0, 145.0, 188.0, 180.0, 197.0,
       218.0, 300.0, 260.0, 265.0, 250.0, 250.0, 300.0, 320.0, 514.0,
       556.0, 840.0, 685.0, 700.0, 700.0, 690.0, 900.0, 650.0, 820.0,
       850.0, 900.0, 1015.0, 820.0, 1100.0, 1000.0, 1100.0, 1000.0,
       1000.0])

Splitting into training and test sets

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
train_input, test_input, train_target, test_target = train_test_split(
    perch_full, perch_weight, random_state=42
)
lr = LinearRegression()
lr.fit(train_input, train_target)
LinearRegression()
lr.score(train_input, train_target)
0.9559326821885706
lr.score(test_input, test_target)
0.8796419177546367
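
The values returned by `score` for a regressor are the coefficient of determination, R^2 = 1 - SS_res / SS_tot. As a minimal sketch (the helper `r2_score_manual` and the toy numbers are illustrative assumptions, not part of the original notes), it can be computed by hand:

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]
perfect = r2_score_manual(y_true, [1.0, 2.0, 3.0])    # perfect predictions
mean_only = r2_score_manual(y_true, [2.0, 2.0, 2.0])  # always predict the mean
reversed_ = r2_score_manual(y_true, [3.0, 2.0, 1.0])  # worse than the mean
print(perfect, mean_only, reversed_)  # 1.0 0.0 -3.0
```

A score of 0 corresponds to always predicting the mean of the targets, so a negative score means the model does worse than that constant baseline.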

Creating new features

  • New features can be generated with the transformers that scikit-learn provides
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(include_bias=False)
poly.fit(train_input)
train_poly = poly.transform(train_input)

Six additional features are generated (3 inputs become 9 features)

train_poly.shape
(42, 9)
  • get_feature_names_out()
  • Shows which combination of inputs each feature was built from
  • get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2
list(poly.get_feature_names_out())

['x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2', 'x2^2']
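
To make the expansion concrete, here is a rough sketch that builds the same kind of terms by hand for a single sample. The helper `poly_expand` is hypothetical and written only for illustration; for degree 2 it produces the terms in the same order as the names listed above.

```python
from itertools import combinations_with_replacement

def poly_expand(sample, degree=2):
    """Return all products of the inputs up to `degree`, without the bias term."""
    features = []
    for d in range(1, degree + 1):
        # Each index tuple picks which inputs are multiplied together.
        for idx in combinations_with_replacement(range(len(sample)), d):
            term = 1
            for i in idx:
                term *= sample[i]
            features.append(term)
    return features

expanded = poly_expand([2, 3])
print(expanded)  # [2, 3, 4, 6, 9] -> x0, x1, x0^2, x0 x1, x1^2
```

With three inputs, as in the perch data, the same helper returns nine values per sample.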

Training the multiple regression model

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train_poly, train_target)
LinearRegression()
  • The score is very high
  • As the number of features grows, linear regression becomes much more expressive
lr.score(train_poly, train_target)
0.9903183436982124
test_poly = poly.transform(test_input)
print(lr.score(test_poly, test_target))
0.9714559911594134
  • Let's add even more features
  • The degree parameter sets the maximum degree of the polynomial terms
poly = PolynomialFeatures(degree=5, include_bias=False)
poly.fit(train_input)
train_poly = poly.transform(train_input)
test_poly = poly.transform(test_input)
print(train_poly.shape)
(42, 55)
lr2 = LinearRegression()
lr2.fit(train_poly, train_target)
print(lr2.score(train_poly, train_target))
0.9999999999991097
  • The test set score comes out negative
  • The linear model has become too powerful and overfits the training set
  • To reduce overfitting, the number of features has to be reduced
lr2.score(test_poly, test_target)
-144.40579242684848
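
The jump from 9 to 55 columns follows from counting every product of the inputs up to the chosen degree. A small sketch (the helper name `n_poly_features` is an assumption made for illustration) reproduces both shapes seen above:

```python
from math import comb

def n_poly_features(n_inputs, degree):
    """Number of polynomial terms up to `degree`, excluding the bias column."""
    return comb(n_inputs + degree, degree) - 1

n_deg2 = n_poly_features(3, 2)
n_deg5 = n_poly_features(3, 5)
print(n_deg2)  # 9
print(n_deg5)  # 55
```

With 55 features and only 42 training samples, a least-squares fit has enough freedom to reproduce the training targets almost exactly, which is consistent with the near-perfect training score and the collapse on the test set.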

Regularization

  • Reference : https://light-tree.tistory.com/125

  • Regularization hinders a machine learning model from learning the training data too closely
  • In other words, it keeps the model from overfitting to the train set
  • Linear regression models with regularization added : Ridge, Lasso
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(train_poly)
train_scaled = scaler.transform(train_poly)
test_scaled = scaler.transform(test_poly)
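
As a rough illustration of what the scaler does to one column, the sketch below standardizes toy values by hand. The numbers are made up; it assumes, as `StandardScaler` does by default, that the mean and the population standard deviation are learned from the training set and then reused for the test set.

```python
from statistics import mean, pstdev

train_col = [1.0, 2.0, 3.0]   # toy training values for one feature
test_col = [4.0]              # toy test value for the same feature

mu = mean(train_col)          # learned from the training set only
sigma = pstdev(train_col)     # population standard deviation (ddof=0)

train_std = [(x - mu) / sigma for x in train_col]
test_std = [(x - mu) / sigma for x in test_col]   # reuse training statistics
print(train_std)
print(test_std)
```

Scaling matters here because the Ridge and Lasso penalties act on coefficient size, and coefficient size depends on the scale of each feature.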

Norm

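  • norm : a way (or function) of measuring the size (or length) of a vector
  • L2 norm : the straight-line (Euclidean) distance, drawn in green in the original figure
  • L1 norm : the Manhattan distance, i.e. every path other than the green one in the original figure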
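
A minimal sketch of the two norms for a small vector (the helper names are illustrative):

```python
import math

def l1_norm(v):
    """Sum of the absolute values of the components."""
    return sum(abs(x) for x in v)

def l2_norm(v):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

v = [3, -4]
print(l1_norm(v))  # 7
print(l2_norm(v))  # 5.0
```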

L2 loss

  • The sum of squared differences between the actual and predicted values (the squared L2 norm of the error vector)

L1 loss

  • The sum of the absolute differences between the actual and predicted values
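
Both losses can be checked on a few toy values (the numbers below are made up for illustration):

```python
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 7.0, 2.0]

errors = [t - p for t, p in zip(y_true, y_pred)]  # [1.0, -2.0, 0.0]
l1_loss = sum(abs(e) for e in errors)   # sum of absolute errors
l2_loss = sum(e ** 2 for e in errors)   # sum of squared errors
print(l1_loss, l2_loss)  # 3.0 5.0
```

The L2 loss weights the larger error more heavily, because each error is squared before summing.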

L1 Regularization

  • Adds the absolute values of the weights to the cost function
  • The model is trained so that the weights do not grow large
  • If the regularization coefficient (alpha in scikit-learn) is 0, there is no regularization effect
  • Lasso is the corresponding model

L2 Regularization

  • Adds the squares of the weights to the cost function
  • As with L1 Regularization, the model is trained so that the weights do not grow large
  • Ridge is the corresponding model
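
For reference, the two penalized objectives can be written out as follows. This follows the form given in scikit-learn's documentation for `Ridge` and `Lasso`, as far as I understand it; the original figures may have used a different scaling of the loss or of the penalty term.

```latex
\begin{align*}
\text{Ridge (L2):}\quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso (L1):}\quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{align*}
```

Here $w$ is the coefficient vector, $n$ the number of samples, and $\alpha \ge 0$ the regularization strength; with $\alpha = 0$ both reduce to ordinary least squares.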

Ridge

  • Applies the penalty based on the squared values of the coefficients
from sklearn.linear_model import Ridge

ridge=Ridge()
ridge.fit(train_scaled, train_target)
print(ridge.score(train_scaled, train_target))
0.9896101671037343
ridge.score(test_scaled, test_target)
0.9790693977615391
  • The amount of regularization can be adjusted freely
  • The alpha parameter controls the regularization strength
  • A larger alpha means stronger regularization -> the coefficients shrink further and the model is pushed toward underfitting

  • Finding a suitable alpha : plot the R^2 score against the alpha value
  • The best alpha is where the train set and test set scores are closest while the test score stays high
import matplotlib.pyplot as plt

train_score = list()
test_score = list()

alpha_list = [0.001, 0.01, 0.1, 1, 10, 100]

for alpha in alpha_list:
    ridge = Ridge(alpha = alpha)
    ridge.fit(train_scaled, train_target)
    
    train_score.append(ridge.score(train_scaled, train_target))
    test_score.append(ridge.score(test_scaled, test_target))
plt.plot(np.log10(alpha_list), train_score)
plt.plot(np.log10(alpha_list), test_score)

plt.xlabel('log10(alpha)')
plt.ylabel('score')
plt.show()

Figure: training and test R^2 scores plotted against log10(alpha) for Ridge

Appropriate alpha value : 0.1

ridge = Ridge(alpha=0.1)
ridge.fit(train_scaled, train_target)
print(ridge.score(train_scaled, train_target))
print(ridge.score(test_scaled, test_target))
0.9903815817570365
0.9827976465386884
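
The way alpha shrinks the coefficients is easiest to see with a single feature and no intercept, where the ridge solution has a simple closed form. The sketch below is a simplified illustration under those assumptions (the helper `ridge_1d` is hypothetical), not what `Ridge` runs internally:

```python
def ridge_1d(x, y, alpha):
    """Minimizer of sum((y - x*w)^2) + alpha * w^2 for one feature, no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # exactly y = 2x
coefs = [ridge_1d(x, y, alpha) for alpha in (0, 14, 126)]
print(coefs)  # [2.0, 1.0, 0.2]
```

As alpha grows, the coefficient moves toward zero but never reaches it exactly.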

Lasso

Applies the penalty based on the absolute values of the coefficients

from sklearn.linear_model import Lasso
lasso = Lasso()
lasso.fit(train_scaled, train_target)
print(lasso.score(train_scaled, train_target))
0.989789897208096
lasso.score(test_scaled, test_target)
0.9800593698421883
train_score = list()
test_score = list()

alpha_list = [0.001, 0.01, 0.1, 1, 10, 100]

for alpha in alpha_list:
    lasso = Lasso(alpha = alpha)
    lasso.fit(train_scaled, train_target)
    
    train_score.append(lasso.score(train_scaled, train_target))
    test_score.append(lasso.score(test_scaled, test_target))
plt.plot(np.log10(alpha_list), train_score)
plt.plot(np.log10(alpha_list), test_score)

plt.xlabel('log10(alpha)')
plt.ylabel('score')
plt.show()

Figure: training and test R^2 scores plotted against log10(alpha) for Lasso

lasso = Lasso(alpha=10)
lasso.fit(train_scaled, train_target)
print(lasso.score(train_scaled, train_target))
print(lasso.score(test_scaled, test_target))
0.9888067471131867
0.9824470598706695
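
One practical difference from Ridge is that the absolute-value penalty can drive a coefficient exactly to zero. A simplified single-feature, no-intercept sketch shows this; the helper `lasso_1d` is hypothetical and assumes the objective (1/(2n)) * sum((y - x*w)^2) + alpha * |w|:

```python
def lasso_1d(x, y, alpha):
    """Single-feature, no-intercept lasso solution via soft-thresholding."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n   # average of x*y
    c = sum(a * a for a in x) / n                # average of x^2
    shrunk = max(abs(rho) - alpha, 0.0)          # soft-thresholding step
    sign = 1.0 if rho >= 0 else -1.0
    return sign * shrunk / c

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # exactly y = 2x
w_none = lasso_1d(x, y, 0.0)     # no penalty: recovers the slope 2
w_mild = lasso_1d(x, y, 1.0)     # shrunk, but still nonzero
w_strong = lasso_1d(x, y, 10.0)  # penalty exceeds rho: coefficient becomes 0
print(w_none, w_mild, w_strong)
```

On a fitted scikit-learn model, the coefficients that were eliminated this way can be counted from the `coef_` attribute, for example with `np.sum(lasso.coef_ == 0)`.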