Feature Engineering
- The process of combining the given features to derive new features in order to improve model accuracy
Data Preparation
- Features: the length, height, and thickness of each perch
- Target: the weight of each perch
import pandas as pd
df = pd.read_csv("https://bit.ly/perch_csv_data")
perch_full = df.to_numpy()
perch_full
array([[ 8.4 , 2.11, 1.41],
[13.7 , 3.53, 2. ],
[15. , 3.82, 2.43],
[16.2 , 4.59, 2.63],
[17.4 , 4.59, 2.94],
[18. , 5.22, 3.32],
[18.7 , 5.2 , 3.12],
[19. , 5.64, 3.05],
[19.6 , 5.14, 3.04],
[20. , 5.08, 2.77],
[21. , 5.69, 3.56],
[21. , 5.92, 3.31],
[21. , 5.69, 3.67],
[21.3 , 6.38, 3.53],
[22. , 6.11, 3.41],
[22. , 5.64, 3.52],
[22. , 6.11, 3.52],
[22. , 5.88, 3.52],
[22. , 5.52, 4. ],
[22.5 , 5.86, 3.62],
[22.5 , 6.79, 3.62],
[22.7 , 5.95, 3.63],
[23. , 5.22, 3.63],
[23.5 , 6.28, 3.72],
[24. , 7.29, 3.72],
[24. , 6.38, 3.82],
[24.6 , 6.73, 4.17],
[25. , 6.44, 3.68],
[25.6 , 6.56, 4.24],
[26.5 , 7.17, 4.14],
[27.3 , 8.32, 5.14],
[27.5 , 7.17, 4.34],
[27.5 , 7.05, 4.34],
[27.5 , 7.28, 4.57],
[28. , 7.82, 4.2 ],
[28.7 , 7.59, 4.64],
[30. , 7.62, 4.77],
[32.8 , 10.03, 6.02],
[34.5 , 10.26, 6.39],
[35. , 11.49, 7.8 ],
[36.5 , 10.88, 6.86],
[36. , 10.61, 6.74],
[37. , 10.84, 6.26],
[37. , 10.57, 6.37],
[39. , 11.14, 7.49],
[39. , 11.14, 6. ],
[39. , 12.43, 7.35],
[40. , 11.93, 7.11],
[40. , 11.73, 7.22],
[40. , 12.38, 7.46],
[40. , 11.14, 6.63],
[42. , 12.8 , 6.87],
[43. , 11.93, 7.28],
[43. , 12.51, 7.42],
[43.5 , 12.6 , 8.14],
[44. , 12.49, 7.6 ]])
import numpy as np
perch_weight = np.array([5.9, 32.0, 40.0, 51.5, 70.0, 100.0, 78.0, 80.0, 85.0, 85.0, 110.0,
115.0, 125.0, 130.0, 120.0, 120.0, 130.0, 135.0, 110.0, 130.0,
150.0, 145.0, 150.0, 170.0, 225.0, 145.0, 188.0, 180.0, 197.0,
218.0, 300.0, 260.0, 265.0, 250.0, 250.0, 300.0, 320.0, 514.0,
556.0, 840.0, 685.0, 700.0, 700.0, 690.0, 900.0, 650.0, 820.0,
850.0, 900.0, 1015.0, 820.0, 1100.0, 1000.0, 1100.0, 1000.0,
1000.0])
Splitting into a Training Set and a Test Set
from sklearn.model_selection import train_test_split
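train_input, test_input, train_target, test_target = train_test_split(
    perch_full, perch_weight, random_state=42
)
# Baseline: linear regression on the three original features
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train_input, train_target)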
LinearRegression()
lr.score(train_input, train_target)
0.9559326821885706
lr.score(test_input, test_target)
0.8796419177546367
Creating New Features
- New features can be generated with a transformer provided by scikit-learn
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(include_bias=False)
poly.fit(train_input)
train_poly = poly.transform(train_input)
6 additional features are created (3 original + 6 new = 9 columns)
train_poly.shape
(42, 9)
- get_feature_names_out()
- Shows which combination of the inputs produced each feature
- Replaces get_feature_names(), which was deprecated in scikit-learn 1.0 and removed in 1.2
poly.get_feature_names_out().tolist()
['x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2', 'x2^2']
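As a small illustration of what the transformer computes, the sketch below expands one made-up sample by hand and compares it with PolynomialFeatures. The values 2, 3, 5 and the name poly_demo are chosen only for this example and are not taken from the perch data. With degree 2 and include_bias=False, the output holds the three original values followed by every square and pairwise product, in the same order as the feature names listed above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical sample: x0=2, x1=3, x2=5 (illustrative values only)
sample = np.array([[2.0, 3.0, 5.0]])
x0, x1, x2 = sample[0]

# Degree-2 expansion written out by hand, in the order
# x0, x1, x2, x0^2, x0 x1, x0 x2, x1^2, x1 x2, x2^2
manual = np.array([[x0, x1, x2, x0**2, x0*x1, x0*x2, x1**2, x1*x2, x2**2]])

poly_demo = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly_demo.fit_transform(sample)

print(expanded)
print(np.allclose(expanded, manual))  # the two expansions should agree
```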
Training the Multiple Regression Model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train_poly, train_target)
LinearRegression()
- The score is very high
- As the number of features grows, linear regression becomes much more powerful
lr.score(train_poly, train_target)
0.9903183436982124
test_poly = poly.transform(test_input)
print(lr.score(test_poly, test_target))
0.9714559911594134
- Let's add even more features
- The degree parameter sets the maximum degree of the polynomial terms
poly = PolynomialFeatures(degree=5, include_bias=False)
poly.fit(train_input)
train_poly = poly.transform(train_input)
test_poly = poly.transform(test_input)
print(train_poly.shape)
(42, 55)
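The jump from 9 to 55 columns follows from counting monomials: with n input features and maximum degree d there are C(n + d, d) terms including the constant, and include_bias=False drops that constant. The helper below (count_poly_features is a name made up for this sketch) reproduces both column counts seen so far. It assumes Python 3.8 or later for math.comb.

```python
from math import comb  # available from Python 3.8

def count_poly_features(n_features: int, degree: int) -> int:
    """Number of polynomial terms up to `degree`, excluding the bias column."""
    return comb(n_features + degree, degree) - 1

print(count_poly_features(3, 2))  # 9
print(count_poly_features(3, 5))  # 55
```

With more columns (55) than training rows (42), a linear model has enough freedom to fit the training set almost exactly.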
lr2 = LinearRegression()
lr2.fit(train_poly, train_target)
print(lr2.score(train_poly, train_target))
0.9999999999991097
- The test set score comes out negative
- The linear model has become so flexible that it overfits the training set
- To reduce overfitting, the number of features has to be reduced
lr2.score(test_poly, test_target)
-144.40579242684848
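For a regressor, score() returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot, which is why a negative value is possible: it means the predictions are worse than always predicting the mean of the targets. The numbers below are made up purely to show the arithmetic, and r2_manual is a helper name introduced for this sketch.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and two sets of predictions (illustrative only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
good_pred = np.array([110.0, 190.0, 310.0, 390.0])
bad_pred = np.array([900.0, -500.0, 1200.0, -300.0])

def r2_manual(y, pred):
    ss_res = np.sum((y - pred) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares around the mean
    return 1.0 - ss_res / ss_tot

print(r2_manual(y_true, good_pred))  # close to 1: small errors
print(r2_manual(y_true, bad_pred))   # negative: worse than predicting the mean
print(np.isclose(r2_manual(y_true, bad_pred), r2_score(y_true, bad_pred)))
```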
Regularization
- Reference: https://light-tree.tistory.com/125
- Deliberately constraining a machine learning model so that it cannot fit the training data too closely
- In other words, keeping the model from overfitting the training set
- Linear regression models with regularization added: Ridge, Lasso
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_poly)
train_scaled = scaler.transform(train_poly)
test_scaled = scaler.transform(test_poly)
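The features are standardized before Ridge and Lasso because the penalty acts on coefficient magnitudes, and polynomial terms of different degree sit on very different scales. As a sketch of what StandardScaler does, the made-up matrix below is standardized by hand (subtract the column mean, divide by the column standard deviation) and compared with the transformer.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two columns on very different scales (illustrative only)
X = np.array([[1.0, 1000.0],
              [2.0, 8000.0],
              [3.0, 27000.0],
              [4.0, 64000.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)                 # population std (ddof=0), as StandardScaler uses
manual_scaled = (X - mean) / std

sk_scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual_scaled, sk_scaled))
print(sk_scaled.mean(axis=0), sk_scaled.std(axis=0))  # roughly 0 and 1 per column
```

As in the cell above, the scaler is fitted on the training set only and the same statistics are reused to transform the test set.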
Norm
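- norm: a method (or function) for measuring the size (or length) of a vector
- L2 norm: the straight-line (Euclidean) distance, the green line in the figure these notes refer to
- L1 norm: the sum of absolute differences along each axis (Manhattan distance), i.e. every path other than the green one in that figure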
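A minimal numeric check of the two norms, using a made-up vector (3, -4) chosen so the results are easy to verify by hand:

```python
import numpy as np

# Made-up 2-D vector (illustrative only)
v = np.array([3.0, -4.0])

l1 = np.sum(np.abs(v))         # |3| + |-4| = 7
l2 = np.sqrt(np.sum(v ** 2))   # sqrt(9 + 16) = 5

print(l1, np.linalg.norm(v, ord=1))
print(l2, np.linalg.norm(v, ord=2))
```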
L2 loss
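- The sum of the squared differences between the actual and predicted values (the squared L2 norm of the error vector)
L1 loss
- The sum of the absolute differences between the actual and predicted values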
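The two losses can be compared on a handful of made-up values. Note how the single large error (30) dominates the L2 loss far more than the L1 loss.

```python
import numpy as np

# Made-up actual and predicted values (illustrative only)
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 230.0])

residual = y_true - y_pred            # [-10, 10, -30]
l1_loss = np.sum(np.abs(residual))    # 10 + 10 + 30 = 50
l2_loss = np.sum(residual ** 2)       # 100 + 100 + 900 = 1100

print(l1_loss, l2_loss)
```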
L1 Regularization
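- Adds the absolute values of the weights to the cost function
- Training is pushed toward weights that are not large
- When the regularization strength (the constant multiplying the penalty, alpha in scikit-learn) is 0, there is no regularization effect
- Lasso is the corresponding model
L2 Regularization
- Adds the squares of the weights to the cost function
- As with L1 regularization, training is pushed toward weights that are not large
- Ridge is the corresponding model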
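To make the L2 penalty concrete: scikit-learn documents the Ridge objective as ||y - Xw||^2 + alpha * ||w||^2. The sketch below uses synthetic data and fit_intercept=False (an assumption made only to keep the algebra short) and checks Ridge against the closed-form minimizer w = (X^T X + alpha I)^-1 X^T y. The names true_w, ridge_demo, and ridge_cost are invented for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                 # synthetic features
true_w = np.array([2.0, -1.0, 0.5, 0.0])     # made-up coefficients
y = X @ true_w + rng.normal(scale=0.1, size=50)

alpha = 1.0

# Closed-form minimizer of ||y - Xw||^2 + alpha * ||w||^2 (no intercept)
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

ridge_demo = Ridge(alpha=alpha, fit_intercept=False)
ridge_demo.fit(X, y)

def ridge_cost(w):
    """Squared error plus the L2 penalty on the weights."""
    return np.sum((y - X @ w) ** 2) + alpha * np.sum(w ** 2)

# Ordinary least squares (no penalty) for comparison
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(w_closed, ridge_demo.coef_))
print(np.linalg.norm(w_closed) < np.linalg.norm(w_ols))   # penalty shrinks the weights
print(ridge_cost(w_closed) <= ridge_cost(w_ols))
```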
Ridge
- Applies the penalty based on the squared values of the coefficients
from sklearn.linear_model import Ridge
ridge=Ridge()
ridge.fit(train_scaled, train_target)
print(ridge.score(train_scaled, train_target))
0.9896101671037343
ridge.score(test_scaled, test_target)
0.9790693977615391
- The amount of regularization can be adjusted freely
- The alpha parameter controls the regularization strength
- A larger alpha means stronger regularization -> coefficients shrink further and the model is pushed toward underfitting
- Finding a suitable alpha: plot R^2 against alpha
- A good alpha is where the training and test scores are closest while both remain high
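The effect of alpha on the coefficients can be seen directly by tracking the size of the coefficient vector. This is a sketch on synthetic data, not the perch features; it reuses the same alpha grid as the notes.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 10))                        # synthetic, already on a common scale
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=40)   # made-up target

alphas = [0.001, 0.01, 0.1, 1, 10, 100]
coef_norms = []
for a in alphas:
    model = Ridge(alpha=a).fit(X, y)
    coef_norms.append(np.linalg.norm(model.coef_))   # L2 norm of the coefficients

for a, n in zip(alphas, coef_norms):
    print(f"alpha={a:<7} ||coef||={n:.4f}")
```

The printed norms should fall as alpha grows, which is the shrinkage described in the list above.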
import matplotlib.pyplot as plt
train_score = list()
test_score = list()
alpha_list = [0.001, 0.01, 0.1, 1, 10, 100]
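for alpha in alpha_list:
    # Fit one Ridge model per alpha and record both R^2 scores
    ridge = Ridge(alpha=alpha)
    ridge.fit(train_scaled, train_target)
    train_score.append(ridge.score(train_scaled, train_target))
    test_score.append(ridge.score(test_scaled, test_target))
# alpha spans several orders of magnitude, so plot against log10(alpha)
plt.plot(np.log10(alpha_list), train_score, label='train')
plt.plot(np.log10(alpha_list), test_score, label='test')
plt.xlabel('log10(alpha)')
plt.ylabel('R^2')
plt.legend()
plt.show()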
Suitable alpha value: 0.1
ridge = Ridge(alpha=0.1)
ridge.fit(train_scaled, train_target)
print(ridge.score(train_scaled, train_target))
print(ridge.score(test_scaled, test_target))
0.9903815817570365
0.9827976465386884
Lasso
- Applies the penalty based on the absolute values of the coefficients
from sklearn.linear_model import Lasso
lasso = Lasso()
lasso.fit(train_scaled, train_target)
print(lasso.score(train_scaled, train_target))
0.989789897208096
lasso.score(test_scaled, test_target)
0.9800593698421883
train_score = list()
test_score = list()
alpha_list = [0.001, 0.01, 0.1, 1, 10, 100]
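for alpha in alpha_list:
    # Fit one Lasso model per alpha and record both R^2 scores
    # (small alphas may emit a ConvergenceWarning; raising max_iter can help)
    lasso = Lasso(alpha=alpha)
    lasso.fit(train_scaled, train_target)
    train_score.append(lasso.score(train_scaled, train_target))
    test_score.append(lasso.score(test_scaled, test_target))
# alpha spans several orders of magnitude, so plot against log10(alpha)
plt.plot(np.log10(alpha_list), train_score, label='train')
plt.plot(np.log10(alpha_list), test_score, label='test')
plt.xlabel('log10(alpha)')
plt.ylabel('R^2')
plt.legend()
plt.show()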
lasso = Lasso(alpha=10)
lasso.fit(train_scaled, train_target)
print(lasso.score(train_scaled, train_target))
print(lasso.score(test_scaled, test_target))
0.9888067471131867
0.9824470598706695
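One practical difference between the two penalties is that the L1 penalty used by Lasso can drive coefficients to exactly zero, whereas the L2 penalty used by Ridge only shrinks them. The sketch below shows this on synthetic data in which only the first two of ten columns influence the target. The data, the alpha value, and the names lasso_demo and ridge_demo are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                    # synthetic features
# Only the first two columns actually influence the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso_demo = Lasso(alpha=0.5).fit(X, y)
ridge_demo = Ridge(alpha=0.5).fit(X, y)

print("zero coefficients (Lasso):", np.sum(lasso_demo.coef_ == 0))
print("zero coefficients (Ridge):", np.sum(ridge_demo.coef_ == 0))
```

Counting np.sum(lasso.coef_ == 0) on the perch model would show, in the same way, how many of the 55 polynomial features Lasso actually uses.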