Preparation

  • train set์„ ์•„๋ž˜์™€ ๊ฐ™์ด ์„ธํŒ…ํ•˜๊ณ  ํ›ˆ๋ จํ•ฉ๋‹ˆ๋‹ค

Train Set


import numpy as np

perch_length = np.array(
   [8.4, 13.7, 15.0, 16.2, 17.4, 18.0, 18.7, 19.0, 19.6, 20.0,
    21.0, 21.0, 21.0, 21.3, 22.0, 22.0, 22.0, 22.0, 22.0, 22.5,
    22.5, 22.7, 23.0, 23.5, 24.0, 24.0, 24.6, 25.0, 25.6, 26.5,
    27.3, 27.5, 27.5, 27.5, 28.0, 28.7, 30.0, 32.8, 34.5, 35.0,
    36.5, 36.0, 37.0, 37.0, 39.0, 39.0, 39.0, 40.0, 40.0, 40.0,
    40.0, 42.0, 43.0, 43.0, 43.5, 44.0]
    )
   
perch_weight = np.array(
   [5.9, 32.0, 40.0, 51.5, 70.0, 100.0, 78.0, 80.0, 85.0, 85.0,
    110.0, 115.0, 125.0, 130.0, 120.0, 120.0, 130.0, 135.0, 110.0,
    130.0, 150.0, 145.0, 150.0, 170.0, 225.0, 145.0, 188.0, 180.0,
    197.0, 218.0, 300.0, 260.0, 265.0, 250.0, 250.0, 300.0, 320.0,
    514.0, 556.0, 840.0, 685.0, 700.0, 700.0, 690.0, 900.0, 650.0,
    820.0, 850.0, 900.0, 1015.0, 820.0, 1100.0, 1000.0, 1100.0,
    1000.0, 1000.0]
    )

Train Set๊ณผ Test Set์œผ๋กœ ๋‚˜๋ˆ„๊ธฐ

from sklearn.model_selection import train_test_split

# ํ›ˆ๋ จ ์„ธํŠธ์™€ ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ๋‚˜๋ˆ„๊ธฐ
train_input, test_input, train_target, test_target = train_test_split(
   perch_length, perch_weight, random_state=42)

# ํ›ˆ๋ จ ์„ธํŠธ์™€ ํ…Œ์ŠคํŠธ ์„ธํŠธ๋ฅผ 2์ฐจ์› ๋ฐฐ์—ด๋กœ ๋ณ€๊ฒฝ
train_input = train_input.reshape(-1, 1)
test_input = test_input.reshape(-1, 1)

Train

from sklearn.neighbors import KNeighborsRegressor

knr = KNeighborsRegressor(n_neighbors=3)

# k-์ตœ๊ทผ์ ‘ ์ด์›ƒ ํšŒ๊ท€ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•ฉ๋‹ˆ๋‹ค
knr.fit(train_input, train_target)

Score

print(knr.score(test_input, test_target))

output : 0.992809406101064

k-NN์˜ ํ•œ๊ณ„

  • k-NN ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ๊ธธ์ด๊ฐ€ 50cm์ธ ๋†์–ด์˜ ๋ฌด๊ฒŒ ์˜ˆ์ธก

Prediction

print(knr.predict([[50]]))

output : 1033.33333333

  • Train Set๊ณผ ์˜ˆ์ธก๊ฐ’ ์‹œ๊ฐํ™”
# 50cm ๋†์–ด์˜ ์ด์›ƒ์„ ๊ตฌํ•ฉ๋‹ˆ๋‹ค
distances, indexes = knr.kneighbors([[50]])

# ํ›ˆ๋ จ ์„ธํŠธ์˜ ์‚ฐ์ ๋„๋ฅผ ๊ทธ๋ฆฝ๋‹ˆ๋‹ค
plt.scatter(train_input, train_target)
# ํ›ˆ๋ จ ์„ธํŠธ ์ค‘์—์„œ ์ด์›ƒ ์ƒ˜ํ”Œ๋งŒ ๋‹ค์‹œ ๊ทธ๋ฆฝ๋‹ˆ๋‹ค
plt.scatter(train_input[indexes], train_target[indexes], marker='D')
# 50cm ๋†์–ด ๋ฐ์ดํ„ฐ
plt.scatter(50, 1033, marker='^')
plt.xlabel('length')
plt.ylabel('weight')
plt.show()

output :

  • ์œ„์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ k-NN ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ๊ธธ์ด๊ฐ€ 100cm์ธ ๋†์–ด์˜ ๋ฌด๊ฒŒ ์˜ˆ์ธก

k-NN์˜ ํ•œ๊ณ„

  • k-NN(k-์ตœ๊ทผ์ ‘ ์•Œ๊ณ ๋ฆฌ์ฆ˜) ์€ input ๊ฐ’๊ณผ ๊ฐ€์žฅ ์ตœ๊ทผ์ ‘ํ•˜๋Š” ์ด์›ƒ n๊ฐœ์˜ ํ‰๊ท  ํ•จ์œผ๋กœ์จ ์˜ˆ์ธกํ•˜๋Š” ๋ชจ๋ธ์ž„(n = n_neighbor)
  • rain set์˜ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์˜ˆ์ธก์„ ํ•˜์ง€ ๋ชปํ•จ
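
A quick check of this averaging behavior (a sketch, not part of the original notes), reusing knr and the train arrays from above:

import numpy as np

# the k-NN prediction is just the mean of the 3 nearest neighbors' targets
distances, indexes = knr.kneighbors([[50]])
print(np.mean(train_target[indexes]))   # ~1033.33, same as knr.predict([[50]])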

Linear Regression (์„ ํ˜• ํšŒ๊ท€)

  • ์œ„์™€ ๊ฐ™์€ ํ•œ๊ณ„์ ์„ ๊ทน๋ณตํ•  ์ˆ˜ ์žˆ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์„ ํ˜• ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ(์ง์„ ) ์˜ˆ์ธก์„ ์ˆ˜ํ–‰
  • ๊ฐ„๋‹จํ•˜๋ฉฐ ์ง๊ด€์ ์ด๋ฉฐ ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚จ

Train

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

# ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ ํ›ˆ๋ จ
lr.fit(train_input, train_target)

output : LinearRegression()

Prediction

# 50cm ๋†์–ด์— ๋Œ€ํ•œ ์˜ˆ์ธก
print(lr.predict([[50]]))

output : [1241.83860323]

  • ์ง์„ (y = ax + b)์„ ๊ทธ๋ฆฌ๊ธฐ ์œ„ํ•ด์„œ๋Š” ๊ธฐ์šธ๊ธฐ์™€ ์ ˆํŽธ, ์ฆ‰ a์™€ b๊ฐ’์ด ์žˆ์–ด์•ผํ•จ
  • ์„ ํ˜• ํšŒ๊ท€ ๋ถ„์„์˜ ๋ชฉ์  -> ์šฐ๋ฆฌ๊ฐ€ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์žฅ ์ž˜ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋Š” a, b๊ฐ’์„ ์ฐพ๋Š” ๊ฒƒ
  • LinearRegression ํด๋ž˜์Šค๊ฐ€ ์ฐพ์€ a, b๊ฐ’์€ lr ๊ฐ์ฒด์˜ coef_์™€ intercept_์— ์ €์žฅ๋˜์–ด ์žˆ์Œ

LinearRegression

print(lr.coef_, lr.intercept_)

output : [39.01714496] -709.0186449535477
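
As a quick sanity check (not in the original notes), plugging 50 into y = ax + b with these values reproduces the prediction above:

# a * 50 + b should match lr.predict([[50]])
print(lr.coef_[0] * 50 + lr.intercept_)   # ~1241.84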

  • ์ฐพ์€ a,b๊ฐ’์„ ์ด์šฉํ•˜์—ฌ ์„ ํ˜•ํ•จ์ˆ˜๋ฅผ ๊ทธ๋ฆฐ๋‹ค
# ํ›ˆ๋ จ ์„ธํŠธ์˜ ์‚ฐ์ ๋„๋ฅผ ๊ทธ๋ฆฝ๋‹ˆ๋‹ค
plt.scatter(train_input, train_target)

output :

# 15์—์„œ 50๊นŒ์ง€ 1์ฐจ ๋ฐฉ์ •์‹ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ฆฝ๋‹ˆ๋‹ค
plt.plot([15, 50], [15*lr.coef_+lr.intercept_, 50*lr.coef_+lr.intercept_])

# 50cm ๋†์–ด ๋ฐ์ดํ„ฐ
plt.scatter(50, 1241.8, marker='^')
plt.xlabel('length')
plt.ylabel('weight')
plt.show()

output :

  • Check the lr scores
print(lr.score(train_input, train_target))
print(lr.score(test_input, test_target))

output :

0.9398463339976041
0.8247503123313559

Loss - ์„ ํ˜• ํšŒ๊ท€์—์„œ ๋ฐœ์ƒํ•˜๋Š” ์˜ค์ฐจ, ์†์‹ค (added by emma)

  • ๋ฐ์ดํ„ฐ๋“ค์„ ์„ ์œผ๋กœ ํ‘œํ˜„ํ•˜๋Š” ๊ฒƒ์€ ์‹ค์ œ ๋ฐ์ดํ„ฐ์™€ ์•ฝ๊ฐ„์˜ ์ฐจ์ด๊ฐ€ ๋ฐœ์ƒ
  • ์˜ค์ฐจ, ์†์‹ค(Loss)
  • ์•„๋ž˜ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด A๋Š” 3, B ๋Š” 1๋งŒํผ์˜ ์†์‹ค์ด ๋ฐœ์ƒํ•จ
  • ์—„๋ฐ€ํžˆ ๋ณด๋ฉด +, -๋ฅผ ๊ณ ๋ คํ•˜์ง€ ์•Š๊ณ  ์ด์•ผ๊ธฐ ํ•œ ๊ฒƒ์ž„
  • ์„ ๊ณผ ์‹ค์ œ ๋ฐ์ดํ„ฐ ์‚ฌ์ด์— ์–ผ๋งˆ๋‚˜ ์˜ค์ฐจ๊ฐ€ ์žˆ๋Š”์ง€ ๊ตฌํ•˜๋ ค๋ฉด ์–‘์ˆ˜,์Œ์ˆ˜ ๊ด€๊ณ„์—†์ด ๋™์ผํ•˜๊ฒŒ ๋ฐ˜์˜๋˜๋„๋ก
  • ๋ชจ๋“  ์†์‹ค์— ์ œ๊ณฑ์„ ํ•ด์ฃผ๋Š”๊ฒŒ ์ข‹์Œ (์œ„ ๊ทธ๋ฆผ์˜ ์˜ค์ฐจ๋ฅผ ์ œ๊ณฑ๊ฐ’์œผ๋กœ ์„ค๋ช…ํ•˜๋ฉด A๋Š” 9, B๋Š” 1์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Œ)
  • ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์œผ๋กœ ์†์‹ค์„ ๊ตฌํ•˜๋Š” ๊ฑธ ํ‰๊ท  ์ œ๊ณฑ ์˜ค์ฐจ(mean squared error, MSE)
  • ์†์‹ค์„ ๊ตฌํ• ๋•Œ ๊ฐ€์žฅ ๋„๋ฆฌ ์“ฐ์ž„
  • ์ œ๊ณฑํ•˜์ง€ ์•Š๊ณ  ์ ˆ๋Œ€๊ฐ’์œผ๋กœ ํ‰๊ท ์„ ๊ตฌํ•˜๋Š” ํ‰๊ท  ์ ˆ๋Œ€ ์˜ค์ฐจ(mean absolute error, MAE)
  • MSE์™€ MAE๋ฅผ ์ ˆ์ถฉํ•œ ํ›„๋ฒ„ ์†์‹ค(Huber loss), 1-MSE/VAR์œผ๋กœ ๊ตฌํ•˜๋Š” ๊ฒฐ์ • ๊ณ„์ˆ˜(coefficient of determination) ๋“ฑ์ด ์žˆ์Œ
  • ๊ฒฐ๊ตญ ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ์˜ ๋ชฉํ‘œ : ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ๋‚˜ํƒ€๋‚˜๋Š” ์˜ค์ฐจ์˜ ํ‰๊ท ์„ ์ตœ์†Œํ™”ํ•  ์ˆ˜ ์žˆ๋Š” ์ตœ์ ์˜ ๊ธฐ์šธ๊ธฐ์™€ ์ ˆํŽธ์„ ์ฐพ๋Š” ๊ฒƒ
  • ์†์‹ค์„ ์ตœ์†Œํ™” ํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ• : ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ• (Gradient Descent)

Linear Regression ์˜ ํ•œ๊ณ„

  • ์•„๋ž˜ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด lr์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ธธ์ด๊ฐ€ 5cm์ธ ๋†์–ด๋ฅผ ์˜ˆ์ธกํ–ˆ์„ ๊ฒฝ์šฐ ๋ฌด๊ฒŒ๊ฐ€ ์Œ์ˆ˜๊ฐ’์ด ๋‚˜์˜ด
  • ์ด๋Š” ํ˜„์‹ค์—์„œ๋Š” ๋‚˜์˜ฌ ์ˆ˜ ์—†๋Š” ๊ฒฝ์šฐ์ž„

Polynomial Regression (๋‹คํ•ญ์‹ ํšŒ๊ท€)

  • ์ด๋Ÿฌํ•œ ํ•œ๊ณ„์ ์„ ๊ทน๋ณตํ•˜๊ณ  ๊ณก์„ ์œผ๋กœ ์„ ํ˜•์„ ๊ทธ๋ ค, ๋ณด๋‹ค ์ตœ์ ํ™”๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ฐพ์„ ์ˆ˜ ์žˆ์Œ
  • ์ด์ฐจํ•จ์ˆ˜๋ฅผ ์ด์šฉ (y = ax^2 + bx + c)

  • train set ์ƒ์„ฑ

    ๋„˜ํŒŒ์ด๋ฅผ ์ด์šฉํ•˜์—ฌ x์˜ ์ œ๊ณฑ๊ฐ’๊ณผ x๊ฐ’์œผ๋กœ train set๊ณผ test set ์ƒ์„ฑ

    train_poly = np.column_stack((train_input ** 2, train_input))
    test_poly = np.column_stack((test_input ** 2, test_input))
    
    print(train_poly.shape, test_poly.shape)
    

    output : (42, 2) (14, 2)

  • train_poly๋งŒ ์ฐ์–ด๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ด array๊ฐ€ ์ƒ์„ฑ๋จ
    train_poly
    

    output:

    array([[ 384.16, 19.6 ], [ 484. , 22. ], [ 349.69, 18.7 ], [ 302.76, 17.4 ], [1296. , 36. ], [ 625. , 25. ], [1600. , 40. ], [1521. , 39. ], [1849. , 43. ], [ 484. , 22. ], [ 400. , 20. ], [ 484. , 22. ], [ 576. , 24. ], [ 756.25, 27.5 ], [1849. , 43. ], [1600. , 40. ], [ 576. , 24. ] โ€ฆ

  • ๊ธธ์ด๊ฐ€ 50cm์ธ ๋†์–ด์˜ ๋ฌด๊ฒŒ ์˜ˆ์ธก

    Prediction

    lr = LinearRegression()
    lr.fit(train_poly, train_target)
    
    print(lr.predict([[50**2, 50]]))
    

    output : [1573.98423528]

  • To build a curve of the form y = ax^2 + bx + c, the LinearRegression class can be used to find the values of a, b, and c
    print(lr.coef_, lr.intercept_)
    

    output : [ 1.01433211 -21.55792498] 116.0502107827827
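
  • As a quick sanity check (not in the original notes), plugging 50 into y = ax^2 + bx + c with these coefficients reproduces the prediction above
    a, b = lr.coef_      # a ≈ 1.01, b ≈ -21.56
    c = lr.intercept_    # c ≈ 116.05
    print(a * 50**2 + b * 50 + c)   # ~1573.98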

  • ์ฐพ์€ a,b,c ๊ฐ’์„ ์ด์šฉํ•˜์—ฌ ๋‹คํ•ญํ•จ์ˆ˜๋ฅผ ๊ทธ๋ฆฐ๋‹ค
    # ๊ตฌ๊ฐ„๋ณ„ ์ง์„ ์„ ๊ทธ๋ฆฌ๊ธฐ ์œ„ํ•ด 15์—์„œ 49๊นŒ์ง€ ์ •์ˆ˜ ๋ฐฐ์—ด์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค
    point = np.arange(15, 50)
    
    # ํ›ˆ๋ จ ์„ธํŠธ์˜ ์‚ฐ์ ๋„๋ฅผ ๊ทธ๋ฆฝ๋‹ˆ๋‹ค
    plt.scatter(train_input, train_target)
    
    # 15์—์„œ 49๊นŒ์ง€ 2์ฐจ ๋ฐฉ์ •์‹ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ฆฝ๋‹ˆ๋‹ค
    plt.plot(point, 1.01*point**2 - 21.6*point + 116.05)
    
    # 50cm ๋†์–ด ๋ฐ์ดํ„ฐ
    plt.scatter([50], [1574], marker='^')
    plt.xlabel('length')
    plt.ylabel('weight')
    plt.show()
    

    output :