Principal Component Analysis (PCA)

  • Finds the directions along which the data have the largest variance, that is, along which the data are most spread out
  • One of the most widely used dimensionality reduction algorithms for mapping high-dimensional data to a lower dimension
  • The core question: if we lower the dimension by orthogonally projecting the data, which vector should we project onto so that the structure of the original data is preserved as well as possible?
  • The projection is chosen so that the difference between the original data points and their lower-dimensional representation is as small as possible
  • PCA is mainly used when there are too many variables and we want to model with only a few important ones. The few it keeps are principal components, which are linear combinations of the original variables rather than a subset of them.

Why PCA is needed

1) The curse of dimensionality

  • As the number of features in a dataset grows, so does its dimensionality

  • As the dimension increases, the volume of the data space grows exponentially -> the data become increasingly sparse

  • As the dimension increases, the distances between data points also grow -> the model becomes more complex and the risk of overfitting rises
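The growth in distance can be checked with a small experiment. The sketch below is illustrative only: it assumes points drawn uniformly from the unit hypercube, and the helper name `mean_pairwise_distance` and the sample sizes are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def mean_pairwise_distance(n_points, n_dims):
    """Average Euclidean distance between points drawn uniformly from the unit hypercube."""
    points = rng.random((n_points, n_dims))
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep the upper triangle only: each pair once, no self-distances
    upper = np.triu_indices(n_points, k=1)
    return dists[upper].mean()


dims = [1, 2, 10, 100]
mean_distances = [mean_pairwise_distance(200, d) for d in dims]
for d, dist in zip(dims, mean_distances):
    print(f"{d:>3} dims: mean distance = {dist:.3f}")
```

With the number of points held fixed, the average distance keeps growing as dimensions are added, which is the sparsity effect described above.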

2) Multicollinearity

  • An assumption of regression analysis: the independent variables must not be highly correlated with one another
  • Training on features that depend strongly on each other can lead to overfitting
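As a rough illustration of why PCA helps here, the sketch below builds two nearly identical synthetic features (an assumption made for this example, not the post's data) and compares their correlation before and after PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)  # almost a copy of x1, so the two are highly correlated
X = np.column_stack([x1, x2])

# Correlation between the two original features
corr_before = np.corrcoef(X, rowvar=False)[0, 1]

# Correlation between the two principal component scores
scores = PCA(n_components=2).fit_transform(X)
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]

print(f"correlation before PCA: {corr_before:.3f}")
print(f"correlation after PCA:  {corr_after:.3e}")
```

The component scores are uncorrelated by construction, so a regression fitted on them does not face the multicollinearity present in the original features.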

The PCA process

1) Reducing the existing data

  • The data can be reduced onto the x1 axis or onto the x2 axis
  • In doing so, data points end up overlapping -> information is lost
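A tiny made-up example shows the overlap: four distinct 2D points collapse to two distinct values whichever axis they are reduced onto.

```python
import numpy as np

# Four distinct points on a 2 x 2 grid (illustrative values only)
points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]])

onto_x1 = points[:, 0]  # keep only the x1 coordinate
onto_x2 = points[:, 1]  # keep only the x2 coordinate

n_distinct_original = len(np.unique(points, axis=0))
n_distinct_x1 = len(np.unique(onto_x1))
n_distinct_x2 = len(np.unique(onto_x2))

print(n_distinct_original, n_distinct_x1, n_distinct_x2)
```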

2) Finding a new axis

  • We need a new axis that is neither the x1 axis nor the x2 axis
  • The axis should be chosen so that the variance, the spread of the points along it, is as large as possible
  • The vector along this axis is called a principal component.
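The sketch below compares the variance kept by each original axis with the variance kept by the new axis. It assumes synthetic correlated 2D data and finds the axis from the eigenvectors of the covariance matrix, which is one standard way to compute it and not necessarily what a given library does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=1000)  # correlated with x1
X = np.column_stack([x1, x2])
X_centered = X - X.mean(axis=0)

# Eigen-decomposition of the sample covariance matrix (eigh returns eigenvalues in ascending order)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
first_axis = eigenvectors[:, -1]  # unit vector with the largest eigenvalue

var_x1 = X_centered[:, 0].var(ddof=1)
var_x2 = X_centered[:, 1].var(ddof=1)
var_pc1 = (X_centered @ first_axis).var(ddof=1)

print(f"variance along x1:       {var_x1:.3f}")
print(f"variance along x2:       {var_x2:.3f}")
print(f"variance along new axis: {var_pc1:.3f}")
```

The variance along the new axis equals the largest eigenvalue and is never smaller than the variance along either original axis.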

Which vector should the original data be projected onto (by taking the inner product) to give the best result?
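One common way to answer this question is to maximize the variance of the projected data. The derivation below is a sketch in notation that is assumed here rather than taken from the post: the x_i are the centered data points, w is a unit vector, and Σ is the sample covariance matrix.

```latex
% Variance of the data after projecting onto a unit vector w
\Sigma = \frac{1}{n-1}\sum_{i=1}^{n} x_i x_i^{\top}, \qquad
\operatorname{Var}(w^{\top}x) = w^{\top}\Sigma w

% Optimization problem
\max_{w}\; w^{\top}\Sigma w \quad \text{subject to} \quad w^{\top}w = 1

% Lagrangian and stationarity condition
\mathcal{L}(w,\lambda) = w^{\top}\Sigma w - \lambda\,(w^{\top}w - 1), \qquad
\frac{\partial \mathcal{L}}{\partial w} = 2\Sigma w - 2\lambda w = 0
\;\Longrightarrow\; \Sigma w = \lambda w

% Value of the objective at a stationary point
w^{\top}\Sigma w = \lambda\, w^{\top}w = \lambda
```

So the best vector is an eigenvector of Σ, and because the projected variance equals its eigenvalue, the first principal component is the eigenvector with the largest eigenvalue.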

Example

Billboard Hot 100 chart audio features (11 features)

trackdf


Use PCA to reduce the 11 features to 2 and represent the current data with them

# Use every column except the first as a feature
featurelist = list(trackdf.columns)[1:]
tracktrain = trackdf[featurelist]
tracktrain


Standardizing tracktrain

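import pandas as pd
from sklearn.preprocessing import StandardScaler
# Rescale every feature to zero mean and unit variance
scaler = StandardScaler()
tracktrain = scaler.fit_transform(tracktrain)
tracktrain = pd.DataFrame(tracktrain, columns=featurelist)  # fit_transform returns a NumPy array
tracktrain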
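To see what standardization does, the sketch below applies StandardScaler to a small synthetic frame. The column names `feature_a` and `feature_b` and their scales are hypothetical stand-ins, not columns of `trackdf`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for tracktrain: two features on very different scales
demo = pd.DataFrame({
    "feature_a": rng.normal(loc=120.0, scale=30.0, size=200),
    "feature_b": rng.normal(loc=0.5, scale=0.1, size=200),
})

scaled = StandardScaler().fit_transform(demo)
scaled_df = pd.DataFrame(scaled, columns=demo.columns)

col_means = scaled_df.mean().to_numpy()
col_stds = scaled_df.std(ddof=0).to_numpy()  # ddof=0 matches the population std that StandardScaler uses
print("means:", col_means)
print("stds: ", col_stds)
```

After scaling, every column has mean 0 and standard deviation 1 up to floating-point error, so no feature dominates the variance just because of its units.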


PCA -> reduce to 2-dimensional data

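from sklearn.decomposition import PCA

pca = PCA(n_components=2)
# fit_transform fits the model and projects the data in one step, so a separate fit call is not needed
principaltransform = pca.fit_transform(tracktrain)
pc = pca.components_  # principal axes, shape (2, n_features)
principalDF = pd.DataFrame(data=principaltransform, columns=['principalcomponent01', 'principalcomponent02'])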

Principal component vectors

pc
array([[-0.30323406, -0.5291131 , -0.06083497, -0.45635198,  0.10039601,
        -0.01568293,  0.49884875, -0.13077555, -0.0150439 , -0.36232136,
         0.09137661],
       [ 0.44035325, -0.22264418,  0.33311453, -0.33782897, -0.39143413,
         0.41255903,  0.0387862 ,  0.30700042, -0.27256214,  0.17523132,
         0.08514396]])
principalDF
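A natural follow-up is to ask how much of the original variance the two components keep. The sketch below uses scikit-learn's `explained_variance_ratio_` attribute on synthetic data with 11 columns. The data are an assumption made for illustration (three hidden factors plus noise), so the printed numbers say nothing about the Billboard dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 11 audio features: 300 rows driven by 3 hidden factors plus noise
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 11))
X = latent @ mixing + 0.5 * rng.normal(size=(300, 11))
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

components = pca.components_            # one row per principal axis
ratio = pca.explained_variance_ratio_   # share of total variance kept by each component
gram = components @ components.T        # identity matrix if the axes are orthonormal

print("components shape:", components.shape)
print("explained variance ratio:", ratio, "total:", ratio.sum())
```

Each row of `components_` is a unit vector, the rows are orthogonal to each other, and the ratios are sorted in decreasing order. If the total is low, two components may be too few to represent the data well.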


Using another algorithm

Linear regression

  • Find the linear equation that best describes the relationship between the feature and the target
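import numpy as np
from sklearn.linear_model import LinearRegression

line_filter = LinearRegression()
# scikit-learn expects a 2D feature matrix, hence reshape(-1, 1)
p_component01 = np.array(principalDF["principalcomponent01"]).reshape(-1, 1)
p_component02 = np.array(principalDF["principalcomponent02"])
line_filter.fit(p_component01, p_component02)
print("z = {}*x + {}".format(line_filter.coef_[0], line_filter.intercept_))
z = 5.296358786892319e-16*x + -4.440892098500625e-17

The slope and intercept are both zero up to floating-point error. This is expected: principal component scores are centered and uncorrelated with each other, so regressing one on the other gives a flat line.
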
import matplotlib.pyplot as plt

plt.scatter(principalDF["principalcomponent01"], principalDF["principalcomponent02"])
# Draw the fitted line between x = -3 and x = 5, using scalar coefficients
slope, intercept = line_filter.coef_[0], line_filter.intercept_
plt.plot([-3, 5], [-3 * slope + intercept, 5 * slope + intercept], color="r")

plt.ylim(-5, 5)
plt.xlabel("principalcomponent01")
plt.ylabel("principalcomponent02")
plt.show()
