Decision Tree

์ด๋ฏธ์ง€์ถœ์ฒ˜ : ALGOBEANS https://algobeans.com/2016/07/27/decision-trees-tutorial/

  • ๊ฒฐ๊ณผ ๋ชจ๋ธ์ด Tree ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์— Decision Tree
  • ์Šค๋ฌด๊ณ ๊ฐœ ํ•˜๋“ฏ์ด ์˜ˆ/์•„๋‹ˆ์˜ค ์งˆ๋ฌธ์„ ์ด์–ด ๋‚˜๊ฐ€๋ฉด์„œ ํ•™์Šต
  • ๊ฒฐ์ • ํŠธ๋ฆฌ์—์„œ ์งˆ๋ฌธ์ด๋‚˜ ์ •๋‹ต์€ ๋…ธ๋“œNode๋ผ๊ณ  ํ•จ
    • ๋งจ ์ฒ˜์Œ ๋ถ„๋ฅ˜ ๊ธฐ์ค€์„ Root Node
    • ์ค‘๊ฐ„ ๋ถ„๋ฅ˜ ๊ธฐ์ค€์„ Intermediate Node
    • ๋งˆ์ง€๋ง‰ ๋…ธ๋“œ๋ฅผ Terminal Node ๋˜๋Š” Leaf Node
  • ๊ฒฐ์ • ํŠธ๋ฆฌ์˜ ๊ธฐ๋ณธ ์•„์ด๋””์–ด๋Š” Leaf Node๊ฐ€ ๊ฐ€์žฅ ์„ž์ด์ง€ ์•Š์€ ์ƒํƒœ๋กœ ์™„์ „ํžˆ ๋ถ„๋ฅ˜๋˜๋Š” ๊ฒƒ
  • ๋ถ„๋ฅ˜์™€ ํšŒ๊ท€ ๋ฌธ์ œ์— ๋„๋ฆฌ ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ
    • ๋ถ„๋ฅ˜ : Leaf Node์— ๊ฐ€์žฅ ๋งŽ์€ ํด๋ž˜์Šค๊ฐ€ ์˜ˆ์ธก ํด๋ž˜์Šค๊ฐ€ ๋จ
    • ํšŒ๊ท€ : Leaf Node์— ๋„๋‹ฌํ•œ ์ƒ˜ํ”Œ์˜ ํƒ€๊ฒŸ์„ ํ‰๊ท ํ•˜์—ฌ ์˜ˆ์ธก๊ฐ’์œผ๋กœ ์‚ฌ์šฉ

๋ถˆ์ˆœ๋„ Impurity

  • ๋ถ„๊ธฐ ๊ธฐ์ค€์„ ์„ ํƒํ•˜๊ธฐ ์œ„ํ•ด ๋ถˆ์ˆœ๋„๋ผ๋Š” ๊ฐœ๋…์„ ์‚ฌ์šฉ
  • ๊ฒฐ์ • ํŠธ๋ฆฌ๊ฐ€ ์ตœ์ ์˜ ์งˆ๋ฌธ์„ ์ฐพ๊ธฐ ์œ„ํ•œ ๊ธฐ์ค€
  • ๋ณต์žก์„ฑ์„ ์˜๋ฏธํ•˜๋ฉฐ, ํ•ด๋‹น ๋ฒ”์ฃผ ์•ˆ์— ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๊ฐ€ ์–ผ๋งˆ๋‚˜ ์„ž์—ฌ ์žˆ๋Š”์ง€๋ฅผ ๋œปํ•จ
  • ๋‹ค์–‘ํ•œ ๊ฐœ์ฒด๋“ค์ด ์„ž์—ฌ ์žˆ์„์ˆ˜๋ก ๋ถˆ์ˆœ๋„๊ฐ€ ๋†’์•„์ง
  • ๋ถ„๊ธฐ ๊ธฐ์ค€ ์„ค์ • ์‹œ ํ˜„์žฌ๋…ธ๋“œ์˜ ๋ถˆ์ˆœ๋„์— ๋น„ํ•ด ์ž์‹๋…ธ๋“œ์˜ ๋ถˆ์ˆœ๋„๊ฐ€ ๊ฐ์†Œ๋˜๋„๋ก ์„ค์ •ํ•ด์•ผํ•จ

์ •๋ณด ํš๋“ Information gain

  • ํ˜„์žฌ ๋…ธ๋“œ์˜ ๋ถˆ์ˆœ๋„์™€ ์ž์‹๋…ธ๋“œ์˜ ๋ถˆ์ˆœ๋„์˜ ์ฐจ์ด๋ฅผ ์ •๋ณด ํš๋“Information gain์ด๋ผ๊ณ  ํ•จ
  • ๋ถˆ์ˆœ๋„๊ฐ€ 1์ธ ์ƒํƒœ์—์„œ 0.7์ธ ์ƒํƒœ๋กœ ๋ฐ”๋€Œ์—ˆ๋‹ค๋ฉด ์ •๋ณด ํš๋“์€ 0.3
  • ๊ฒฐ์ • ํŠธ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ •๋ณด ํš๋“์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต์ด ์ง„ํ–‰๋จ
    1. root ๋…ธ๋“œ์˜ ๋ถˆ์ˆœ๋„ ๊ณ„์‚ฐ
    2. ๋‚˜๋จธ์ง€ ์†์„ฑ์— ๋Œ€ํ•ด ๋ถ„ํ•  ํ›„ ์ž์‹๋…ธ๋“œ์˜ ๋ถˆ์ˆœ๋„ ๊ณ„์‚ฐ
    3. ๊ฐ ์†์„ฑ์— ๋Œ€ํ•œ ์ •๋ณด ํš๋“ ๊ณ„์‚ฐ ํ›„, ์ •๋ณด ํš๋“์ด ์ตœ๋Œ€๊ฐ€ ๋˜๋Š” ๋ถ„๊ธฐ์กฐ๊ฑด์„ ์ฐพ์•„ ๋ถ„๊ธฐ
    4. ๋ชจ๋“  Leaf๋…ธ๋“œ์˜ ๋ถˆ์ˆœ๋„๊ฐ€ 0์ด ๋  ๋•Œ๊นŒ์ง€ 2,3 ๋ฐ˜๋ณต ์ˆ˜ํ–‰

๋ถˆ์ˆœ๋„ ํ•จ์ˆ˜ Gini, Entropy

  • ์ง€๋‹ˆ ์ง€์ˆ˜ Gini Index
    • ํด๋ž˜์Šค์˜ ๋น„์œจ์„ ์ œ๊ณฑํ•ด์„œ ๋”ํ•œ ๋‹ค์Œ 1์—์„œ ๋บŒ
    • ์ง€๋‹ˆ ์ง€์ˆ˜์˜ 0.5๋ฉด ๋ถˆ์ˆœ๋„๊ฐ€ ์ตœ๋Œ€ => ํ•œ ๋ฒ”์ฃผ ์•ˆ์— ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๊ฐ€ ์ •ํ™•ํžˆ ๋ฐ˜๋ฐ˜ ์žˆ๋‹ค
    • ๊ณต์‹
      \[\begin{split} I(A) &= 1 - (์Œ์„ฑ ํด๋ž˜์Šค ๋น„์œจ^2 + ์–‘์„ฑ ํด๋ž˜์Šค ๋น„์œจ^2) \\ &= 1 - \textstyle\sum_{k=1}^m p_k^2 \end{split}\]
    • \[p_k^2 = ๋ฒ”์ฃผ ์•ˆ์—์„œ k ํŠน์„ฑ์˜ ๊ด€์ธก์น˜ ๋น„์œจ\]
    • Decision Tree

    • ๋ฒ”์ฃผ ์•ˆ์— ๋นจ๊ฐ„์ƒ‰ ์  10๊ฐœ, ํŒŒ๋ž€์ƒ‰ ์  6๊ฐœ ์žˆ์„ ๋•Œ์˜ ๊ณ„์‚ฐ ์˜ˆ์ œ
      \[\begin{split} I(A) &= 1 - \textstyle\sum_{k=1}^m p_k^2 \\ &= 1 - \left(\cfrac{6}{16}\right)^2 - \left(\cfrac{10}{16}\right)^2 \\ &\approx 0.47 \end{split}\]
  • ์—”ํŠธ๋กœํ”ผ ์ง€์ˆ˜ Entropy Index
    • ๋ถˆ์ˆœ๋„๋ฅผ ์ˆ˜์น˜์ ์œผ๋กœ ๋‚˜ํƒ€๋‚ธ ์ฒ™๋„
    • ์—”ํŠธ๋กœํ”ผ๊ฐ€ ๋†’๋‹ค๋Š” ๊ฒƒ์€ ๋ถˆ์ˆœ๋„๊ฐ€ ๋†’๋‹ค๋Š” ๋œป์ด๊ณ , ์—”ํŠธ๋กœํ”ผ๊ฐ€ ๋‚ฎ๋‹ค๋Š” ๊ฒƒ์€ ๋ถˆ์ˆœ๋„๊ฐ€ ๋‚ฎ๋‹ค๋Š” ๋œป
    • ์—”ํŠธ๋กœํ”ผ๊ฐ€ 1์ด๋ฉด ๋ถˆ์ˆœ๋„๊ฐ€ ์ตœ๋Œ€ => ํ•œ ๋ฒ”์ฃผ ์•ˆ์— ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๊ฐ€ ์ •ํ™•ํžˆ ๋ฐ˜๋ฐ˜ ์žˆ๋‹ค
    • ์—”ํŠธ๋กœํ”ผ๊ฐ€ 0์ด๋ฉด ๋ถˆ์ˆœ๋„๊ฐ€ ์ตœ์†Œ => ํ•œ ๋ฒ”์ฃผ ์•ˆ์— ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ๋งŒ ์žˆ๋‹ค
    • ํด๋ž˜์Šค์˜ ๋น„์œจ์„ ๋ฐ‘์ด 2์ธ ๋กœ๊ทธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณฑํ•จ
    • ๊ณต์‹
      \[\begin{split} E(A) &= - ์Œ์„ฑ ๋น„์œจ * \log_2(์Œ์„ฑ ๋น„์œจ) - ์–‘์„ฑ ๋น„์œจ * \log_2(์–‘์„ฑ ๋น„์œจ) \\ &= - \textstyle\sum_{i=1}^k p_i\log_2(p_i) \end{split}\]
    • Decision Tree

    • ๋ฒ”์ฃผ ์•ˆ์— ๋นจ๊ฐ„์ƒ‰ ์  10๊ฐœ, ํŒŒ๋ž€์ƒ‰ ์  6๊ฐœ ์žˆ์„ ๋•Œ์˜ ๊ณ„์‚ฐ ์˜ˆ์ œ
      \[\begin{split} E(A) &= - \textstyle\sum_{k=1}^m p_k^2 \\ &= - \cfrac{6}{16}\log_2\left(\cfrac{6}{16}\right) - \cfrac{10}{16}\log_2\left(\cfrac{10}{16}\right) \\ &\approx 0.95 \end{split}\]

๊ฐ€์ง€์น˜๊ธฐ Pruning

  • ๊ฒฐ์ • ํŠธ๋ฆฌ๋Š” ์ œํ•œ ์—†์ด ์„ฑ์žฅํ•˜๋ฉด ํ›ˆ๋ จ ์„ธํŠธ์— ๊ณผ๋Œ€์ ํ•ฉOverfitting๋˜๊ธฐ ์‰ฌ์›€
  • ๊ณผ๋Œ€์ ํ•ฉOverfitting ์„ ๋ง‰๊ธฐ ์œ„ํ•œ ์ „๋žต
  • ํŠธ๋ฆฌ์˜ ์ตœ๋Œ€ ๊นŠ์ด(max_depth)๋‚˜ ํ„ฐ๋ฏธ๋„ ๋…ธ๋“œ์˜ ์ตœ๋Œ€ ๊ฐœ์ˆ˜, ํ•œ ๋…ธ๋“œ๊ฐ€ ๋ถ„ํ• ํ•˜๊ธฐ ์œ„ํ•œ ์ตœ์†Œ ๋ฐ์ดํ„ฐ ์ˆ˜(min_sample_split)๋“ฑ ์ œํ•œ ๊ฐ€๋Šฅ

์žฅ์ 

  • ๋ถ„์„๊ฒฐ๊ณผ๊ฐ€ โ€˜์กฐ๊ฑดA && ์กฐ๊ฑดB = ๊ฒฐ๊ณผ Cโ€™๋ผ๋Š” ํ˜•ํƒœ์˜ ๊ทœ์น™์œผ๋กœ ํ‘œํ˜„๋˜๋ฏ€๋กœ ์‰ฝ๊ฒŒ ์ดํ•ดํ•˜๊ณ  ํ™œ์šฉ ํ•  ์ˆ˜ ์žˆ์Œ
  • ๋งŒ๋“ค์–ด์ง„ ๋ชจ๋ธ์„ ์‰ฝ๊ฒŒ ์‹œ๊ฐํ™” ํ•  ์ˆ˜ ์žˆ์–ด์„œ ๋น„์ „๋ฌธ๊ฐ€๋„ ์ดํ•ดํ•˜๊ธฐ ์‰ฌ์›€ (ํŠน์„ฑ์ด ๋ช‡๊ฐœ ์—†์–ด ํŠธ๋ฆฌ๊ฐ€ ๋น„๊ต์  ์ž‘์„ ๋•Œ)
  • ํŠน์„ฑ๊ฐ’์˜ ์Šค์ผ€์ผ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ์˜ํ–ฅ์„ ๋ฏธ์น˜์ง€ ์•Š์•„ ํ‘œ์ค€ํ™” ์ „์ฒ˜๋ฆฌ๋ฅผ ํ•  ํ•„์š”๊ฐ€ ์—†์Œ
  • ์ˆ˜์น˜ํ˜•๊ณผ ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋ฅผ ํ•œ๊บผ๋ฒˆ์— ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์Œ
  • ํŠน์„ฑ ์ค‘์š”๋„ํŠน์„ฑ์ด ๋ถˆ์ˆœ๋„๋ฅผ ๊ฐ์†Œํ•˜๋Š”๋ฐ ๊ธฐ์—ฌํ•œ ์ •๋„๋ฅผ ๊ณ„์‚ฐ ํ•  ์ˆ˜ ์žˆ์Œ
    • ํŠน์„ฑ ์„ ํƒ์— ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Œ

๋‹จ์ 

  • ๊ฐ€์ง€์น˜๊ธฐ๋ฅผ ์‚ฌ์šฉํ•จ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ๊ณผ๋Œ€์ ํ•ฉOverfitting๋˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ์–ด ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์ด ์ข‹์ง€ ์•Š์Œ

=> ์•™์ƒ๋ธ” ๋ฐฉ๋ฒ•์„ ๊ฒฐ์ • ํŠธ๋ฆฌ์˜ ๋Œ€์•ˆ์œผ๋กœ ์‚ฌ์šฉ (๋‹ค์Œ ์žฅ์—์„œ!)