👦 Naeil Baeum Camp / TIL (Today I Learned)

TIL_220513_Machine Learning Project Basics

MVMT 2023. 1. 1. 00:57

Week 2 of machine learning..

 

It's already so spicy my tongue is numb..

 

Feels like the ache just keeps piling up..

 

I tried to just dip a toe in, and now I'm getting schooled.

 

It keeps bouncing right off the edge of my understanding..

 

Trying to hold on to my runaway sanity..

 

Fighting today too!!!


Logistic regression

  • An algorithm that classifies an input into the more probable category.
  • It models the relationship between the dependent variable and the independent variables concretely.
  • Unlike linear regression, it does not predict a continuous value; it is used when the dependent variable is categorical.
  • It uses regression to predict the probability of belonging to a category.

 

  • Independent variable: the input or cause.
  • Dependent variable: the outcome or effect.
  • Categorical data: data expressed as discrete categories, such as the binary values 0 and 1.
  • Binomial logistic regression: when the dependent variable has two categories (binary). EX) weather (hot, cold)
  • Multinomial logistic regression: when the dependent variable has three or more categories. EX) weather (rainy, sunny, cloudy)

 

Example) Predicting whether a student passes a course based on hours studied

Binary classification

  • Problem)
    • Because the fit is a straight line, the predicted pass probability goes negative for anyone who studies less than 2 hours.
    • Accuracy is low.
  • Solution) Use logistic regression.
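To see the problem concretely, here is a minimal sketch (the study hours and pass labels below are made up) of fitting an ordinary straight line to 0/1 labels; the "probabilities" it predicts escape the [0, 1] range:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: hours studied vs. whether the course was passed (0/1).
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
passed = np.array([0, 0, 0, 1, 1, 1])

# Fit a plain straight line to the 0/1 labels.
line = LinearRegression().fit(hours, passed)

print(line.predict([[0.0]]))  # below 0: not a valid probability
print(line.predict([[8.0]]))  # above 1: not a valid probability
```

Logistic regression fixes this by passing the line's output through a sigmoid, which squashes every prediction into (0, 1).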

 

๋จธ์‹ ๋Ÿฌ๋‹ : ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€

  • The Sigmoid function used in the regression model
  • It expresses the S-curve as a function.
  • The x-axis is the score (with every condition taken into account) and the y-axis is the result; as the outputs show, they fall only into 0 and 1.
  • In other words, the goal of a logistic regression model is to predict whether an event happens (1) or does not happen (0).
  • Examples)
    • When a fire breaks out: if oxygen drops by X, 'the person dies or does not die.'
    • When a traffic accident occurs: with an impact force of X, 'the person is seriously injured or not.'
    • In the Titanic disaster: given condition X, 'the person dies or does not die.'
Tistory, "[AI][Concept] What is Logistic Regression, and why do we use the Sigmoid function?", https://itstory1592.tistory.com/8 (accessed 2022-05-13)
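The S-curve itself is only a couple of lines of code; a minimal numpy sketch:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real-valued score z into the (0, 1) range -> the S-curve.
    return 1 / (1 + np.exp(-z))

print(sigmoid(0.0))    # 0.5, the midpoint of the curve
print(sigmoid(-10.0))  # close to 0 -> the event does not happen
print(sigmoid(10.0))   # close to 1 -> the event happens
```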

 

 

Binary logistic regression:

  • Uses sigmoid to separate outputs into 0 and 1, and uses cross-entropy to measure the gap between the predicted and true probability distributions and minimize it.
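A hedged sketch of what binary cross-entropy computes (the labels and predicted probabilities below are made up): the closer the predicted probabilities are to the true 0/1 labels, the smaller the loss.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred):
    # Average negative log-likelihood of the true 0/1 labels
    # under the predicted probabilities.
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
good = binary_crossentropy(y_true, np.array([0.9, 0.1, 0.8]))  # close to labels
bad = binary_crossentropy(y_true, np.array([0.4, 0.6, 0.3]))   # far from labels
print(good, bad)  # good is the smaller loss
```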

 

Multinomial logistic regression:

  • Uses softmax instead of sigmoid, with the same cross-entropy loss: it measures the gap between the probability distributions and minimizes it.
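A sketch of softmax with three made-up class scores (rainy, sunny, cloudy): unlike sigmoid, it turns a whole vector of scores into one probability distribution.

```python
import numpy as np

def softmax(z):
    # Converts raw class scores into probabilities that sum to 1.
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # rainy, sunny, cloudy (made-up scores)
probs = softmax(scores)
print(probs)        # the highest score gets the highest probability
print(probs.sum())  # 1.0
```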

 


 

Support vector machine (SVM)

  • A problem of telling classes apart is a classification problem.
  • A model that solves a classification problem is a classifier.
  • The axes of the graph are called features; the red vector drawn (in the lecture figure) relative to each cat and dog is the support vector, and the distance around that vector is the margin.
  • We train the model so that the margin gets wider; that is what makes a good support vector machine.
  • When exceptional cases appear:
    • The usual fix is to add more features and retrain.
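As a sketch of the idea (the two features and all values below are made-up stand-ins for the cat/dog example), scikit-learn's SVC with a linear kernel finds the maximum-margin boundary and exposes the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up features per animal (say, ear length and body size);
# class 0 = cat, class 1 = dog.
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
              [4.0, 4.2], [4.5, 4.0], [3.8, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear')  # maximizes the margin between the classes
clf.fit(X, y)

print(clf.predict([[1.1, 1.0], [4.2, 4.1]]))  # -> [0 1]
print(clf.support_vectors_)  # the boundary-defining points
```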

 

๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ ๊ฐ„๋‹จ ์†Œ๊ฐœ

k-Nearest neighbors (KNN)

  • An algorithm that classifies a new point by looking at the k other points within a certain distance of it (its nearest neighbors) and taking the majority class among them.
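A minimal sketch with scikit-learn (toy one-feature data): the new point takes the majority class among its k = 3 nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two well-separated clusters of a single feature.
X = np.array([[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3
knn.fit(X, y)

# The 3 nearest neighbors of 0.7 are all class 0; of 5.2, all class 1.
print(knn.predict([[0.7], [5.2]]))  # -> [0 1]
```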

 

Decision tree

  • Infers by repeatedly asking yes/no questions, like a game of twenty questions.
  • Performs better than you might expect, so it is often used for simple problems.

 

Random forest

  • A model that combines many decision trees.
  • Each decision tree makes its own prediction, and the final answer is decided at the end by majority voting.

 


 

Preprocessing

  • ๋„“์€ ๋ฒ”์œ„์˜ ๋ฐ์ดํ„ฐ ์ •์ œ ์ž‘์—…์„ ๋œป.
  • ํ•„์š”์—†๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ง€์šฐ๊ณ  ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ๋งŒ์„ ์ทจํ•˜๋Š” ๊ฒƒ.
  • null ๊ฐ’์ด ์žˆ๋Š” ํ–‰์„ ์‚ญ์ œํ•˜๋Š” ๊ฒƒ.
  • ์ •๊ทœํ™”(Normalization), ํ‘œ์ค€ํ™”(Standardization) ๋“ฑ์˜ ๋งŽ์€ ์ž‘์—…๋“ค์„ ํฌํ•จ.

Normalization

  • ๋ฐ์ดํ„ฐ๋ฅผ 0๊ณผ 1์‚ฌ์ด์˜ ๋ฒ”์œ„๋ฅผ ๊ฐ€์ง€๋„๋ก ๋งŒ๋“ฌ
  • ๊ฐ™์€ ํŠน์„ฑ์˜ ๋ฐ์ดํ„ฐ ์ค‘์—์„œ ๊ฐ€์žฅ ์ž‘์€ ๊ฐ’์„ 0์œผ๋กœ ๋งŒ๋“ค๊ณ , ๊ฐ€์žฅ ํฐ ๊ฐ’์„ 1๋กœ ๋งŒ๋“ฌ

Standardization

Standardization rescales the data's distribution to the scale of a standard normal distribution.

That is, it makes the mean of the data 0 and the standard deviation 1.

 

# Making the mean 0 first centers the data at 0 (zero-centered), and making the standard deviation 1 then leaves the data neatly normalized. Standardized data generally trains faster (converges to the minimum sooner) and is less likely to get stuck in local minima.
# Nearly 99% of models use normalized inputs.

Week 2 homework

 

Let's diagnose diabetes from features like age, blood pressure, and insulin levels!

 

Let's implement logistic regression ourselves.

 

Practice

 

1) 

import os
os.environ['KAGGLE_USERNAME'] = 'movvvv' # username
os.environ['KAGGLE_KEY'] = '5af4a2875c7c2abf94db7964afa4633b' # key

!kaggle datasets download -d kandij/diabetes-dataset
!unzip diabetes-dataset.zip

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, SGD
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler # StandardScaler for the standardization preprocessing step

df = pd.read_csv('diabetes2.csv')

df.head(5)

x_data = df.drop(columns=['Outcome'], axis=1)
x_data = x_data.astype(np.float32)

y_data = df[['Outcome']]
y_data = y_data.astype(np.float32)

scaler = StandardScaler()
x_data_scaled = scaler.fit_transform(x_data)

x_train, x_val, y_train, y_val = train_test_split(x_data_scaled, y_data, test_size=0.2, random_state=2021)

model = Sequential([
  Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.01), metrics=['acc']) # 'lr' is deprecated; use learning_rate

model.fit(
    x_train,
    y_train,
    validation_data=(x_val, y_val), # with validation data supplied, validation runs automatically after every epoch
    epochs=20 # note: the argument name is 'epochs' (plural)!
)

 

2) Diabetes prediction: about 78% accuracy

 

https://colab.research.google.com/drive/1GrDwUsUOzVHmWevSFHkeduEItv4Qw11Z?usp=sharing 

 


 

3) Feedback

  • x_data = df.drop(columns=['Outcome'], axis=1)
    x_data = x_data.astype(np.float32)

    y_data = df[['Outcome']]
    y_data = y_data.astype(np.float32)

    scaler = StandardScaler()
    x_data_scaled = scaler.fit_transform(x_data)

 

I couldn't properly understand these parts..

 

I'll make sure to note them down and ask questions until I get the answers I need.

 

 

 

 

 

Source: Sparta Coding Club