Hyperparameter Tuning: Let's Automate It with AutoML! (w/ pycaret)

ai-creator 2021. 11. 27. 13:05

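ㅁ What is Pycaret?

- pycaret is a Python library for AutoML (automated machine learning).

- It is built on the scikit-learn package and supports a range of tasks, including Classification, Regression, Clustering, and Anomaly Detection.

- The official documentation is thorough, and a model can be built in just a few lines of code, which makes the library easy to put to use.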

ㅁ Pycaret API

(Reference) pycaret documentation

https://pycaret.readthedocs.io/en/latest/api/classification.html

※ At the time of writing, pycaret supports only scikit-learn 0.23.2, so pin that version before installing pycaret.

$ pip install scikit-learn==0.23.2
$ pip install pycaret

ㅁ (Hands-on) Bike demand forecasting

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pycaret.regression import *

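1. Data loading and EDA

# Load the training and test sets
train = pd.read_csv('data/bike/train.csv')
test = pd.read_csv('data/bike/test.csv')

# Box plot of the target column, count
sns.boxplot(x=train['count'])
plt.show()

# Histogram of the target column
sns.histplot(train['count'])
plt.show()

Result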

2. Data cleaning

2-1) Log transform

# Log-transform the target column (vectorized; same values as applying np.log row by row)
train['log_cnt'] = np.log(train['count'])
sns.histplot(train['log_cnt'])

Result
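
Because the model is later trained on log_cnt, its predictions come out on the log scale and need np.exp to get back to rental counts. A minimal sketch of the round trip; the count values are made up for illustration.

```python
import numpy as np

# Made-up rental counts, for illustration only
counts = np.array([1, 16, 40, 977])

log_cnt = np.log(counts)    # same transform as train['log_cnt']
restored = np.exp(log_cnt)  # inverse transform, back to the count scale

print(log_cnt.round(3))
print(restored.round(1))
```

Note that np.log(0) is -inf. If a dataset can contain zero counts, np.log1p with np.expm1 as its inverse is the usual alternative.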

2-2) Deriving features from datetime

# Work on a copy so the original train DataFrame is left unchanged
new_df = train.copy()
new_df['datetime'] = pd.to_datetime(new_df['datetime'])
# The .dt accessor extracts each date part without a row-by-row apply
new_df['month'] = new_df['datetime'].dt.month
new_df['hour'] = new_df['datetime'].dt.hour
new_df['day_of_week'] = new_df['datetime'].dt.day_name()
new_df['year'] = new_df['datetime'].dt.year

# feature selection (log_cnt is the target; the raw count column is left out)
feature_list = ['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed',
                'log_cnt', 'month', 'hour', 'year', 'day_of_week']
final_df = new_df[feature_list]
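
To see what the derived columns look like, here is a small sketch on a two-row toy DataFrame. The timestamps are made up; only the column names follow the code above.

```python
import pandas as pd

# Two made-up timestamps in the same format as the datetime column
toy = pd.DataFrame({'datetime': ['2011-01-01 05:00:00', '2012-12-19 23:00:00']})
toy['datetime'] = pd.to_datetime(toy['datetime'])

# Same derived features as above, via the .dt accessor
toy['month'] = toy['datetime'].dt.month
toy['hour'] = toy['datetime'].dt.hour
toy['day_of_week'] = toy['datetime'].dt.day_name()
toy['year'] = toy['datetime'].dt.year

print(toy)
```

day_of_week is a string column (e.g. 'Saturday'), so it is a categorical feature rather than a numeric one.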

3. Setting up the training data

- setup() initializes the whole environment that the later steps rely on.

- This example passes the input features and specifies the target column.

# To also handle multicollinearity, add: remove_multicollinearity=True, multicollinearity_threshold=0.6
reg = setup(final_df, target='log_cnt')
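
To get a feel for what a multicollinearity threshold of 0.6 means, the sketch below lists feature pairs whose absolute correlation exceeds it. This only illustrates the idea and is not pycaret's internal implementation; the helper name highly_correlated_pairs and the toy values are hypothetical.

```python
import pandas as pd

def highly_correlated_pairs(df, threshold=0.6):
    """Return (col_a, col_b, |corr|) for every column pair above the threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if value > threshold:
                pairs.append((cols[i], cols[j], round(value, 3)))
    return pairs

# Made-up values: temp and atemp move together, windspeed does not follow them
toy = pd.DataFrame({
    'temp': [9.8, 14.0, 22.1, 30.3, 25.4, 12.3],
    'atemp': [14.4, 16.7, 25.8, 34.1, 29.5, 15.2],
    'windspeed': [0.0, 19.0, 7.0, 13.0, 6.0, 15.0],
})
pairs = highly_correlated_pairs(toy, threshold=0.6)
print(pairs)
```

A pair such as temp and atemp carries nearly the same information, which is the kind of redundancy the remove_multicollinearity option is meant to address.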

4. Model training and evaluation

# Train and cross-validate the available models; the best one is returned
models = compare_models()

Result
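
Conceptually, compare_models() cross-validates a set of estimators and ranks them by a metric. Below is a rough sketch of that idea in plain scikit-learn, on synthetic data and with three hand-picked estimators. The candidate list and the data are assumptions for illustration, not what pycaret runs internally.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (stand-in for final_df)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

candidates = {
    'dummy': DummyRegressor(),  # always predicts the mean
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(random_state=0),
}

ranking = []
for name, estimator in candidates.items():
    # scikit-learn reports negated RMSE so that larger is better
    scores = cross_val_score(estimator, X, y, cv=5,
                             scoring='neg_root_mean_squared_error')
    ranking.append((name, -scores.mean()))

# Lowest cross-validated RMSE first
ranking.sort(key=lambda item: item[1])
for name, rmse in ranking:
    print(f'{name:>6}: RMSE = {rmse:.2f}')
```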

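Note) Training a specific model

- create_model(): trains one specific ML model that you choose.

model = create_model('lightgbm')  # shows the cross-validation results (fold=10)
tuned = tune_model(model, optimize='RMSLE', n_iter=1000)  # 1000 search iterations, scored by RMSLE
final_model = finalize_model(tuned)  # refit on the full dataset, including the hold-out set

Result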
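
The n_iter argument is the number of hyperparameter candidates to try. As far as I know, tune_model() defaults to a random search through scikit-learn, so the sketch below shows that idea with RandomizedSearchCV. Treat the default as an assumption; the estimator, search space, and data here are made up and much smaller than a real run.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (stand-in for final_df)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hypothetical search space: 5 x 5 = 25 possible combinations
param_distributions = {
    'max_depth': [2, 3, 4, 6, 8],
    'min_samples_leaf': [1, 2, 5, 10, 20],
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,  # sample 10 of the 25 combinations
    cv=3,
    scoring='neg_root_mean_squared_error',
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best candidate
```

A larger n_iter covers more of the search space but takes proportionally longer, which is the trade-off behind n_iter=1000 above.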

Checking the hyperparameters of the tuned model

final_model

Result

LGBMRegressor(bagging_fraction=0.9, bagging_freq=5, boosting_type='gbdt',
              class_weight=None, colsample_bytree=1.0, feature_fraction=0.8,
              importance_type='split', learning_rate=0.1, max_depth=-1,
              min_child_samples=61, min_child_weight=0.001, min_split_gain=0,
              n_estimators=180, n_jobs=-1, num_leaves=200, objective=None,
              random_state=8972, reg_alpha=0.0001, reg_lambda=0.001,
              silent='warn', subsample=1.0, subsample_for_bin=200000,
              subsample_freq=0)
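
Since the returned model follows the scikit-learn estimator interface, individual values can also be read programmatically with get_params() instead of from the printed repr. A minimal sketch using a scikit-learn estimator as a stand-in, so that it runs without lightgbm; the stand-in and the chosen parameter names are assumptions for illustration.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in for final_model; two values copied from the output above
stand_in = GradientBoostingRegressor(n_estimators=180, learning_rate=0.1)

# get_params() returns the hyperparameters as a plain dict
params = stand_in.get_params()
for name in ('n_estimators', 'learning_rate', 'max_depth'):
    print(name, '=', params[name])
```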

5. Feature importance of the final model

plot_model(final_model, plot='feature')

Result
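
The same kind of ranking can be built by hand from a fitted tree model's feature_importances_ attribute. The sketch below uses a scikit-learn decision tree on synthetic data where only hour drives the target. Note that scikit-learn trees measure impurity reduction, whereas the LGBMRegressor above uses importance_type='split' (split counts), so the numbers are not directly comparable.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: the target depends on hour only, windspeed is pure noise
X = pd.DataFrame({
    'hour': rng.integers(0, 24, size=300),
    'windspeed': rng.uniform(0, 40, size=300),
})
y = 3.0 * X['hour'] + rng.normal(0, 0.5, size=300)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```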

License: Attribution-NonCommercial-NoDerivatives