
Introduction to Artificial Intelligence and Machine Learning, Lecture 2: Implementing Regression

H_erb Salt 2020. 7. 28. 15:51


Implementation 1. Linear regression

  • A linear regression model that uses only the first-order term of x
  • Of the 13 attributes, only the first attribute is used as the feature variable (recall from the lecture that the very first column is a dummy column fixed to 1!)

  • $\hat{\theta} = \arg\min_{\theta}\,(f - \hat{f})^2 \;\Rightarrow\; \hat{\theta} = (X^TX)^{-1}X^TY$ (see the sketch right after this list)
  • y_est (= x_temp · θ): predictions computed from the θ obtained above
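As a quick illustration of the closed-form solution above, here is a minimal sketch on made-up data (the names x_demo, y_demo, theta_demo are illustrative and not part of the notebook cells below); np.linalg.solve is shown as a numerically safer alternative to the explicit inverse:

import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(50, 1))
X_demo = np.hstack([np.ones((50, 1)), x_demo])            # design matrix [1, x]
y_demo = 3.0 + 2.0 * x_demo + rng.normal(0, 0.5, size=(50, 1))

# theta = (X^T X)^{-1} X^T y  -- the normal equation
theta_demo = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo
# same solution without forming the explicit inverse
theta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
print(theta_demo.ravel(), theta_solve.ravel())            # both close to [3, 2]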
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
pd.set_option('display.max_columns', 500)
pd.options.display.float_format = '{:.5f}'.format
In [2]:
x=pd.read_csv('X.csv', header=None)
y=pd.read_csv('Y.csv', header=None)
x
Out[2]:
  0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 0.00632 18.00000 2.31000 0 0.53800 6.57500 65.20000 4.09000 1 296 15.30000 396.90000 4.98000
1 1 0.02731 0.00000 7.07000 0 0.46900 6.42100 78.90000 4.96710 2 242 17.80000 396.90000 9.14000
2 1 0.02729 0.00000 7.07000 0 0.46900 7.18500 61.10000 4.96710 2 242 17.80000 392.83000 4.03000
3 1 0.03237 0.00000 2.18000 0 0.45800 6.99800 45.80000 6.06220 3 222 18.70000 394.63000 2.94000
4 1 0.06905 0.00000 2.18000 0 0.45800 7.14700 54.20000 6.06220 3 222 18.70000 396.90000 5.33000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 1 0.06263 0.00000 11.93000 0 0.57300 6.59300 69.10000 2.47860 1 273 21.00000 391.99000 9.67000
502 1 0.04527 0.00000 11.93000 0 0.57300 6.12000 76.70000 2.28750 1 273 21.00000 396.90000 9.08000
503 1 0.06076 0.00000 11.93000 0 0.57300 6.97600 91.00000 2.16750 1 273 21.00000 396.90000 5.64000
504 1 0.10959 0.00000 11.93000 0 0.57300 6.79400 89.30000 2.38890 1 273 21.00000 393.45000 6.48000
505 1 0.04741 0.00000 11.93000 0 0.57300 6.03000 80.80000 2.50500 1 273 21.00000 396.90000 0.00000

506 rows × 14 columns

In [3]:
x = x.values
y = y.values
In [4]:
x_temp = x[:, 0:2]

theta = np.dot(np.dot(np.linalg.inv(np.dot(np.transpose(x_temp), x_temp)), np.transpose(x_temp)), y)
y_est = np.dot(x_temp, theta)
 

np.linalg.lstsq() takes both the matrix A and the vector b as arguments and returns

  1. the solution x of the least-squares problem,
  2. the residual sum of squares resid,
  3. the rank rank, and
  4. the singular values s.

When the number of unknowns equals the number of equations and the matrix A is invertible, the least-squares solution coincides with the solution of the linear system, so lstsq() can also be used to solve systems of linear equations, as the small example below shows.
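For example, on a tiny square, invertible system (a minimal sketch with made-up A and b, not taken from the notebook's data), lstsq returns the exact solution together with the other three values:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

sol, resid, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(sol)              # [2. 3.], identical to np.linalg.solve(A, b)
print(resid, rank, sv)  # resid is empty for an exact fit; rank = 2; singular values of A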

In [5]:
np.linalg.lstsq(x_temp, y, rcond=None)

# rcond
# Cut-off ratio for small singular values of a.
# For the purposes of rank determination,
# singular values are treated as zero if they are smaller than rcond times the largest singular value of a.
Out[5]:
(array([[24.03310378],
        [-0.41519   ]]),
 array([36275.52515523]),
 2,
 array([209.87398774,  20.71756301]))
In [6]:
# m0, c0 = argmin |Y - (m0 + c0 * x_temp[:, 1])|^2      (line fitted to the raw data)
# m1, c1 = argmin |Y_est - (m1 + c1 * x_temp[:, 1])|^2  (line fitted to the predictions)

m0, c0 = np.linalg.lstsq(x_temp, y, rcond=None)[0]
m1, c1 = np.linalg.lstsq(x_temp, y_est, rcond=None)[0]
 

Implementation 2. Polynomial regression

  • A polynomial regression model with terms up to degree 8
  • new_x: new_x[i] = $[1,\ x_i,\ x_i^2,\ x_i^3,\ \dots,\ x_i^8]$ (an equivalent construction with np.vander is sketched right after this list)
  • new_theta: the parameters that minimize the squared error
  • new_y_est: predictions computed with new_theta
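As a cross-check, the same design matrix can be produced in one call with np.vander (a minimal sketch, not one of the notebook's cells; it assumes x has already been loaded and converted to a NumPy array as above):

# Equivalent construction of [1, x, x^2, ..., x^8]; increasing=True puts the constant column first
vander_x = np.vander(x[:, 1], N=9, increasing=True)
# vander_x should match the new_x built with the loop in the next cell:
# np.allclose(vander_x, new_x)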
In [7]:
new_x = np.zeros((x.shape[0], 9))
new_x
Out[7]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
In [8]:
new_x[:, 0:2] = x[:, 0:2]

for i in range(2, 9):
    new_x[:, i] = new_x[:, 1] * new_x[:, i-1]

new_theta = np.dot(np.dot(np.linalg.inv(np.dot(np.transpose(new_x), new_x)), np.transpose(new_x)),y)
new_y_est = np.dot(new_x, new_theta)

# m2, c2 = argmin |new_Y_est - (m2 + c2 * x_temp[:, 1])|^2  (line fitted to the polynomial predictions)
m2, c2 = np.linalg.lstsq(x_temp, new_y_est, rcond=None)[0]
 

Implementation 3. Plots

In [9]:
# Plot the results
# The x-axis is the value of the feature variable used; the y-axis is the dependent variable
plt.figure(1, figsize = (17, 5))

# Plot 1
ax1 = plt.subplot(1, 3, 1)
plt.plot(x[:, 1], y, 'ro', markeredgecolor = 'none')
plt.plot(x[:, 1], m0+c0*x[:, 1], 'r-')
plt.plot(x[:, 1], y_est, 'bo', markeredgecolor = 'none')
plt.plot(x[:, 1], m1+c1*x[:, 1], 'b-')
plt.xlabel('Feature Variable', fontsize = 14)
plt.ylabel('Dependent Variable', fontsize = 14)

# Plot 2
ax2 = plt.subplot(1, 3, 2, sharey = ax1)
plt.plot(x[:, 1], y, 'ro', markeredgecolor = 'none')
plt.plot(x[:, 1], new_y_est, 'go', markeredgecolor = 'none')
plt.plot(x[:, 1], m2 + c2*x[:, 1], 'g-')
plt.xlabel('Feature Variable', fontsize = 14)
plt.ylabel('Dependent Variable', fontsize = 14)

# Plot 3
ax3 = plt.subplot(1, 3, 3, sharey = ax2)
plt.plot(x[:, 1], y, 'ro', markeredgecolor = 'none')
plt.plot(x[:, 1], y_est, 'bo', markeredgecolor = 'none')
plt.plot(x[:, 1], new_y_est, 'go', markeredgecolor = 'none')
plt.xlabel('Feature Variable', fontsize = 14)
plt.ylabel('Dependent Variable', fontsize = 14)

plt.show()
 
 

Implementation 4. statsmodels: linear regression

In [10]:
from statsmodels.api import OLS

model1 = OLS(y, x_temp).fit()
p = model1.params
y_pred1 = model1.predict(x_temp)
model1.summary()
Out[10]:
OLS Regression Results
Dep. Variable: y R-squared: 0.151
Model: OLS Adj. R-squared: 0.149
Method: Least Squares F-statistic: 89.49
Date: Tue, 28 Jul 2020 Prob (F-statistic): 1.17e-19
Time: 06:40:53 Log-Likelihood: -1798.9
No. Observations: 506 AIC: 3602.
Df Residuals: 504 BIC: 3610.
Df Model: 1    
Covariance Type: nonrobust    
  coef std err t P>|t| [0.025 0.975]
const 24.0331 0.409 58.740 0.000 23.229 24.837
x1 -0.4152 0.044 -9.460 0.000 -0.501 -0.329
Omnibus: 139.832 Durbin-Watson: 0.713
Prob(Omnibus): 0.000 Jarque-Bera (JB): 295.403
Skew: 1.490 Prob(JB): 7.14e-65
Kurtosis: 5.264 Cond. No. 10.1


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
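Because the first column of x_temp is already all ones, OLS reports it as const. An equivalent and more common statsmodels pattern (a minimal sketch; model1_alt is an illustrative name, not part of the notebook) is to start from the raw feature and add the intercept with sm.add_constant:

import statsmodels.api as sm

X1 = sm.add_constant(x[:, 1])      # prepends a column of ones ('const')
model1_alt = sm.OLS(y, X1).fit()
# model1_alt.params should match model1.params above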
In [11]:
plt.plot(x[:, 1], y, 'ro', markeredgecolor = 'none')
plt.plot(x[:, 1], y_pred1, 'bo', markeredgecolor = 'none')
plt.plot(x[:, 1], p[0] + p[1]*x[:, 1], 'b-')
Out[11]:
[<matplotlib.lines.Line2D at 0x7f5c762d2a50>]
 
 

Implementation 5. statsmodels + scikit-learn: polynomial regression

In [12]:
from sklearn.preprocessing import PolynomialFeatures

polynomial_features = PolynomialFeatures(degree=8)
xp = polynomial_features.fit_transform(x[:, 1:2])

print(xp.shape)
xp
 
(506, 9)
Out[12]:
array([[1.00000000e+00, 6.32000000e-03, 3.99424000e-05, ...,
        6.37239179e-14, 4.02735161e-16, 2.54528622e-18],
       [1.00000000e+00, 2.73100000e-02, 7.45836100e-04, ...,
        4.14887357e-10, 1.13305737e-11, 3.09437968e-13],
       [1.00000000e+00, 2.72900000e-02, 7.44744100e-04, ...,
        4.13067679e-10, 1.12726170e-11, 3.07629717e-13],
       ...,
       [1.00000000e+00, 6.07600000e-02, 3.69177760e-03, ...,
        5.03160559e-08, 3.05720356e-09, 1.85755688e-10],
       [1.00000000e+00, 1.09590000e-01, 1.20099681e-02, ...,
        1.73230980e-06, 1.89843831e-07, 2.08049854e-08],
       [1.00000000e+00, 4.74100000e-02, 2.24770810e-03, ...,
        1.13558522e-08, 5.38380953e-10, 2.55246410e-11]])
In [15]:
model2 = OLS(y, xp).fit()
y_pred2 = model2.predict(xp)
y_pred2.shape
Out[15]:
(506,)
In [16]:
model2.summary()
Out[16]:
OLS Regression Results
Dep. Variable: y R-squared: 0.219
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 19.97
Date: Tue, 28 Jul 2020 Prob (F-statistic): 1.24e-23
Time: 06:41:03 Log-Likelihood: -1777.7
No. Observations: 506 AIC: 3571.
Df Residuals: 498 BIC: 3605.
Df Model: 7    
Covariance Type: nonrobust    
  coef std err t P>|t| [0.025 0.975]
const 25.3876 0.518 48.984 0.000 24.369 26.406
x1 -1.6453 0.989 -1.663 0.097 -3.589 0.298
x2 0.1303 0.294 0.443 0.658 -0.448 0.708
x3 -0.0086 0.034 -0.256 0.798 -0.074 0.057
x4 0.0003 0.002 0.176 0.860 -0.003 0.004
x5 -6.584e-06 5.43e-05 -0.121 0.904 -0.000 0.000
x6 6.879e-08 8.63e-07 0.080 0.936 -1.63e-06 1.76e-06
x7 -3.34e-10 6.99e-09 -0.048 0.962 -1.41e-08 1.34e-08
x8 5.217e-13 2.26e-11 0.023 0.982 -4.39e-11 4.5e-11
Omnibus: 171.205 Durbin-Watson: 0.677
Prob(Omnibus): 0.000 Jarque-Bera (JB): 466.821
Skew: 1.676 Prob(JB): 4.28e-102
Kurtosis: 6.303 Cond. No. 5.32e+14


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.32e+14. This might indicate that there are
strong multicollinearity or other numerical problems.
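The huge condition number comes from raising a raw-scale feature to the 8th power. One common mitigation, not done in this notebook, is to standardize the feature before the polynomial expansion; a hedged sketch (xs, xp_scaled, model2_scaled are illustrative names):

from sklearn.preprocessing import StandardScaler

xs = StandardScaler().fit_transform(x[:, 1:2])              # zero mean, unit variance
xp_scaled = PolynomialFeatures(degree=8).fit_transform(xs)  # expand the scaled feature
model2_scaled = OLS(y, xp_scaled).fit()
# model2_scaled.summary() should report a much smaller condition number,
# while the fitted curve stays essentially the same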
In [17]:
plt.plot(x[:, 1], y, 'ro', markeredgecolor = 'none')
plt.plot(x[:, 1], y_pred2, 'go', markeredgecolor = 'none')
Out[17]:
[<matplotlib.lines.Line2D at 0x7f5c751974d0>]
 