Machine Learning:: Dimensionality Reduction

Dimensionality Reduction

Used to make models more lightweight.

Benefits include models that are easier to build and data that is easier to inspect.

Feature selection

Keeps the original features and chooses a subset of them
Methods: Best subset selection, Forward stepwise selection, Backward stepwise selection
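
As an illustration of forward stepwise selection, here is a minimal sketch that greedily adds the feature that lowers the residual sum of squares of a least-squares fit the most. The helper names `rss` and `forward_stepwise`, the linear model, and the synthetic data are assumptions made for this example, not something taken from the post.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_stepwise(X, y, k):
    """Greedily pick k feature indices, each time adding the one that lowers RSS most."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (assumed for illustration): only features 1 and 3 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

selected = forward_stepwise(X, y, k=2)
print(selected)
```

Backward stepwise selection would run the same loop in reverse, starting from all features and dropping the one whose removal hurts the fit least.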

Feature extraction

Obtains a new feature space by transforming or projecting the existing features
Can improve performance by mitigating the curse of dimensionality

Principal Component Analysis (PCA)

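An unsupervised linear transformation technique for feature extraction

Can reveal patterns in the data based on the correlations between features
The goal is to find the directions along which the variance is largest
- A direction of maximum variance is a direction that represents the data well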

Procedure

1. Standardize the $d$-dimensional dataset
2. Construct the covariance matrix
3. Decompose the covariance matrix into its eigenvectors and eigenvalues
4. Sort the eigenvalues by decreasing order to rank the corresponding eigenvectors
5. Select $k$ eigenvectors, which correspond to the $k$ largest eigenvalues $(k \le d)$
6. Construct a projection matrix $W$ from the top-k eigenvectors
7. Transform the $d$-dimensional input dataset $X$ using the projection matrix $W$ to obtain the new $k$-dimensional feature subspace
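
The seven steps above can be sketched with NumPy roughly as follows. This is a minimal sketch rather than a reference implementation: the function name `pca`, the sample standard deviation (`ddof=1`), and the random test data are assumptions chosen for the example.

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    # 1. Standardize each feature to mean 0 and standard deviation 1.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # 2. Covariance matrix (d x d).
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigendecomposition; eigh is intended for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. eigh returns ascending eigenvalues, so reorder to decreasing.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5-6. Projection matrix W (d x k) built from the top-k eigenvectors.
    W = eigvecs[:, :k]
    # 7. Transform the data into the k-dimensional subspace.
    return X_std @ W, W, eigvals

# Random data with two strongly correlated features (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

Z, W, eigvals = pca(X, k=2)
print(Z.shape, W.shape)
```

The columns of `W` are orthonormal, and the variance of each column of `Z` equals the corresponding eigenvalue.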

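1. Standardize
$Var(x)$ vs $Var(ax) = a^2Var(x)$: the variance depends heavily on the scale $a$ of each feature, so the features are standardized first.
Method: (original value - mean) / standard deviation, so that each feature has mean 0 and standard deviation 1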
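
A quick numerical check of the scaling effect and of the standardization formula, as a sketch. The helper name `standardize` and the generated data are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)
a = 10.0

# Scaling a feature by a multiplies its variance by a**2.
print(np.var(a * x), a**2 * np.var(x))

def standardize(X):
    """(value - mean) / standard deviation, applied per feature (column)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two features on very different scales.
X = np.column_stack([x, a * x + rng.normal(size=1000)])
X_std = standardize(X)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

After standardization both columns have mean 0 and standard deviation 1, so neither dominates the variance just because of its units.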

2. Covariance Matrix
$$
\Sigma =
\begin{bmatrix}
\sigma_1^2 & \cdots & \sigma_{1d} \\
\vdots & \ddots & \vdots \\
\sigma_{d1} & \cdots & \sigma_d^2
\end{bmatrix}
$$

3. PCA Formulation
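Goal: find the unit direction $a$ that maximizes the variance of the projected data. With $X$ standardized (every column has mean 0) and $n$ samples, $\Sigma = \frac{1}{n-1}X^TX$ and
$$
\sigma_a^2 = \frac{1}{n-1}(Xa)^T(Xa) = \frac{1}{n-1}a^TX^TXa = a^T\Sigma a \quad (\text{where } a^Ta = 1)
$$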
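
The identity $\sigma_a^2 = a^T\Sigma a$ can be checked numerically. The sketch below assumes centered data and the usual $1/(n-1)$ normalization of the sample covariance. The data and the direction `a` are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.5]])
X = rng.normal(size=(500, 3)) @ mix     # correlated features
X = X - X.mean(axis=0)                  # center the data
n = X.shape[0]
cov = X.T @ X / (n - 1)                 # sample covariance matrix

a = np.array([1.0, 2.0, -1.0])
a = a / np.linalg.norm(a)               # enforce a^T a = 1

var_projected = np.var(X @ a, ddof=1)   # variance of the data projected onto a
var_quadratic = a @ cov @ a             # a^T Sigma a
print(var_projected, var_quadratic)
```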

4. Eigen Decomposition
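Writing the constrained problem above with a Lagrange multiplier gives
$$
C = a^T\Sigma a - \lambda(a^Ta - 1)
$$
Differentiating with respect to $a$ and setting the result to zero:
$$
\begin{aligned}
\frac{\partial C}{\partial a} &= 2\Sigma a - 2\lambda a = 0 \\
\therefore (\Sigma - \lambda I)a &= 0
\end{aligned}
$$
Solving $det(\Sigma - \lambda I) = 0$ gives the eigenvalues, and substituting each one back into $(\Sigma - \lambda I)a = 0$ gives the corresponding eigenvector.
Here each eigenvalue is the variance along its eigenvector, and each eigenvector is an axis (a principal direction).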
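
As a sketch of this step, `np.linalg.eigh` can be used on a symmetric covariance matrix to confirm that each eigenpair satisfies $(\Sigma - \lambda I)a = 0$ and that no unit direction gives a larger $a^T\Sigma a$ than the top eigenvalue. The 2x2 matrix and the random directions are assumed values for illustration.

```python
import numpy as np

cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

# Each column a of eigvecs should satisfy (Sigma - lambda I) a = 0.
for lam, a in zip(eigvals, eigvecs.T):
    residual = (cov - lam * np.eye(2)) @ a
    print(lam, a, np.linalg.norm(residual))

# Evaluate a^T Sigma a for many random unit vectors.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
quad = np.einsum("ij,jk,ik->i", dirs, cov, dirs)
print(quad.max(), eigvals[-1])
```

The largest value in `quad` stays at or below the largest eigenvalue, which matches the claim that the top eigenvector is the direction of maximum variance.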

Eigen Decomposition practice problem walkthrough
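
As an illustration, here is the decomposition worked by hand for a small covariance matrix. The matrix is a made-up example chosen for easy arithmetic, not the exercise from the original post.

$$
\Sigma =
\begin{bmatrix}
2 & 1 \\
1 & 2
\end{bmatrix}
$$
$$
\begin{aligned}
det(\Sigma - \lambda I) &= (2 - \lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = (\lambda - 3)(\lambda - 1) = 0 \\
\therefore \lambda_1 &= 3, \quad \lambda_2 = 1
\end{aligned}
$$
For $\lambda_1 = 3$, $(\Sigma - 3I)a = 0$ gives $-a_1 + a_2 = 0$, and for $\lambda_2 = 1$, $(\Sigma - I)a = 0$ gives $a_1 + a_2 = 0$. Normalizing so that $a^Ta = 1$:
$$
a^{(1)} = \frac{1}{\sqrt{2}}
\begin{bmatrix}
1 \\
1
\end{bmatrix},
\quad
a^{(2)} = \frac{1}{\sqrt{2}}
\begin{bmatrix}
1 \\
-1
\end{bmatrix}
$$
The first axis carries a variance of 3 out of a total of $3 + 1 = 4$.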

Explained Variance Ratio

$$
\frac{\lambda_i}{\sum_{j=1}^d \lambda_j}
$$
The ratio of each eigenvalue to the sum of all eigenvalues
The choice of $k$ determines how much of the information (variance) is retained.
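
A minimal sketch of using this ratio to choose $k$. The eigenvalues and the 90% threshold are hypothetical values picked for the example.

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, sorted in decreasing order.
eigvals = np.array([2.5, 1.0, 0.4, 0.1])

ratio = eigvals / eigvals.sum()      # lambda_i / sum_j lambda_j
cumulative = np.cumsum(ratio)        # variance retained by the first k components

# Smallest k whose components together retain at least 90% of the variance.
threshold = 0.90
k = int(np.searchsorted(cumulative, threshold) + 1)
print(ratio, cumulative, k)
```

Plotting `cumulative` against the number of components is a common way to see where adding more components stops paying off.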
