반응형
5.4 로지스틱 회귀분석을 이용한 문서 분류
5.4.2 라쏘 회귀를 이용한 특성 선택
- 라쏘 회귀: 특성의 계수에 대해 정규화를 하지만 L1 정규화 사용
lasso_clf = LogisticRegression(penalty='l1', solver='liblinear', C=1) # Lasso는 동일한 LogisticRegression을 사용하면서 매개변수로 지정
lasso_clf.fit(X_train_tfidf, y_train) # train data로 학습
print('#Train set score: {:.3f}'.format(lasso_clf.score(X_train_tfidf, y_train)))
print('#Test set score: {:.3f}'.format(lasso_clf.score(X_test_tfidf, y_test)))
# 계수(coefficient) 중에서 0이 아닌 것들의 개수를 출력
print('#Used features count: {}'.format(np.sum(lasso_clf.coef_ != 0)), 'out of', X_train_tfidf.shape[1])
"""
#Train set score: 0.819
#Test set score: 0.724
#Used features count: 437 out of 2000
"""
top10_features(lasso_clf, tfidf, newsgroups_train.target_names)
"""
alt.atheism: bobby, atheism, atheists, islam, religion, islamic, motto, atheist, satan, vice
comp.graphics: graphics, image, 3d, file, computer, hi, video, files, looking, sphere
sci.space: space, orbit, launch, nasa, spacecraft, flight, moon, dc, shuttle, solar
talk.religion.misc: fbi, christian, christians, christ, order, jesus, children, objective, context, blood
"""
5.5 결정트리 등을 이용한 기타 문서 분류 방법
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
tree = DecisionTreeClassifier(random_state=7)
tree.fit(X_train_tfidf, y_train)
print('#Decision Tree train set score: {:.3f}'.format(tree.score(X_train_tfidf, y_train)))
print('#Decision Tree test set score: {:.3f}'.format(tree.score(X_test_tfidf, y_test)))
forest = RandomForestClassifier(random_state=7)
forest.fit(X_train_tfidf, y_train)
print('#Random Forest train set score: {:.3f}'.format(forest.score(X_train_tfidf, y_train)))
print('#Random Forest test set score: {:.3f}'.format(forest.score(X_test_tfidf, y_test)))
gb = GradientBoostingClassifier(random_state=7)
gb.fit(X_train_tfidf, y_train)
print('#Gradient Boosting train set score: {:.3f}'.format(gb.score(X_train_tfidf, y_train)))
print('#Gradient Boosting test set score: {:.3f}'.format(gb.score(X_test_tfidf, y_test)))
"""
#Decision Tree train set score: 0.977
#Decision Tree test set score: 0.536
#Random Forest train set score: 0.977
#Random Forest test set score: 0.685
#Gradient Boosting train set score: 0.933
#Gradient Boosting test set score: 0.696
"""
sorted_feature_importances = sorted(zip(tfidf.get_feature_names_out(), gb.feature_importances_), key=lambda x: x[1], reverse=True)
for feature, value in sorted_feature_importances[:40]:
print('%s: %.3f' % (feature, value), end=', ')
## space: 0.126, graphics: 0.080, atheism: 0.024, thanks: 0.023, file: 0.021, orbit: 0.020, jesus: 0.018, god: 0.018, hi: 0.017, nasa: 0.015, image: 0.015, files: 0.014, christ: 0.010, moon: 0.010, bobby: 0.010, launch: 0.010, christian: 0.010, looking: 0.010, atheists: 0.009, christians: 0.009, fbi: 0.009, 3d: 0.008, you: 0.008, not: 0.008, islamic: 0.007, religion: 0.007, spacecraft: 0.007, flight: 0.007, computer: 0.007, islam: 0.007, ftp: 0.006, color: 0.006, software: 0.005, atheist: 0.005, card: 0.005, people: 0.005, koresh: 0.005, his: 0.005, kent: 0.004, sphere: 0.004,
※ 해당 내용은 <파이썬 텍스트 마이닝 완벽 가이드>의 내용을 토대로 학습하며 정리한 내용입니다.
반응형
'텍스트 마이닝' 카테고리의 다른 글
BOW 기반의 문서 분류 (6) (0) | 2023.07.05 |
---|---|
BOW 기반의 문서 분류 (5) (0) | 2023.07.04 |
BOW 기반의 문서 분류 (3) (0) | 2023.07.02 |
BOW 기반의 문서 분류 (2) (0) | 2023.07.01 |
BOW 기반의 문서 분류 (1) (0) | 2023.06.30 |