Statistics-Based Related-Word Analysis

Python 2020. 3. 4. 12:28

# Load packages
import pandas as pd
import numpy as np
import glob
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from afinn import Afinn
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Read the paths of the first 100 review files
pos_review = glob.glob("d:/deeplearning/textmining/pos/*.txt")[0:100]

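# Collect the first line of each of the 100 files into a list
pos_lines = []
for path in pos_review:
    try:
        # "with" closes the file automatically, even if reading fails
        with open(path, "r") as f:
            pos_lines.append(f.readlines()[0])
    except Exception:
        # Skip files that cannot be opened, decoded, or are empty
        continue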

len(pos_lines)

# Extract words: TF-IDF vectorization with English stop words removed
stop_words = stopwords.words("english")
vec = TfidfVectorizer(stop_words = stop_words)
pos_vector_lines = vec.fit_transform(pos_lines)

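# Compute cosine similarity between words
# Transposing the document-by-word matrix makes each row a word, so the result is word-to-word similarity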
pos_A = pos_vector_lines.toarray()
pos_A = pos_A.transpose()
pos_A_sparse = sparse.csc_matrix(pos_A)
pos_similarity_sparse = cosine_similarity(pos_A_sparse, dense_output = False)
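
The effect of the transpose can be checked on a small example. This is a sketch under assumptions: `toy_docs` is a hypothetical corpus, and the `toy_`-prefixed names are not part of the original code. Words that appear in the same document get a positive similarity, and words that never co-occur get zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only for illustration
toy_docs = ["good movie", "good acting", "boring plot"]

toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy_docs)   # shape: (3 documents, 5 words)

# Transpose so that each row is a word, then compare rows pairwise
toy_word_sim = cosine_similarity(toy_matrix.T, dense_output=False)

toy_vocab = toy_vec.vocabulary_                # maps word -> column index
sim_good_movie = toy_word_sim[toy_vocab["good"], toy_vocab["movie"]]
sim_good_plot = toy_word_sim[toy_vocab["good"], toy_vocab["plot"]]

print("good vs movie:", sim_good_movie)   # the two words share a document
print("good vs plot:", sim_good_plot)     # the two words never co-occur
```

Since `cosine_similarity` accepts sparse input directly, the sketch passes `toy_matrix.T` without converting to a dense array first, which may matter for memory on larger vocabularies.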

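# Build a DataFrame of similarity entries
# Note: the "words" column holds (row index, column index) pairs into the vocabulary, not the words themselves
df = pd.DataFrame(list(pos_similarity_sparse.todok().items()), columns = ["words", "weight"])

# Sort by similarity, highest first
df2 = df.sort_values(by = ["weight"], ascending = False)
df2 = df2.reset_index(drop = True)
# Keep rows whose weight rounds to less than 1; this drops self-pairs (similarity 1) and any other pair that rounds up to 1
df3 = df2.loc[np.round(df2["weight"]) < 1]
df3 = df3.reset_index(drop = True)
df3.head()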

[Source] 잡아라! 텍스트마이닝 with 파이썬, by 서대호, BJ, pp. 120-124

License: Attribution, No Derivatives
