# NLPAnalysis.py
import gensim
import gensim.corpora as corpora
import pyLDAvis
import pyLDAvis.gensim_models
from wordcloud import WordCloud


def topic_modeling(word_tokens):
    """
    Function for topic modeling. Takes a list of word tokens <list> as input and returns a list containing:
    - the model <class 'gensim.models.ldamulticore.LdaMulticore'>,
    - an analysis <class 'pyLDAvis._prepare.PreparedData'>,
    - a wordcloud <class 'wordcloud.wordcloud.WordCloud'>,
    - model_analysis_as_html, intended to hold the analysis as HTML (the visualization is also saved to
      'clusters.html').

    General information about topic models:
    A type of statistical language model used for uncovering hidden structure in a collection of texts.
    More practically and intuitively, you can think of it as a task of:
    - Dimensionality reduction:
        where, rather than representing a text T in its feature space as {Word_i: count(Word_i, T) for Word_i in
        Vocabulary}, you can represent it in a topic space as {Topic_i: Weight(Topic_i, T) for Topic_i in Topics}.
    - Unsupervised learning:
        where it can be compared to clustering. As with the number of clusters in many clustering methods, the
        number of topics is a parameter chosen in advance (see num_topics below). By doing topic modeling, we build
        clusters of words rather than clusters of texts. A text is thus a mixture of all the topics, each having a
        specific weight.
    - Tagging:
        abstract “topics” that occur in a collection of documents and best represent the information in them.
    """
    # Create Dictionary (the whole token list is treated as a single document)
    id2word = corpora.Dictionary([word_tokens])
    # Create Corpus
    texts = [word_tokens]
    # Term Document Frequency: each text becomes a list of (token_id, count) pairs
    corpus = [id2word.doc2bow(text) for text in texts]
    # Number of topics
    num_topics = 3
    # Build LDA model
    model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)
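    # Analyze the LDA model
    model_analysis = pyLDAvis.gensim_models.prepare(model, corpus, id2word)
    # Save as .html file (save_html writes the file and returns None)
    pyLDAvis.save_html(model_analysis, 'clusters.html')
    # Keep the same visualization as an HTML string so it can be returned
    model_analysis_as_html = pyLDAvis.prepared_data_to_html(model_analysis)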
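    # Create wordcloud
    # Note: color_func takes precedence over colormap, so every word is drawn in the same RGB color
    cloud = WordCloud(background_color='white',
                      width=2500,
                      height=1800,
                      max_words=10,
                      colormap='tab10',
                      color_func=lambda *args, **kwargs: (192, 150, 78),
                      prefer_horizontal=1.0)
    # show_topics(formatted=False) returns (topic_id, [(word, weight), ...]) tuples
    topics = model.show_topics(formatted=False)
    # Use the word weights of the first topic only
    topic_words = dict(topics[0][1])
    cloud = cloud.generate_from_frequencies(topic_words, max_font_size=300)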
    return [model,
            model_analysis,
            cloud,
            model_analysis_as_html
            ]
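

# ---------------------------------------------------------------------------
# Illustrative sketch, not part of the original module.
# The topic_modeling docstring describes a text T in its feature space as
# {Word_i: count(Word_i, T)}. The helper below builds that mapping with the
# standard library only; it is roughly the information that doc2bow() encodes
# as (token_id, count) pairs. The helper name `bag_of_words` and the sample
# tokens are assumptions made purely for illustration.
# ---------------------------------------------------------------------------
from collections import Counter


def bag_of_words(tokens):
    """Return {word: count} for a list of tokens (hypothetical helper)."""
    return dict(Counter(tokens))


if __name__ == "__main__":
    # Hypothetical sample input; real input would come from a tokenizer.
    sample_tokens = ["topic", "model", "topic", "text"]
    print(bag_of_words(sample_tokens))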