rexamples/topicmodeling_stm.Rmd at master · rosemm/rexamples · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
# Structural topic modeling

How to use the stm package (from the [stm vignette](https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf)):
![stm workflow](/Users/TARDIS/Dropbox/RClub/stm_package_workflow.png)

## The topic modeling part
Find topics in your data!
```{r, eval=FALSE}
install.packages("stm", "SnowballC") # probably new
install.packages("dplyr", "tidyr") # if you don't have them already
```

Getting my data ready. Note that your data should be a data frame where each row has one document. You should have a column called "documents" that has all of the text. Any other variables can be added as additional columns.

```{r my_data, message=FALSE, error=FALSE}
df <- read.table("/Users/TARDIS/Documents/STUDIES/context_word_seg/utt_orth_phon_KEY.txt", header=1, sep="\t", stringsAsFactors=F, quote="", comment.char ="") # this gets used for word-lists contexts, it will get over-written for other contexts
library(dplyr); library(tidyr)
data <- df %>%
  select(-phon) %>%
  extract(col=utt, into=c("child", "age_weeks"), regex="^([[:alpha:]]{2})([[:digit:]]{2})")
data$temp <- gl(n=ceiling(nrow(data)/30), k=30)[1:nrow(data)]
data <- group_by(data, child, age_weeks, temp) %>%
  summarize(documents=paste(orth, collapse=" ")) %>%
  select(-temp)
```

Now bring it to stm for processing there.

```{r stm_processing}
library(stm)
processed <- textProcessor(data$documents, metadata = data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta) # removes infrequent terms depending on user-set parameter lower.thresh (the minimum number of documents a word needs to appear in in order for the word to be kept within the vocabulary)
```

Take a look at those messages. I left everything at default here, but you may or may not want to, depending on your research question.

Now let's get some topics!! (This might take a looooong time to run.)
```{r topic_model0, cache=TRUE}
fit0 <- stm(out$documents, # the documents
            out$vocab, # the words
            K = 10, # 10 topics
            max.em.its = 75, # set to run for a maximum of 75 EM iterations
            data = out$meta, # all the variables (we're not actually including any predictors in this model, though)
            init.type = "Spectral")
```
Note: "The default is init.type = "LDA" but in practice researchers on personal computers with vocabularies less that 10,000 can utilize the spectral initialization successfully." And spectral initialization is better, so you should do that. If you have a very large dataset, read up on how to correctly use the LDA option in the [stm vignette](https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf).

Yay! Let's look:
```{r}
labelTopics(fit0)
```
For more information on FREX and high probability rankings, see Roberts et al. (2013, 2015, 2014); Lucas et al. (2015). For more information on score, see the [lda R package](http://cran.r-project.org/web/packages/lda/lda.pdf). For more information on lift, see Taddy (2013).

```{r}
plot.STM(fit0, type = "labels")
plot.STM(fit0, type = "summary")

# topicCorr(fit0) # lots of output :)
round(topicCorr(fit0)$cor, 2) # just the correlations between topics
```

### How many topics?
The right answer depends on your corpus, the size of your documents, and your research question. Even when you have a clear research question, it still may be tricky to decide how many topics will make the most sense. Especially when you're starting out and you don't necessarily know how the chracteristics of your corpus/documents will affect your topic models, you might want to try out several different numbers of topics.
You can run several models with different numbers of topics and compare them to help you figure out what makes the most sense using the searchk() function. This code runs a model with 7 topics and another with 10.
```{r, cache=TRUE}
ntopics <- searchK(out$documents, out$vocab, K = c(7, 10), data = meta)
plot(ntopics)
```

## The structural part
The idea is that you can use the idea of topic modeling within a larger model that includes predictors, etc. that you think may change what the topics are like or how often particular topics show up.
There are two different flavors of predictors to think about: those that change how much each topic shows up (prevalence), and those that change

For example (from the [stm vignette](https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf)), maybe you have lots of posts from a handful of political blogs, and you have the blogs coded as either liberal or conserative, and you know the date each post was written.
You might expect that some topics are more likely to be discussed more often on liberal blogs vs. conservative ones. That's a difference in prevalence.
Or you might think that some topics will be more or less likely as time goes on (e.g. maybe you expect an increase in posts about a certain candidate as he or she wins more primaries). Again, that's prevalence.

If you put in no predictors at all, the model reduces to a fast implementation of the [Correlated Topic Model (Blei and Lafferty 2007)](http://projecteuclid.org/euclid.aoas/1183143727). In other words, it's regular ol' LDA, but the topics themselves are allowed to covary.

```{r topic_model1, cache=TRUE}
fit1 <- stm(out$documents, # the documents
            out$vocab, # the words
            content =~ child,
            prevalence =~ s(age_weeks), # use spline smoothing, so it's not forcing linear relationships between age and topic prevalence
            K = 10, # 10 topics
            max.em.its = 75, # set to run for a maximum of 75 EM iterations
            data = out$meta, # all the variables (we're not actually including any predictors in this model, though)
            init.type = "Spectral")
```


```{r, eval=FALSE}
sageLabels(fit1) # only works with content covariates in the model
plot(fit1)
```