khun84.github.io/parallel_coord_plot.Rmd at master · khun84/khun84.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
---
title: "Parallel Coordinate Plot"
output:
  html_document:
  css: style.css
theme: lumen
highlight: pygments
---

```{r include=F}
opts_chunk$set(cache = T)
```

The required library for this plot are **ggplot2**, **dplyr**, **knitr**, **stringr**, **rmarkdown** and **data.table**.

```{r 'load library', echo=FALSE}
suppressMessages(require(ggplot2))
suppressMessages(require(dplyr));suppressMessages(require(stringr))
suppressMessages(require(knitr))
suppressMessages(require(data.table))
suppressMessages(require(rmarkdown))
```

Populate some dummy datasets.

```{r}
data = data.frame(
        'l1' = sample(LETTERS[1:4], 20, T),
        'l2' = sample(LETTERS[5:10], 20, T),
        'l3' = sample(LETTERS[15:18], 20, T),
        'cluster' = sample(1:4, 20, T))

# a helper function to convert an object name into literal string
var_to_name <- function(x){
  deparse(substitute(x))
}

# print out the data frame
kable(data, caption = var_to_name(data))

```

Next we do some massaging on the datasets

```{r 'data massage'}
# generate id for each row so that we can group the row in line plot
data$id <- seq(nrow(data))

# melt the data frame by keeping id and cluster only
data.melt <- melt(data, id.vars = c('id', 'cluster'))

# the value column is actually the label value in each column, so we rename it as 'label'
names(data.melt)[grepl('value', names(data.melt), fixed = T) %>% which] = 'label'

# change the label to numeric type so that we can plot it in y-axis, it would be our data label
data.melt$value <- as.factor(data.melt$label) %>% as.numeric

dt <- data.table(data.melt)

dt[, variable.x:=as.factor(variable)]

kable(head(dt), caption = var_to_name(dt))

# scale the value group by its column name and use a new column 'scaled' to store it
dt[, scaled.y:=scale(value), variable]

# create the dataset for text labelling
data.label <- dt[, .(variable.x, label, scaled.y)] %>% distinct
```

Start to plot the graph.

```{r 'plotting'}

g = ggplot(dt, aes(x = variable.x, y = scaled.y)) + geom_point(
      aes(color = as.factor(cluster))) + geom_line(aes(group = id, color = as.factor(cluster)))

# vectorize str_wrap function so that we can wrap the text in label column if its too long
wraptxt <- Vectorize(str_wrap, vectorize.args = 'string')

# plot the label
g1 = g + geom_text(data = data.label, aes(x = variable.x, y = scaled.y, label = wraptxt(label, width = 10)),
                                          family = 'Panton', hjust = 'left', nudge_x = .02, size = 3)

g1 + scale_color_discrete(name = 'cluster') + theme_bw() + theme(legend.key = element_rect(color = NA))
```