muvis is a visualization and analysis toolkit for multivariate datasets. To use this package, you will need the R statistical computing environment (version 3.0 or later).
muvis can be installed through github:
# To set BioC as a repo setRepositories(ind = c(1, 2)) devtools::install_github("bAIo-lab/muvis") library(muvis)
The National Health and Nutrition Examination Surveys (NHANES) is a program of studies about health and nutrition for US residents. We examined the functionality of
muvis on NHANES 2005-2006 dataset which contains 7449 variables and 10,348 samples. For more details about variable names and samples visit website- https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/25504/summary
Based on the number of missing values, we selected 161 variables including one ID, 74 continuous, and 86 categorical variables (having two to fifteen levels) and 4461 individuals (samples) aged from 20 to 85 years, including about 1% missing values. Complete list of 161 variables is available in
NHANES dataset of the package.
data("NHANES") # Set ID variable to NULL NHANES$SEQN <- NULL
data_preproc function with
levels = 15 for preprocessing (imputing missing values and specifying categorical and continuous variables) of
detect.outliers option is used to exclude outliers for each variable.
data <- data_preproc(NHANES, levels = 15, detect.outliers = T, alpha = .5) kableExtra::kable(head(data)) %>% kable_styling() %>% scroll_box(width = "100%", height = "200px")
graph_vis function to visualize the graphs in this doc. The function also can be used to find communities of an arbitrary graph object.
We construct a Gaussain Graphical Model (GGM) for continuous variables using
ggm function. In this example, we construct it by intersecting
sin algorithms (
levels is set to default
NULL as the data has been preprocessed using
Community number of the first 10 nodes is shown and the largest connected component of the estimated graph is visualized by
nhanes_ggm <- ggm(data, significance = 0.05, rho = 0.15, community = TRUE, methods = c("glasso", "sin"), plot = F, levels = NULL) nhanes_ggm$communities[1:10]
BMXWT BMXHT BMXBMI BMXLEG BMXARML BMXARMC BMXWAIST BMXTHICR 1 1 1 1 1 1 1 1 RIDAGEYR DBD091 18 2
# Visualizing the largest connected component require(igraph) grph_clustrs <- clusters(nhanes_ggm$graph) new_ggm <- induced.subgraph(nhanes_ggm$graph,V(nhanes_ggm$graph)[which(grph_clustrs$membership == which.max(grph_clustrs$csize))]) ggm_vis <- graph_vis(new_ggm, plot = T)
The causal (directed) network of continuous variables is constructed using
dgm function with parameter
dtype = "gaussian". The largest connected component of the estimated graph is shown.
nhanes_dgm <- dgm(data, dtype = "gaussian", alpha = 1e-15, plot = F) # Find the largest connected component grph_clustrs <- clusters(nhanes_dgm$graph) new_dgm <- induced.subgraph(nhanes_dgm$graph, V(nhanes_dgm$graph)[which(grph_clustrs$membership == which.max(grph_clustrs$csize))]) # Visualize the graph dgm_vis <- graph_vis(new_dgm, plot = F, directed = T) dgm_vis$network
min_forest function we estimate the minimal forest with
BIC method and detect communities in the graph. The first ten nodes with the highest betweenness centrality are shown. The estimated minimal forest and two of its communities are also illustrated.
mf <- min_forest(data, stat = "BIC", community = T, plot = F, levels = NULL) mf$betweenness[1:10]
HUQ050 PAD200 PAQ100 PAQ180 HIQ011 RIAGENDR DIQ190B RIDRETH1 6264.000 5820.083 5508.500 5450.000 5450.000 5441.191 4858.167 4160.654 HUQ010 HSD010 3831.000 2799.000
mf.val <- graph_vis(mf$graph, directed = F, community = T, betweenness = T, plot = T, plot.community = T, plot.community.list = c(1,2))
In order to show the functionality of minimal forest in clustering, we select a subsample of size 200 and apply
min_forest function (with parameter
community = T) on the subsampled dataset. Therefore the transposed dataset is passed to the funciton.
t_nhanes <- as.data.frame(sapply(as.data.frame(t(data[1:200, ])), function(x) as.numeric(as.character(x)))) cluster_mf <- min_forest(t_nhanes, plot = T, community = T)
dim_reduce function with
umap methods to reduce dimensionality of the aforementioned subpopulation. The points are colored according to their communities within the minimal forest.
communities <- cluster_mf$communities communities <- communities[match(c(1:200), as.integer(names(communities)))] ## Using 'umap' method ump <- dim_reduce(data[1:200,], method = "umap", annot1 = as.factor(communities), annot1.name = "minimal forest\n communities") ## Using 'tsne' method tsn <- dim_reduce(data[1:200,], method = "tsne", annot1 = as.factor(communities), annot1.name = "minimal forest\n communities") ## Using cowplot to plot with shared legend require(cowplot) require(ggplot2) leg <- get_legend(ump + theme(legend.position = "bottom")) plt <- plot_grid(ump + theme(legend.position = "none"), tsn + theme(legend.position = "none")) plot_grid(plt, leg, ncol = 1, rel_heights = c(1, .2))
Focusing on the variable
PAD590 (TV usage) we select two groups of samples: (i) The participants who watch TV less than an hour (g1) and (ii) those who watch more than 5 hours (g2) a day.
We calculate variable-wise Kullback-Leibler divergence between these two groups using
VKL function (with
permute = 100 to find p-values of the resulted variables).
g1 <- which(data$PAD590 == 1) g2 <- which(data$PAD590 == 6) KL <- VKL(data, group1 = g1, group2 = g2, permute = 100) KL[2:6, ]
KL variable p.value HSD010 0.2581593 HSD010 0.01 PAQ520 0.2193065 PAQ520 0.01 HUQ010 0.2157753 HUQ010 0.01 RIDAGEYR 0.2094270 RIDAGEYR 0.01 LBXALC 0.2066657 LBXALC 0.01
VVKL function with
permute = 100, we find the most important (categorical) variables violating the linear relationship between
LBXTC (total cholesterol) and
LBXVIE (vitamin E). The result suggests
DSD010 (taking dietary supplements) and
DRQSPREP (type of table salt used) as variables with the highest KL-divergence.
KL <- VVKL(data[, 75:160], var1 = data$LBXTC, var2 = data$LBXVIE, plot = T, var1.name = 'LBXTC', var2.name = 'LBXVIE', permute = 100, levels = NULL) kableExtra::kable(head(KL$kl, n = 5)) %>% kable_styling() %>% scroll_box(width = "300px", height = "300px")
levels = 15 to visualize a (pair of) variable(s) in the dataset.
PAD600 (number of hours using computer last month. 0: less than hour, 1: one hour, …, 5: five hour, 6: None):
plot_assoc(data, vars = c("PAD600"), levels = 15, interactive = F)
Density plot of
plot_assoc(data, vars = c("RIDAGEYR"), levels = 15, interactive = F)
LBXTHG (total mercury amount in blood. ug/L) for different answers to
DRD360 (fish eaten during past 30 days. 1: Yes, 2: No, 3: Refused, 4: Don’t know):
signif <- mf$significance signif[signif$edges.from == 'DRD360' & signif$edges.to == 'LBXTHG', ]
edges.from edges.to statistics p.value 105 DRD360 LBXTHG 385.9683 1.574062e-05
plot_assoc(data, vars = c("LBXTHG", "DRD360"), levels = 15, interactive = F)
DSD010 (taking any dietary supplements. 1: yes, 2: no, 3: refused, 4: don’t know) and
SMD410 (smoking in the household 1: yes, 2: no):
signif[signif$edges.from == 'DSD010' & signif$edges.to == 'SMD410', ]
edges.from edges.to statistics p.value 242 DSD010 SMD410 -101.8203 0.5359423
plt <- plot_assoc(data, vars = c("DSD010", "SMD410"))
Scatter plot of
LBXHGB (hemoglobin amount in blood. g/dL) and
LBXSIR (refrigerated iron in blood. ug/dL):
signif[signif$edges.from == 'LBXHGB' & signif$edges.to == 'LBXSIR', ]
edges.from edges.to statistics p.value 60 LBXHGB LBXSIR 802.3598 3.924962e-15
plot_assoc(data, vars = c("LBXHGB", "LBXSIR"), levels = 15, interactive = F)