Authors: statCompTeam@bAIo-lab
Version: 0.1.0
License: GPL-3
A collection of tools to preprocess, analyze, and visualize multivariate datasets, including both continuous and categorical variables, e.g., biological assays, social surveys, questionnaires, etc.
R (>= 3.0)
visNetwork, gRbase, gRim, purrr, gRapHD, permute, graph, SIN, glasso, igraph, stats, entropy, dplyr, limma, smallvis, highcharter, utils, methods, qgraph, pcalg, Rtsne, ggExtra, ggplot2, ggthemes, heatmaply, scales, plotly, AnomalyDetection, ggplotify
packagedocs
Specify categorical and continuous variables and impute the missing values.
data_preproc(data, is.cat = NULL, levels = 5, detect.outliers = FALSE, alpha = 0.5)
levels: used to decide which variables are categorical when is.cat is NULL. (default = 5)
## Using levels
data("NHANES")
df <- data_preproc(NHANES, levels = 15)
## Using is.cat
require(datasets)
data("mtcars")
l <- logical(11)
l[c(8, 9)] <- TRUE
df <- data_preproc(mtcars, is.cat = l)
## Detect outliers
df <- data_preproc(NHANES, levels = 15, detect.outliers = TRUE, alpha = 0.4)
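The levels rule used in the examples above can be pictured as a count of distinct values per column. The following is a minimal sketch in base R, assuming that a column is treated as categorical when it has at most `levels` distinct values. The helper name guess_categorical is hypothetical and is not part of the package, and the package's actual rule may differ.

```r
# Hypothetical helper (not part of the package): flag columns whose number of
# distinct non-missing values is at most `levels`.
guess_categorical <- function(data, levels = 5) {
  vapply(data, function(col) length(unique(na.omit(col))) <= levels, logical(1))
}

data("mtcars")
is_cat <- guess_categorical(mtcars, levels = 5)

# Names of the columns that would be treated as categorical
names(is_cat)[is_cat]
```

The result is a logical vector with one entry per column, which is the same shape as the is.cat argument built by hand in the mtcars example.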
Elyas Heidari, Vahid Balazadeh
Fits a Gaussian Graphical Model to a continuous-valued dataset, employing a subset of methods from stepwise AIC, stepwise BIC, stepwise significance test, partial correlation thresholding, edgewise significance test, or glasso. Also visualizes the fitted graphical model.
ggm(data, methods = c("glasso"), community = TRUE, betweenness = TRUE, plot = FALSE, levels = NULL, ...)
levels: NULL by default, on the assumption that data has been preprocessed using data_preproc and the categorical variables are specified. If it is set, data_preproc is run first to specify categorical and continuous variables.
The function combines the methods to construct the model, that is, the edge set is the intersection of all edge sets each of which is found by a method. The package gRim is used to implement AIC, BIC, and stepwise significance test. The method glasso from the package glasso is used to provide a sparse estimation of the inverse covariance matrix.
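The intersection rule can be illustrated without any of the packages involved. The sketch below is an assumption-laden stand-in in base R: it builds one edge set by thresholding partial correlations at 0.2 and another by an edgewise Fisher z test at the 0.05 level, then keeps only the edges present in both. The thresholds, the variable subset of mtcars, and the object names are chosen for illustration and are not taken from the package.

```r
data("mtcars")
x <- as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
n <- nrow(x)
p <- ncol(x)

# Partial correlations from the inverse correlation matrix
K <- solve(cor(x))
pcor <- -cov2cor(K)
diag(pcor) <- 0

# Method 1 (illustrative): partial correlation thresholding
edges_thresh <- abs(pcor) > 0.2

# Method 2 (illustrative): edgewise test with Fisher's z transform,
# conditioning on the remaining p - 2 variables
z <- atanh(pcor) * sqrt(n - (p - 2) - 3)
pval <- 2 * pnorm(-abs(z))
edges_test <- pval < 0.05
diag(edges_test) <- FALSE

# Combined model: an edge survives only if every method selected it
edges_both <- edges_thresh & edges_test
which(edges_both & upper.tri(edges_both), arr.ind = TRUE)
```

Because the combination is an intersection, adding more methods can only remove edges, so the combined graph is never denser than the sparsest single-method graph.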
Højsgaard, S., Edwards, D., & Lauritzen, S. (2012). Graphical Models with R. Springer US. https://doi.org/10.1007/978-1-4614-2299-0
Friedman, J., Hastie, T., & Tibshirani, R. (2007). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432–441. https://doi.org/10.1093/biostatistics/kxm045
Abreu, G. C. G., Edwards, D., & Labouriau, R. (2010). High-Dimensional Graphical Model Search with the gRapHD R Package. Journal of Statistical Software, 37(1). https://doi.org/10.18637/jss.v037.i01
data("NHANES")
## Using raw data
## No need to choose the continuous variables (They will be detected automatically)
glasso_ggm <- ggm(data = NHANES[1:1000, ], methods = c("glasso"), levels = 15)
## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
data$SEQN <- NULL
glasso_sin_ggm <- ggm(data = data[1:1000, 1:74], methods = c("glasso", "sin"),
plot = TRUE, rho = 0.2, significance = 0.03)
Elyas Heidari
Fits a minimal forest to data and visualizes it.
min_forest(data, stat = "BIC", community = TRUE, betweenness = TRUE, plot = FALSE, levels = NULL)
levels: NULL by default, on the assumption that data has been preprocessed using data_preproc and the categorical variables are specified. If it is set, data_preproc is run first to specify categorical and continuous variables.
The function is a wrapper for the bnlearn package implementing several algorithms, including constraint-based algorithms (i.e., Max-Min Parents and Children, Semi-Interleaved HITON-PC, and Grow-Shrink), score-based algorithms (i.e., Hill-Climbing and Tabu Search), hybrid algorithms (i.e., Max-Min Hill-Climbing), and local discovery algorithms (i.e., Max-Min Parents and Children and ARACNE). If more than one algorithm is used, the function combines all of them and returns a graph based on the combination. The graph is constructed based on the strength of associations calculated by bootstrapping.
Chow, C.K., and Liu, C.N. (1968) Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, Vol. IT-14, 3:462-7.
Edwards, D., de Abreu, G.C.G. and Labouriau, R. (2010). Selecting high-dimensional mixed graphical models using minimal AIC or BIC forests. BMC Bioinformatics, 11:18.
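The idea of a minimal BIC forest in the references above can be sketched for purely Gaussian variables in base R. This is an illustration under stated assumptions, not the package's code: each candidate edge is scored by its BIC gain, taken here as -n * log(1 - r^2) - log(n) for a pairwise correlation r, and edges are added greedily in decreasing order as long as the gain is positive and no cycle is created. The variable subset of mtcars and all object names are hypothetical.

```r
data("mtcars")
x <- as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
n <- nrow(x)
r <- cor(x)

# All unordered pairs of variables and their BIC gains
pairs <- t(combn(ncol(x), 2))
gain <- -n * log(1 - r[pairs]^2) - log(n)

# Kruskal-style greedy search; `comp` tracks the component of each node
comp <- seq_len(ncol(x))
forest <- matrix(integer(0), ncol = 2)
for (k in order(gain, decreasing = TRUE)) {
  if (gain[k] <= 0) break              # remaining edges would worsen the BIC
  i <- pairs[k, 1]
  j <- pairs[k, 2]
  if (comp[i] != comp[j]) {            # adding the edge creates no cycle
    forest <- rbind(forest, c(i, j))
    comp[comp == comp[j]] <- comp[i]   # merge the two components
  }
}

# Selected edges, reported by variable name
data.frame(from = colnames(x)[forest[, 1]], to = colnames(x)[forest[, 2]])
```

Stopping at non-positive gains is what allows the result to be a forest with several components rather than a single spanning tree.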
data("NHANES")
## Using raw data
mf <- min_forest(data = NHANES[1:1000, ], stat = "BIC", plot = TRUE, levels = 5)
## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
mf <- min_forest(data = data[1:1000, ], stat = "BIC", plot = FALSE)
Elyas Heidari
Reduce dimensionality with a method in {tsne, umap, pca}.
dim_reduce(data, method = "pca", annot1 = NULL, annot1.name = "annot1", annot2 = NULL, annot2.name = "annot2", levels = NULL)
levels: NULL by default, on the assumption that data has been preprocessed using data_preproc and the categorical variables are specified. If it is set, data_preproc is run first to specify categorical and continuous variables.
data("NHANES")
## Using different methods on the raw data
df <- NHANES[sample(nrow(NHANES), 500), ]
plt_pca <- dim_reduce(df, method = "pca", levels = 15)
plt_tsne <- dim_reduce(df, method = "tsne", annot1 = df$BMXBMI, annot1.name = "BMI", levels = 15)
plt_umap <- dim_reduce(df, method = "umap", annot1 = df$LBXTC, annot1.name = "Total Cholesterol",
annot2 = as.factor(df$RIAGENDR), annot2.name = "Gender", levels = 15)
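For the pca method, the kind of projection involved can be sketched with stats::prcomp. This is only an illustration on the built-in iris data, assuming centered and scaled inputs and a single categorical annotation. It does not reproduce the package's plotting, and the object names are hypothetical.

```r
data("iris")
num <- iris[, 1:4]

# Principal components of the centered, scaled numeric columns
pca <- prcomp(num, center = TRUE, scale. = TRUE)

# Two-dimensional embedding plus an annotation column for colouring points
scores <- as.data.frame(pca$x[, 1:2])
scores$annot <- iris$Species

# Share of the total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
head(scores)
```

The annotation column plays the role that annot1 plays in the examples above: it carries no weight in the projection and is only used to label the embedded points.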
Elyas Heidari
A wrapper of the fci function in the pcalg package. Estimates a simplified Partial Ancestral Graph (PAG) using the FCI algorithm.
dgm(data, alpha = 1e-05, dtype = "gaussian", community = TRUE, betweenness = TRUE, plot = FALSE, levels = NULL)
levels: NULL by default, on the assumption that data has been preprocessed using data_preproc and the categorical variables are specified. If it is set, data_preproc is run first to specify categorical and continuous variables.
There is no specific distribution needed for the data. The parameter dtype determines the data type provided as input to the function. However, it is highly recommended to use the "gaussian" data type for both continuous and ordinal discrete data.
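FCI is driven by conditional independence tests, and for Gaussian data such a test is commonly a Fisher z test on a partial correlation. The sketch below shows that test in base R and how a strict alpha such as the default 1e-05 affects the decision. The helper ci_pvalue is hypothetical and written for illustration. It is not a pcalg or package function.

```r
# Hypothetical helper: p-value for "x independent of y given S" under a
# Gaussian assumption, using the partial correlation and Fisher's z transform.
ci_pvalue <- function(data, x, y, S = character(0)) {
  vars <- c(x, y, S)
  K <- solve(cor(data[, vars]))
  r <- -K[1, 2] / sqrt(K[1, 1] * K[2, 2])
  z <- atanh(r) * sqrt(nrow(data) - length(S) - 3)
  2 * pnorm(-abs(z))
}

data("mtcars")
alpha <- 1e-05

# Marginal test: is mpg independent of hp?
p_marginal <- ci_pvalue(mtcars, "mpg", "hp")

# Conditional test: is mpg independent of hp once wt is accounted for?
p_given_wt <- ci_pvalue(mtcars, "mpg", "hp", S = "wt")

c(marginal = p_marginal < alpha, given_wt = p_given_wt < alpha)
```

With a very small alpha, independence is rejected only for strong dependencies, so fewer edges survive and the estimated graph tends to be sparser.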
D. Colombo and M.H. Maathuis (2014). Order-independent constraint-based causal structure learning. Journal of Machine Learning Research 15 3741-3782.
D. Colombo, M. H. Maathuis, M. Kalisch, T. S. Richardson (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann. Statist. 40, 294-321.
M. Kalisch and P. Buehlmann (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm, JMLR 8 613-636.
P. Spirtes, C. Glymour and R. Scheines (2000). Causation, Prediction, and Search, 2nd edition, MIT Press.
data("NHANES")
## Using raw data
## Using "gaussian" method for continuous data
gaussian_dgm <- dgm(data = NHANES[1:1000, ], dtype = "gaussian", levels = 10)
## Using "discrete" method for categorical data
discrete_dgm <- dgm(data = NHANES[1000:1200, ], dtype = "discrete", levels = 5)
## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
data$SEQN <- NULL
prep_gauss_dgm <- dgm(data = data[1:1000, ], plot = TRUE)
Elyas Heidari
Uses min_forest community detection and selects a number of variables from each community as representatives.
find_repres(data, ratio = 0.1, weighted = FALSE, levels = NULL)
levels: NULL by default, on the assumption that data has been preprocessed using data_preproc and the categorical variables are specified. If it is set, data_preproc is run first to specify categorical and continuous variables.
The function uses min_forest to detect communities of variables. It sorts the nodes of each community by degree and selects an equal number from each, based on the ratio parameter. If weighted is TRUE, it selects a different number of nodes from each community based on community size.
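The selection rule can be sketched on a toy graph summary. The code below assumes that node degrees and community memberships are already known, and it uses a hypothetical helper, pick_repres, whose rounding choices are illustrative assumptions rather than the package's exact behaviour.

```r
# Toy inputs: degree of each node and the community it belongs to
degree    <- c(a = 4, b = 1, c = 2, d = 3, e = 1, f = 2, g = 1, h = 5, i = 1, j = 2)
community <- c(a = 1, b = 1, c = 1, d = 1, e = 1, f = 1, g = 2, h = 2, i = 2, j = 2)

# Hypothetical helper: pick the highest-degree nodes of each community
pick_repres <- function(degree, community, ratio = 0.1, weighted = FALSE) {
  groups <- split(names(degree), community)
  total <- ceiling(ratio * length(degree))   # overall number of representatives
  if (weighted) {
    # Share the total across communities in proportion to their sizes
    k <- pmax(1, round(total * lengths(groups) / length(degree)))
  } else {
    # Same number of representatives from every community
    k <- rep(max(1, ceiling(total / length(groups))), length(groups))
  }
  picked <- Map(function(nodes, k) {
    head(nodes[order(degree[nodes], decreasing = TRUE)], k)
  }, groups, k)
  unlist(picked, use.names = FALSE)
}

pick_repres(degree, community, ratio = 0.2)
pick_repres(degree, community, ratio = 0.5, weighted = TRUE)
```

In the unweighted call each community contributes its single highest-degree node, while the weighted call gives the larger community more representatives.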
data("NHANES")
## Using raw data
repres_vars <- find_repres(data = NHANES[1:1000, ], ratio = .2, levels = 10)
## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
## With weighted = TRUE
repres_vars <- find_repres(data = data[1:1000, ], weighted = TRUE)
Vahid Balazadeh, Elyas Heidari
Converts the graph to an igraph object, finds communities and plots it using qgraph package.
graph_vis(graph, directed = F, community = T, betweenness = T, plot = F, ...)
## Visualize Harman23.cor covariance matrix
require(datasets)
data("Harman23.cor")
gv <- graph_vis(Harman23.cor$cov, plot = TRUE, plot.community = TRUE, plot.community.list = c(1, 2))
Elyas Heidari, Vahid Balazadeh
Calculates variable-wise Kullback-Leibler divergence between the two groups of samples.
VKL(data, group1, group2, permute = 0, levels = NULL)
levels: NULL by default, on the assumption that data has been preprocessed using data_preproc and the categorical variables are specified.
The function helps users find the variables that diverge most between two groups defined by different states of one specific variable. For instance, within a dataset of health measurements, we may be interested in finding the variables most important in the occurrence of cardiovascular disease. The function can also carry out a permutation test to calculate a p-value for each variable.
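A variable-wise divergence with a permutation test can be sketched in base R for categorical variables. The helpers kl_div and kl_perm_test below are hypothetical and written for illustration. The smoothing constant, the one-sided p-value formula, and the use of mtcars are assumptions, and the package may compute the divergence differently.

```r
# Hypothetical helper: KL divergence between the empirical distributions of a
# categorical variable in two groups; `eps` avoids log(0) for empty levels.
kl_div <- function(x1, x2, eps = 1e-6) {
  lev <- union(unique(x1), unique(x2))
  p <- as.numeric(table(factor(x1, levels = lev))) + eps
  q <- as.numeric(table(factor(x2, levels = lev))) + eps
  p <- p / sum(p)
  q <- q / sum(q)
  sum(p * log(p / q))
}

# Hypothetical helper: permutation p-value obtained by shuffling group labels
kl_perm_test <- function(x, g1, g2, permute = 100) {
  observed <- kl_div(x[g1], x[g2])
  pooled <- c(g1, g2)
  null <- replicate(permute, {
    shuffled <- sample(pooled)
    kl_div(x[shuffled[seq_along(g1)]], x[shuffled[-seq_along(g1)]])
  })
  c(kl = observed, p.value = (1 + sum(null >= observed)) / (1 + permute))
}

set.seed(1)
data("mtcars")
g1 <- which(mtcars$am == 1)
g2 <- which(mtcars$am == 0)

# One column per variable, ordered from most to least divergent
res <- sapply(mtcars[, c("cyl", "gear", "vs")], kl_perm_test, g1 = g1, g2 = g2)
res[, order(res["kl", ], decreasing = TRUE)]
```

Ranking variables by divergence and reading the p-values alongside separates variables that differ strongly between the groups from those whose divergence is compatible with random group assignment.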
data("NHANES")
## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
data$SEQN <- NULL
# Construct two groups of samples
g1 <- which(data$PAD590 == 1)
g2 <- which(data$PAD590 == 6)
# Set permute to calculate p.values
kl <- VKL(data, group1 = g1, group2 = g2, permute = 100, levels = NULL)
## Using raw data
kl <- VKL(NHANES, group1 = g1, group2 = g2, permute = 0, levels = 15)
Elyas Heidari
Calculates variable-wise Kullback-Leibler divergence between the two groups of samples which violate the linear relationship between two continuous variables.
VVKL(data, var1, var2, permute = 0, levels = NULL, plot = F, var1.name = "var1", var2.name = "var2")
levels: NULL by default, on the assumption that data has been preprocessed using data_preproc and the categorical variables are specified.
data("NHANES")
## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
data$SEQN <- NULL
# Set permute to calculate p.values
kl <- VVKL(data, var1 = data$LBXTC, var2 = data$LBXVIE, permute = 100, levels = NULL)
## Using raw data and plot
kl <- VVKL(NHANES, var1 = data$LBXTC, var2 = data$LBXVIE, permute = 0, levels = 15,
plot = TRUE, var1.name = 'LBXTC', var2.name = 'LBXVIE')
Elyas Heidari
Tests for association between two variables.
test_pair(data, var1, var2, levels = NULL)
levels: NULL by default, on the assumption that data has been preprocessed using data_preproc and the categorical variables are specified.
This provides a wrapper to chisq.test, cor.test, and aov from the stats package to test association between two variables.
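How such a wrapper might choose between the three tests can be sketched as a dispatch on variable types. The helper pair_pvalue below is hypothetical. It assumes that categorical variables are stored as factors and continuous ones as numeric vectors, and it only illustrates the idea, not the package's implementation.

```r
# Hypothetical helper: pick a test from stats according to the variable types
pair_pvalue <- function(x, y) {
  if (is.factor(x) && is.factor(y)) {
    # Two categorical variables: chi-squared test on the contingency table
    suppressWarnings(chisq.test(table(x, y)))$p.value
  } else if (is.numeric(x) && is.numeric(y)) {
    # Two continuous variables: correlation test
    cor.test(x, y)$p.value
  } else {
    # One of each: one-way ANOVA with the factor as the grouping variable
    if (is.factor(x)) {
      tmp <- x
      x <- y
      y <- tmp
    }
    summary(aov(x ~ y))[[1]][["Pr(>F)"]][1]
  }
}

data("mtcars")
df <- transform(mtcars, am = factor(am), cyl = factor(cyl))

p_cat_cat   <- pair_pvalue(df$am, df$cyl)
p_cont_cont <- pair_pvalue(df$mpg, df$wt)
p_cont_cat  <- pair_pvalue(df$mpg, df$cyl)
c(cat_cat = p_cat_cat, cont_cont = p_cont_cont, cont_cat = p_cont_cat)
```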
## Preprocess the data
data("NHANES")
data <- data_preproc(NHANES, levels = 15)
## Find test p.values for:
## One continuous and one categorical variable
cont_cat_test <- test_pair(data, var1 = "LBXTC", var2 = "RIAGENDR")
## Two continuous variables
cont_cont_test <- test_pair(data, var1 = "LBXTC", var2 = "LBXVIE")
## Two categorical variables
cat_cat_test <- test_pair(data, var1 = "DIQ010", var2 = "SMD410")
Elyas Heidari
Tests for association between each pair of variables.
test_assoc(data, vars, levels = NULL, plot = FALSE)
levels: NULL by default, on the assumption that data has been preprocessed using data_preproc and the categorical variables are specified.
This provides a wrapper to chisq.test, cor.test, aov, and p.adjust from the stats package to test association between variables, and a wrapper to the heatmaply package to construct a heatmap.
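The role of p.adjust in a pairwise scan can be sketched with continuous variables only. The code below fills a matrix of cor.test p-values for every pair and then corrects them for multiple testing. The "BH" method and the mtcars columns are illustrative assumptions, since the adjustment method the package uses is not stated here.

```r
data("mtcars")
x <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
vars <- names(x)

# Raw p-values for every unordered pair of variables
pmat <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i < j) {
      pmat[i, j] <- cor.test(x[[i]], x[[j]])$p.value
    }
  }
}

# Adjust once over all distinct pairs, then mirror into the lower triangle
upper <- upper.tri(pmat)
pmat[upper] <- p.adjust(pmat[upper], method = "BH")
pmat[lower.tri(pmat)] <- t(pmat)[lower.tri(pmat)]
round(pmat, 4)
```

Adjusting only the upper triangle keeps each pair from being counted twice, which would otherwise inflate the number of tests in the correction.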
data("NHANES")
## Using raw data
df <- NHANES[1:1000, ]
test_matrix <- test_assoc(data = df, vars = colnames(df), plot = FALSE, levels = 15)
## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
data$SEQN <- NULL
## Outputs the heatmap too (plot = TRUE)
test_mat_heatmap <- test_assoc(data = data, vars = colnames(data[, 1:20]), plot = TRUE)
Elyas Heidari
Plots one variable or a pair of variables, interactively or non-interactively, using the ggplot2 and highcharter packages.
plot_assoc(data, vars, levels = NULL, interactive = FALSE)
levels: NULL by default, on the assumption that data has been preprocessed using data_preproc and the categorical variables are specified.
(Plots interactively if interactive = TRUE.)
## Preprocess the data
data("NHANES")
data <- data_preproc(NHANES, levels = 15)
## Plot (non)interactive for:
## One categorical variable
pt1 <- plot_assoc(data, vars = "PAD600")
pt2 <- plot_assoc(data, vars = "SMD410", interactive = TRUE)
## One continuous variable
pt3 <- plot_assoc(data, vars = "LBXTC")
pt4 <- plot_assoc(data, vars = "BMXBMI", interactive = TRUE)
## One continuous and one categorical variable
pt5 <- plot_assoc(data, vars = c("LBXTC", "RIAGENDR"))
pt6 <- plot_assoc(data, vars = c("BMXBMI", "PAD600"), interactive = TRUE)
## Two continuous variables
pt7 <- plot_assoc(data, vars = c("LBXTC", "BMXBMI"))
pt8 <- plot_assoc(data, vars = c("LBXVIE", "LBXVIC"), interactive = TRUE)
## Two categorical variables
pt9 <- plot_assoc(data, vars = c("SMD410", "PAD600"))
pt10 <- plot_assoc(data, vars = c("PAD600", "SMD410"), interactive = TRUE)
## With raw data
pt11 <- plot_assoc(NHANES, vars = "RIDAGEYR", levels = 15)
Elyas Heidari, Vahid Balazadeh