multivariate analysis and visualization

Authors: statCompTeam@bAIo-lab

Version: 0.1.0

License: GPL-3

Description

A collection of tools to preprocess, analyze, and visualize multivariate datasets, including both continuous and categorical variables, e.g., biological assays, social surveys, questionnaires, etc.

Depends

R (>= 3.0)

Imports

visNetwork, gRbase, gRim, purrr, gRapHD, permute, graph, SIN, glasso, igraph, stats, entropy, dplyr, limma, smallvis, highcharter, utils, methods, qgraph, pcalg, Rtsne, ggExtra, ggplot2, ggthemes, heatmaply, scales, plotly, AnomalyDetection, ggplotify

Suggests

packagedocs

Package Functions

data_preproc

preprocess the data

Specify categorical and continuous variables and impute the missing values.

Usage

data_preproc(data, is.cat = NULL, levels = 5, detect.outliers = FALSE, alpha = 0.5)

Arguments

data: A data.frame object or a matrix.
is.cat: A boolean vector specifies which variables are categorical. (default = NULL)
levels: An integer number indicates the maximum levels of categorical variables. It is used when is.cat in NULL. (default = 5)
detect.outliers: Logical indicating if data outliers should be detected. If TRUE outliers will be treated as NA. Defaults to FALSE.
alpha: A number between (0, 1). Rows where the ratio of the NA values in them is more than alpha will be deleted.

Value

A normalized data.frame object with specified continuous and (or) categorical variables and no missing values.

Examples

## Using levels
data("NHANES")
df <- data_preproc(NHANES, levels = 15)

## Using is.cat
require(datasets)
data("mtcars")
l <- logical(11)
l[c(8, 9)] <- TRUE
df <- data_preproc(mtcars, is.cat = l)

## Detect outliers
df <- data_preproc(NHANES, levels = 15, detect.outliers = TRUE, alpha = 0.4)

Author

Elyas Heidari, Vahid Balazadeh

ggm

construct and visualize Gaussian Graphical Models.

Fit a Gaussian Graphical Model to continuous-valued dataset employing a subset of methods from stepwise AIC, stepwise BIC, stepwise significance test, partial correlation thresholding, edgewise significance test, or glasso. Also visualizes the fitted Graphical Model.

Usage

ggm(data, methods = c("glasso"), community = TRUE, betweenness = TRUE, plot = FALSE, levels = NULL, ...)

Arguments

data: A normalized dataframe or matrix with no missing data of continuous measurements.
methods: A string or list of strings indicate methods used to construct the model. See the details for more information. (default = “glasso”)
community: A logical value to show if the node communities should be detected and colored in the returned graph. (default = TRUE)
betweenness: A logical value to show if the node betweenness measurements should be computed and returned from the function. (default = TRUE)
plot: A logical value to show if the graph should be plotted. (default = FALSE)
levels: An integer value indicating the maximum number of levels of a categorical variable. To be used to distinguish the categorical variable. Defaults to NULL because it is supposed that data has been preprocessed using data_preproc and the categorical variables are specified. If it is set, first will run data_preproc to specify categorical and continuous variables.
…: Any additional arguments described below.

Value

significance: A data.frame containing edges with p.values.
graph: an igraph object of the graphical model.
betweenness: betweenness measurements of each node.
network: a visNetwork plot of the graph.
communities: a named vector indicating the community of each node.

Details

The function combines the methods to construct the model, that is, the edge set is the intersection of all edge sets each of which is found by a method. The package gRim is used to implement AIC, BIC, and stepwise significance test. The method glasso from the package glasso is used to provide a sparse estimation of the inverse covariance matrix.

Additional arguments

threshold: A threshold for partial correlation thresholding method (default = 0.05). To be used only when the method “threshold” is used.
significance: A cutoff for edge significance (default = 0.05). To be used only when the method “significance” is used.
rho: (Non-negative) regularization parameter for glasso (default = 0.1). To be used only when the method “glasso” is used.

References

Højsgaard, S., Edwards, D., & Lauritzen, S. (2012). Graphical Models with R. Springer US. https://doi.org/10.1007/978-1-4614-2299-0

Friedman, J., Hastie, T., & Tibshirani, R. (2007). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432–441. https://doi.org/10.1093/biostatistics/kxm045

Abreu, G. C. G., Edwards, D., & Labouriau, R. (2010). High-Dimensional Graphical Model Search with thegRapHDRPackage. Journal of Statistical Software, 37(1). https://doi.org/10.18637/jss.v037.i01

Examples

data("NHANES")
## Using raw data
## No need to choose the continuous variables (They will be detected automatically)
glasso_ggm <- ggm(data = NHANES[1:1000, ], methods = c("glasso"), levels = 15)

## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
data$SEQN <- NULL
glasso_sin_ggm <- ggm(data = data[1:1000, 1:74], methods = c("glasso", "sin"),
plot = TRUE, rho = 0.2, significance = 0.03)

Author

Elyas Heidari

min_forest

construct and visualize a minimal forest based on Chow-Liu algorithm

Fits a minimal forest to data and visualizes it.

Usage

min_forest(data, stat = "BIC", community = TRUE, betweenness = TRUE, plot = FALSE, levels = NULL)

Arguments

data: A normalized dataframe or matrix with no missing data of continuous and (or) categorical measurements.
stat: Measure to be minimized: LR, AIC, or BIC (the default). The default is BIC. It can also be a user-defined function with the format: FUN (model, dataset, previous, forbEdges); where the parameters are defined as in chStat. The function must return a structure as in chStat.
community: A logical value to show if the node communities should be detected and colored in the returned graph. (default = TRUE)
betweenness: A logical value to show if the node betweenness measurements should be computed and returned from the function. (default = TRUE)
plot: A logical value to show if the graph should be plotted. (default = FALSE)
levels: An integer value indicating the maximum number of levels of a categorical variable. To be used to distinguish the categorical variable. Defaults to NULL because it is supposed that data has been preprocessed using data_preproc and the categorical variables are specified. If it is set, first will run data_preproc to specify categorical and continuous variables.

Value

significanse: A data.frame containing edges with p-statistics and p.values.
summary: a gRapHD object which is the fit model.
graph: an igraph object.
betweenness: betweenness measurements of each node.
network: a visNetwork plot of the graph.
communities: a named vector indicating the community of each node.

Details

The function is a wrapper for bnlearn package implementing several algorithms including Constraint-based algorithms (i.e., Max-Min Parents and Children, Semi-Interleaved HITON-PC, and Grow-Shrink), Score-based algorithms (i.e., Hill-Climbing and Tabu Search), and Hybrid algorithms (i.e., Max-Min Hill-Climbing), and Local Discovery algorithms (i.e, Max-Min Parents and Children and ARACNE). If one uses a more than one algorithm, the function combines all of the algorithms and returns a graph based on the combination. The graph is constructed based on the strength of associations calculated by bootstrapping.

References

Chow, C.K., and Liu, C.N. (1968) Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, Vol. IT-14, 3:462-7.

Edwards, D., de Abreu, G.C.G. and Labouriau, R. (2010). Selecting high- dimensional mixed graphical models using minimal AIC or BIC forests. BMC Bioinformatics, 11:18.

Examples

data("NHANES")
## Using raw data
mf <- min_forest(data = NHANES[1:1000, ], stat = "BIC", plot = TRUE, levels = 5)

## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
mf <- min_forest(data = data[1:1000, ], stat = "BIC", plot = FALSE)

Author

Elyas Heidari

dim_reduce

Dimensionality reduction and visualization.

Reduce dimensionality with a method in {tsne, umap, pca}.

Usage

dim_reduce(data, method = "pca", annot1 = NULL, annot1.name = "annot1", annot2 = NULL, annot2.name = "annot2", levels = NULL)

Arguments

data: A normalized dataframe or matrix with no missing data to be reduced in dimension.
method: A character string as the name of the method. Available values are “pca” (the default), “tsne”, “umap”.
annot1: A vector of continuous or factor values to color samples in the resulted plot (the order of values should be the same as the order of rows in data). Default to NULL.
annot1.name: The name of the variable indicating annot1 vector. Defaults to “annot1”.
annot2: A vector of factor values indicating sample shapes to plot (the order of values should be the same as the order of rows in data). Default to NULL.
annot2.name: The name of the variable indicating annot2 vector. Defaults to “annot2”.
levels: An integer value indicating the maximum number of levels of a categorical variable. To be used to distinguish the categorical variable. Defaults to NULL because it is supposed that data has been preprocessed using data_preproc and the categorical variables are specified. If it is set, first will run data_preproc to specify categorical and continuous variables.

Value

A plot of data points in the 2 dimensional space.

Examples

data("NHANES")

## Using different methods on the raw data
df <- NHANES[sample(nrow(NHANES), 500), ]
plt_pca <- dim_reduce(df, method = "pca", levels = 15)
plt_tsne <- dim_reduce(df, method = "tsne", annot1 = df$BMXBMI, annot1.name = "BMI", levels = 15)
plt_umap <- dim_reduce(df, method = "umap", annot1 = df$LBXTC, annot1.name = "Total Cholesterol",
annot2 = as.factor(df$RIAGENDR), annot2.name = "Gender", levels = 15)

Author

Elyas Heidari

dgm

construct and visualize causal (directed) networks

A wrapper of fci function in pcalg package. Estimates a simplified Partial Ancestral Graph (PAG) using FCI algorithm.

Usage

dgm(data, alpha = 1e-05, dtype = "gaussian", community = TRUE, betweenness = TRUE, plot = FALSE, levels = NULL)

Arguments

data: A normalized dataframe or matrix with no missing data.
alpha: significance level (number in (0,1)) for the individual conditional independence tests. Defaults to 1e-5.
dtype: Either “guassian” for continuous/ordinal discrete data or “discrete” for discrete/nominal data. (default = “gaussian”) See details for more description.
community: A logical value to show if the node communities should be detected and colored in the returned graph. (default = TRUE)
betweenness: A logical value to show if the node betweenness measurements should be computed and returned from the function. (default = TRUE)
plot: A logical value to show if the graph should be plotted. (default = FLASE)
levels: An integer value indicating the maximum number of levels of a categorical variable. To be used to distinguish the categorical variable. Defaults to NULL because it is supposed that data has been preprocessed using data_preproc and the categorical variables are specified. If it is set, first will run data_preproc to specify categorical and continuous variables.

Value

significance: A data.frame containing edges with p.values.
graph: an igraph object of the graphical model.
betweenness: betweenness measurements of each node.
communities: a named vector indicating the community of each node.
network: a visNetwork plot of the graphical model.

Details

There is no specific distribution needed for the data. The parameter dtype will be used for determining the data type to be provided as the input of the function. However, it is highly recommended to use “guassian” data type for both continuous and ordinal discrete data.

References

D. Colombo and M.H. Maathuis (2014).Order-independent constraint-based causal structure learning. Journal of Machine Learning Research 15 3741-3782.

D. Colombo, M. H. Maathuis, M. Kalisch, T. S. Richardson (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann. Statist. 40, 294-321.

M. Kalisch and P. Buehlmann (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm, JMLR 8 613-636.

P. Spirtes, C. Glymour and R. Scheines (2000). Causation, Prediction, and Search, 2nd edition, MIT Press.

Examples

data("NHANES")
## Using raw data
## Using "gaussian" method for continuous data
gaussian_dgm <- dgm(data = NHANES[1:1000, ], dtype = "gaussian", levels = 10)

## Using "discrete" method for categorical data
discrete_dgm <- dgm(data = NHANES[1000:1200, ], dtype = "discrete", levels = 5)

## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
data$SEQN <- NULL
prep_gauss_dgm <- dgm(data = data[1:1000, ], plot = TRUE)

Author

Elyas Heidari

find_repres

Find representative variables.

Uses min_forest community detection and selects a number of variables from each community as representative.

Usage

find_repres(data, ratio = 0.1, weighted = FALSE, levels = NULL)

Arguments

data: A normalized dataframe or matrix with no missing data of continuous and (or) categorical measurements.
ratio: A number between (0, 1) indicating the ratio of data variables to be chosen as representative. Defaults to 0.1.
weighted: A logical indicating if the number of nodes selected from each community is weighted. Defaults to FALSE. See details for more information.
levels: An integer value indicating the maximum number of levels of a categorical variable. To be used to distinguish the categorical variable. Defaults to NULL because it is supposed that data has been preprocessed using data_preproc and the categorical variables are specified. If it is set, first will run data_preproc to specify categorical and continuous variables.

Value

a vector containing names of representative variables.

Details

The function uses min_forest to detect communities of variables. It sorts each community nodes based on their degree and selects an equal number of each based on ratio parameter. If weighted is TRUE it will select different number of nodes from each community based on their sizes.

Examples

data("NHANES")
## Using raw data
repres_vars <- find_repres(data = NHANES[1:1000, ], ratio = .2, levels = 10)

## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
## With \code{weighted = TRUE}
repres_vars <- find_repres(data = data[1:1000, ], weighted = TRUE)

Author

Vahid Balazadeh, Elyas Heidari

graph_vis

graph visualization and community detection

Converts the graph to an igraph object, finds communities and plots it using qgraph package.

Usage

graph_vis(graph, directed = F, community = T, betweenness = T, plot = F, ...)

Arguments

graph: An arbitrary graph object in R.
directed: Set it TRUE when the graph is directed. (default = FALSE)
community: A logical value to show if the node communities should be detected and colored in the returned graph. (default = TRUE)
betweenness: A logical value to show if the node betweenness measurements should be computed and returned from the function. (default = TRUE)
plot: A logical value to show if the graph should be plotted. (default = FALSE)
…: Any additional arguments described below.

Value

graph: an igraph object.
betweenness: betweenness measurements of each node.
network: a visNetwork plot of the graph.
communities: a named vector indicating the community of each node.

Additional arguments

groups: A list that indicates which community each node is. The automatic community detection will be ignored when it is set.
plot.community: Logical indicating if communities should be plotted. Defaults to FALSE.
filename: Name of the plot file without extension. (qgraph function argument)
filetype: A character indicates the file type to save the plots in. (qgraph function argument)
plot.community.list: A list indicates which communities should be plotted. When is not set, will plot all the communities.

Examples

## Visualize Harman23.cor covariance matrix
require(datasets)
data("Harman23.cor")
gv <- graph_vis(Harman23.cor$cov, plot = TRUE, plot.community = TRUE, plot.community.list = c(1, 2))

Author

Elyas Heidari, Vahid Balazadeh

VKL

Calculate Variable-wise Kullback-Leibler divergence

Calculates variable-wise Kullback-Leibler divergence between the two groups of samples.

Usage

VKL(data, group1, group2, permute = 0, levels = NULL)

Arguments

data: A numerical dataframe with no missing value.
group1: A vector of integers. Demonstrates the row indices of group 1.
group2: A vector of integers. Demonstrates the row indices of group 2.
permute: An integer indicating the number of permutations for permutation test. If 0 (the default) no permutation test will be carried out.
levels: An integer value indicating the maximum number of levels of a categorical variable. To be used to distinguish the categorical variable. Defaults to NULL because it is supposed that data has been preprocessed using data_preproc and the categorical variables are specified.

Value

if permute = 0 returns a dataframe including sorted Kullback-Liebler (KL) divergence. if permute > 0 returns a dataframe including p.values and sorted KL divergence.

Details

The function helps users to find out the variables with the most divergence between two groups with different states of one specific variable. For instance, within a dataset of health measurements, we are interested in finding the most important variables in occurring cardiovascular disease. The function is able to carry out the permutation test to calculate the p_value for each variable.

Examples

data("NHANES")
## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
data$SEQN <- NULL
# Construct two groups of samples
g1 <- which(data$PAD590 == 1)
g2 <- which(data$PAD590 == 6)
# Set permute to calculate p.values
kl <- VKL(data, group1 = g1, group2 = g2, permute = 100, levels = NULL)

## Using raw data
kl <- VKL(NHANES, group1 = g1, group2 = g2, permute = 0, levels = 15)

Author

Elyas Heidari

VVKL

Calculate Violating Variable-wise Kullback-Leibler divergence

Calculates variable-wise Kullback-Leibler divergence between the two groups of samples which violate the linear relationship between two continuous variables.

Usage

VVKL(data, var1, var2, permute = 0, levels = NULL, plot = F, var1.name = "var1", var2.name = "var2")

Arguments

data: A numeric dataframe including no missing value
var1: A vector of continuous values indicating the first variable. (the order of values should be the same as the order of rows in data)
var2: A vector of continuous values indicating the second variable. (the order of values should be the same as the order of rows in data)
permute: An integer indicating the number of permutations for permutation test. If 0 (the default) no permutation test will be carried out.
levels: An integer value indicating the maximum number of levels of a categorical variable. To be used to distinguish the categorical variable. Defaults to NULL because it is supposed that data has been preprocessed using data_preproc and the categorical variables are specified.
plot: A logical indicating if the scatter plot of two variables is returned. It highlights the two outlier groups with different colors.
var1.name: The name of the first variable to be shown on the plot.
var2.name: The name of the second variable to be shown on the plot.

Value

If permute = 0 returns a dataframe including sorted Kullback-Liebler (KL) divergence. If permute > 0 returns a dataframe including p.values and sorted KL divergence.

Examples

data("NHANES")
## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
data$SEQN <- NULL
# Set permute to calculate p.values
kl <- VVKL(data, var1 = data$LBXTC, var2 = data$LBXVIE, permute = 100, levels = NULL)

## Using raw data and plot
kl <- VVKL(NHANES, var1 = data$LBXTC, var2 = data$LBXVIE, permute = 0, levels = 15,
plot = TRUE, var1.name = 'LBXTC', var2.name = 'LBXVIE')

Author

Elyas Heidari

test_pair

Test for association between two paired variables.

Tests for association between two variables:

Categorical-Categorical: Using pearson’s chi-squared test (chisq.test function from stats package).
Continuous-Continuous: Using correlation test (cor.test function from stats package).
Categorical-Continuous: Using analysis of variance model (aov function from stats package).

Usage

test_pair(data, var1, var2, levels = NULL)

Arguments

data: A dataframe. It is strongly recommended that the dataframe has no missing data and is preprocessed.
var1: First variable
var2: Second variable
levels: An integer value indicating the maximum number of levels of a categorical variable. To be used to distinguish the categorical variable. Defaults to NULL because it is supposed that data has been preprocessed using data_preproc and the categorical variables are specified.

Value

P.value of test between the two variables.

Details

This provides a wrapper to chisq.test, cor.test, aov from stats package to test association between two variables

Examples

## Preprocess the data
data("NHANES")
data <- data_preproc(NHANES, levels = 15)
## Find test p.values for:
## One continuous and one categorical variable
cont_cat_test <- test_pair(data, var1 = "LBXTC", var2 = "RIAGENDR")
## Two continuous variables
cont_cont_test <- test_pair(data, var1 = "LBXTC", var2 = "LBXVIE")

## Two categorical variables
cat_cat_test <- test_pair(data, var1 = "DIQ010", var2 = "SMD410")

Author

Elyas Heidari

test_assoc

Test for association between each paired variables and construct heatmap using heatmaply package.

Tests for association between each paired variables:

Categorical-Categorical: Using pearson’s chi-squared test (chisq.test function from stats package).
Continuous-Continuous: Using correlation test (cor.test function from stats package).
Categorical-Continuous: Using analysis of variance model (aov function from stats package).

Also adjusts p.values using Benjamini & Hochberg method (p.adjust function from stats package) and constructs heatmap using heatmaply function.

Usage

test_assoc(data, vars, levels = NULL, plot = FALSE)

Arguments

data: a dataframe. It is strongly recommended that the dataframe has no missing data and is preprocessed.
vars: a list including the name (or index) of columns of data.
levels: An integer value indicating the maximum number of levels of a categorical variable. To be used to distinguish the categorical variable. Defaults to NULL because it is supposed that data has been preprocessed using data_preproc and the categorical variables are specified.
plot: Logical indicating if the heatmap should be constructed. Defaults to FALSE.

Value

matrix: A matrix containing p.values of tests between each two variables.
heatmap: A plotly object containing heatmap related to matrix.

Details

This provides a wrapper to chisq.test, cor.test, aov, p.adjust from stats package to test association between variables And a wrapper to heatmaply package to construct heatmap.

Examples

data("NHANES")
## Using raw data
df <- NHANES[1:1000, ]
test_matrix <- test_assoc(data = df, vars = colnames(df), plot = FALSE, levels = 15)

## Using preprocessed data
data <- data_preproc(NHANES, levels = 15)
data$SEQN <- NULL
## Outputs the heatmap too (plot = TRUE)
test_mat_heatmap <- test_assoc(data = data, vars = colnames(data[, 1:20]), plot = TRUE)

Author

Elyas Heidari

plot_assoc

Plot variables.

Plots one or a pair of variables (non) interactively using ggplot2 and highcharter packages.

Usage

plot_assoc(data, vars, levels = NULL, interactive = FALSE)

Arguments

data: A dataframe. It is strongly recommended that the dataframe has no missing data and is preprocessed.
vars: A vector of length one or two including the name (or index) of a (row) column(s) of data.
levels: An integer value indicating the maximum number of levels of a categorical variable. To be used to distinguish the categorical variable. Defaults to NULL because it is supposed that data has been preprocessed using data_preproc and the categorical variables are specified.
interactive: Logical indicating if the output should be interactive. Defaults to FALSE.

Value

One categorical variable: Plots the barplot of the variable.
One continuous variable: Plots the density (histogram) plot of the variable.
A categorical and a continuous variable: Plots a boxplot of the continuous variable for different levels of the categorical variable.
Two continuous variables: Plots a scatter plot of two variables.
Two categorical variables: Plots a relative histogram showing distribution of one variable for each level of the other.

Examples

## Preprocess the data
data("NHANES")
data <- data_preproc(NHANES, levels = 15)

## Plot (non)interactive for:
## One categorical variable
pt1 <- plot_assoc(data, vars = "PAD600")
pt2 <- plot_assoc(data, vars = "SMD410", interactive = TRUE)

## One continuous variable
pt3 <- plot_assoc(data, vars = "LBXTC")
pt4 <- plot_assoc(data, vars = "BMXBMI", interactive = TRUE)

## One continuous and one categorical variable
pt5 <- plot_assoc(data, vars = c("LBXTC", "RIAGENDR"))
pt6 <- plot_assoc(data, vars = c("BMXBMI", "PAD600"), interactive = TRUE)
##  Two continuous variables
pt7 <- plot_assoc(data, vars = c("LBXTC", "BMXBMI"))
pt8 <- plot_assoc(data, vars = c("LBXVIE", "LBXVIC"), interactive = TRUE)

## Two categorical variables
pt9 <- plot_assoc(data, vars = c("SMD410", "PAD600"))
pt10 <- plot_assoc(data, vars = c("PAD600", "SMD410"), interactive = TRUE)

## With raw data
pt11 <- plot_assoc(NHANES, vars = "RIDAGEYR", levels = 15)

Author

Elyas Heidari, Vahid Balazadeh