Title: | Helper Tools for Teaching Statistical Data Analysis |
---|---|
Description: | Provides functions and data-sets that are helpful for teaching statistics and data analysis. It was originally designed for use when teaching students in the Psychology Department at Nottingham Trent University. |
Authors: | Mark Andrews [aut, cre], Jens Roeser [aut] |
Maintainer: | Mark Andrews <[email protected]> |
License: | GPL-3 |
Version: | 0.1.1 |
Built: | 2024-11-13 09:19:18 UTC |
Source: | https://github.com/mark-andrews/psyntur |
Data on the height, weight, and handedness of men and women of different ages and races.
ansur
A data frame with 6068 observations on 9 variables.
Unique ID of the person
Binary variable indicating the subject's sex: male or female.
Height in centimeters.
Weight in kilograms.
Categorical variable indicating if the person is left, or right handed, or both.
Age in years
Race, with categories like white, black, hispanic.
The tercile of the person's height.
The tercile of the person's weight.
This data is a transformed version of data sets obtained from the Anthropometric Survey of US Army Personnel (ANSUR 2 or ANSUR II).
This is a wrapper to the effsize::cohen.d() function.
cohen_d(...)
... |
A comma separated list of arguments. See |
A list of class effsize as returned by effsize::cohen.d().
cohen_d(weight ~ gender, data = ansur)
cohen_d(age ~ gender, data = schizophrenia)
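As a rough illustration of the quantity being estimated, the following base R sketch computes Cohen's d for two samples using a pooled standard deviation. The helper name pooled_cohen_d and the toy vectors are hypothetical and are not part of this package or of effsize; effsize::cohen.d() also returns information, such as a confidence interval, that is not reproduced here.

```r
# Hypothetical helper (not part of psyntur or effsize):
# Cohen's d as the mean difference divided by the pooled standard deviation.
pooled_cohen_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_var <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

# Toy data chosen so the arithmetic is easy to follow:
# means 4 and 3, both variances 4, so d = (4 - 3) / 2.
d <- pooled_cohen_d(c(2, 4, 6), c(1, 3, 5))
d
```

With real data, the two vectors would be the values of the outcome variable in each of the two groups.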
This function is a wrapper around stats::cor.test().
By default, it performs Pearson's correlation test, which tests the null hypothesis that two paired samples of values are uncorrelated.
This function must be applied to two numeric vectors.
cor_test(x, y, method = "pearson", data)
x |
A numeric variable. |
y |
A numeric variable. |
method |
A character string indicating which correlation coefficient is to be used: "pearson", "kendall", or "spearman". Default method is "pearson". |
data |
A data frame containing the |
A tibble data frame with the correlation statistic, and the corresponding p-value.
cor_test(y = sex_dimorph, x = attractive, data = faithfulfaces)
cor_test(y = sex_dimorph, x = attractive, method = "spearman", data = faithfulfaces)
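For readers who want to see what the underlying call produces, here is a minimal base R sketch that runs stats::cor.test() directly and collects the estimate and p-value into a one-row data frame. The built-in mtcars data and the column names of summary_df are assumptions made for illustration; the exact columns returned by cor_test() may differ.

```r
# Pearson correlation test between car weight and fuel efficiency
# in the built-in mtcars data (an illustrative choice of data).
result <- stats::cor.test(mtcars$wt, mtcars$mpg, method = "pearson")

# Collect the main quantities from the "htest" object into one row.
summary_df <- data.frame(
  cor = unname(result$estimate),
  t = unname(result$statistic),
  df = unname(result$parameter),
  p_value = result$p.value
)
summary_df
```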
This function is a wrapper around stats::cor.test().
By default, it performs Pearson's correlation test, which tests, for each pair of variables, the null hypothesis that the paired values are uncorrelated.
This function can be applied to two or more numeric variables in the provided data.
cor_test_multi( .data, ..., .pvalues = FALSE, .ci = FALSE, .as_matrix = TRUE, .omit_redundancies = FALSE, .method = "pearson" )
.data |
A data frame. |
... |
Variables for which the correlation coefficient should be returned.
If no variable name is provided, correlations will be returned for all numeric
variables in |
.pvalues |
Logical. If FALSE (default), p-values will be omitted from the output. If TRUE, p-values will be included in the output. |
.ci |
Logical. If FALSE (default), 95% confidence interval bounds will be omitted from the output. If TRUE, 95% confidence interval bounds will be included in the output. |
.as_matrix |
Logical. If TRUE (default), results will be returned as a matrix. If FALSE, results will be returned as a tibble. |
.omit_redundancies |
Logical. If FALSE (default), all n^2 correlations will be included in the output. If TRUE, only unique correlations will be returned (x ~ y but not y ~ x) and the correlation of a variable with itself will be omitted. |
.method |
A character string indicating which correlation coefficient is to be used: "pearson", "kendall", or "spearman". Default method is "pearson". |
By default a matrix with correlation coefficients. Output format and included statistics can be changed in the argument settings.
# Calculate the correlations between all numeric variables in the `faithfulfaces` data.
cor_test_multi(faithfulfaces)
# Calculate the correlations between the 1st, 2nd and 4th variable.
cor_test_multi(faithfulfaces, c(1,2,4))
# Calculate the correlations between `sex_dimorph`, `attractive`, and `trustworthy`.
cor_test_multi(faithfulfaces, sex_dimorph, attractive, trustworthy)
# Calculate all correlations and return p-values and 95% confidence intervals.
cor_test_multi(faithfulfaces, .pvalues = TRUE, .ci = TRUE)
# Calculate all correlations with p-values and 95% confidence intervals and
# return results as table with only unique pairs of the off-diagonal correlations.
cor_test_multi(faithfulfaces, .pvalues = TRUE, .ci = TRUE, .as_matrix = FALSE, .omit_redundancies = TRUE)
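The difference between the full n^2 matrix and the table of unique pairs can be sketched in base R as follows. This is only an approximation of the idea under stated assumptions: the mtcars columns and the names r_matrix, pair_names, and unique_pairs are illustrative, and the layout of the real cor_test_multi() output may differ.

```r
# Three numeric columns from the built-in mtcars data (illustrative choice).
vars <- mtcars[, c("mpg", "wt", "hp")]

# Full n x n matrix of Pearson correlations, including the diagonal.
r_matrix <- stats::cor(vars, method = "pearson")

# Only the unique off-diagonal pairs, each tested with stats::cor.test().
pair_names <- t(utils::combn(names(vars), 2))
unique_pairs <- data.frame(x = pair_names[, 1], y = pair_names[, 2],
                           stringsAsFactors = FALSE)
tests <- lapply(seq_len(nrow(unique_pairs)), function(i) {
  stats::cor.test(vars[[unique_pairs$x[i]]], vars[[unique_pairs$y[i]]],
                  method = "pearson")
})

# Extract the estimate, p-value, and 95% confidence bounds for each pair.
unique_pairs$cor <- vapply(tests, function(h) unname(h$estimate), numeric(1))
unique_pairs$p_value <- vapply(tests, function(h) h$p.value, numeric(1))
unique_pairs$ci_lower <- vapply(tests, function(h) h$conf.int[1], numeric(1))
unique_pairs$ci_upper <- vapply(tests, function(h) h$conf.int[2], numeric(1))
unique_pairs
```

With three variables, the matrix has nine entries while the table of unique pairs has only three rows.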
This function calculates the Cronbach alpha for one or more sets of psychometric scale items. Each item is a variable in a data frame. Each set of items is defined by a tidy selection of a set of items.
cronbach(.data, ..., .ci = 0.95)
.data |
A data frame with columns that are psychometric items. |
... |
A set of comma separated tidy selectors that selects sets of
columns from |
.ci |
The value of the confidence interval to calculate. |
A data frame whose rows are psychometric scales and for each scale, we have the Cronbach's alpha, and the lower and upper bound of the confidence interval on alpha.
# Return the Cronbach alpha and 95% ci for two scales.
# The first scale, named `x`, is identified by all items beginning with `x_`.
# The second scale, named `y`, is identified by the consecutive items from `y_1` to `y_10`.
cronbach(test_psychometrics, x = starts_with('x'), y = y_1:y_10)
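To make the statistic itself concrete, the sketch below computes Cronbach's alpha from its textbook definition, alpha = k / (k - 1) * (1 - sum of item variances / variance of total score). The helper name cronbach_alpha and the toy items are hypothetical; this is not necessarily how cronbach() computes alpha internally, and no confidence interval is calculated here.

```r
# Hypothetical helper: Cronbach's alpha from the standard formula.
# `items` is a data frame or matrix with one column per scale item.
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)
  item_variances <- apply(items, 2, var)
  total_variance <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
}

# Two toy items that agree only partially.
toy_items <- data.frame(a = c(1, 2, 3, 4), b = c(2, 1, 4, 3))
alpha <- cronbach_alpha(toy_items)
alpha
```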
This is a wrapper to the typical ggplot based density plot, i.e., using geom_density. A continuous variable, x, is required as an input. Optionally, a by categorical variable can be provided.
densityplot( x, data, by = NULL, position = "stack", facet = NULL, facet_type = "wrap", alpha = 1, xlab = NULL, ylab = NULL )
x |
The numeric variable that is to be density plotted. |
data |
A data frame with at least one numeric variable (the |
by |
A categorical variable by which to group the |
position |
If the |
facet |
A character string or character vector. If provided, we
|
facet_type |
By default, this takes the value of |
alpha |
The transparency for the filled histogram bars. This is
probably only required when using |
xlab |
The label of the x-axis (defaults to the |
ylab |
The label of the y-axis (defaults to the |
A ggplot2::ggplot object, which may be modified with further ggplot2 commands.
densityplot(x = age, data = schizophrenia, by = gender)
This function is a lightweight wrapper to dplyr's summarize function. It can be used to calculate any descriptive or summary statistic for any variable in the data set. Optionally, a by grouping variable can be used, and then the summary statistics are calculated for each subgroup defined by the different values of the by variable.
describe(data, by = NULL, ...)
data |
A data frame |
by |
A grouping variable. If included, the |
... |
Arguments of functions applied to variables, e.g. |
A tibble data frame with each row providing descriptive statistics for selected variables for each value of the grouping by variable.
describe(faithfulfaces, avg = mean(faithful), stdev = sd(faithful))
describe(faithfulfaces, by = face_sex, avg = mean(faithful), stdev = sd(faithful))
This function is a wrapper to dplyr's summarize used with the across function. For each variable in a set of variables, calculate each summary statistic from a list of summary statistic functions. Optionally, group the variables by a grouping variable, and then calculate the statistics. Optionally, the tibble that is returned by default, which is in a wide format, can be pivoted to a long format.
describe_across(data, variables, functions, by = NULL, pivot = FALSE)
data |
A data frame |
variables |
A vector of variables in |
functions |
A list of summary statistic functions. If it is a named list, which is recommended, the names of the functions will be used to make the names of the returned data frame. |
by |
A grouping variable. If included, the |
pivot |
A logical variable indicating if the wide format da |
A tibble data frame. If pivot = F, which is the default, the data frame contains one row per value of the by variable, or just one row overall if there is no by variable. If pivot = T, there will be k + 1 columns if there is no by variable, or k + 2 columns if there is a by variable, where k is the number of functions.
describe_across(faithfulfaces, variables = c(trustworthy, faithful), functions = list(avg = mean, stdev = sd), pivot = TRUE)
describe_across(faithfulfaces, variables = c(trustworthy, faithful), functions = list(avg = mean, stdev = sd), by = face_sex)
describe_across(faithfulfaces, variables = c(trustworthy, faithful), functions = list(avg = mean, stdev = sd), by = face_sex, pivot = TRUE)
Remove a row if all values on selected columns, or by default, on all columns, are missing, i.e. have values of NA or NaN.
drop_if_all_na(data, ...)
data |
A data frame |
... |
< |
The drop_na function will remove any row if it has any NA in selected columns. By default, it will remove the row if there is any NA or NaN in any column. This drop_if_all_na function is similar but removes the row only if all values in the selected columns are NA or NaN. As with drop_na, by default it will use all columns. In other words, by default, drop_if_all_na removes any row if all values on that row are NA or NaN.
A data frame, possibly with some rows dropped.
data_df <- data.frame(x = c(1, 2, NA, NA), y = c(2, NA, 5, NA))
drop_if_all_na(data_df)
drop_if_all_na(data_df, x)
drop_if_all_na(data_df, y)
drop_if_all_na(data_df, x, y)
drop_if_all_na(data_df, x:y)
drop_if_all_na(data_df, starts_with('x'), ends_with('y'))
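The row-selection rule described above can be sketched in base R, without tidy selection, as follows. The object names all_missing, kept_all, and kept_x are illustrative only, and this is an assumption-laden approximation of the behaviour rather than the function's actual implementation.

```r
data_df <- data.frame(x = c(1, 2, NA, NA), y = c(2, NA, 5, NA))

# A row is dropped only when every value in the considered columns is missing.
# is.na() is TRUE for both NA and NaN.
all_missing <- rowSums(is.na(data_df)) == ncol(data_df)
kept_all <- data_df[!all_missing, , drop = FALSE]

# The same rule restricted to a selection of columns, here just `x`.
selected <- "x"
all_missing_x <- rowSums(is.na(data_df[, selected, drop = FALSE])) == length(selected)
kept_x <- data_df[!all_missing_x, , drop = FALSE]

kept_all
kept_x
```

Using all columns, only the fourth row is dropped; restricting to `x` drops the third and fourth rows.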
This is a wrapper to the ez::ezANOVA() function.
ez_anova( data, dv, wid, within = NULL, within_full = NULL, within_covariates = NULL, between = NULL, between_covariates = NULL, observed = NULL, diff = NULL, reverse_diff = FALSE, type = 2, white.adjust = FALSE, detailed = FALSE, return_aov = FALSE )
data |
Data frame containing the data to be analyzed. |
dv |
Name of the column in |
wid |
Name of the column in |
within |
Names of columns in |
within_full |
Same as within, but intended to specify the full within-Ss design in cases where the data have not already been collapsed to means per condition specified by |
within_covariates |
Names of columns in |
between |
Names of columns in |
between_covariates |
Names of columns in |
observed |
Names of columns in |
diff |
Names of any variables to collapse to a difference score. If a single value, may be specified by name alone; if multiple values, must be specified as a .() list. |
reverse_diff |
Logical. If TRUE, triggers reversal of the difference collapse requested by |
type |
Numeric value (either |
white.adjust |
Only affects behaviour if the design contains only between-Ss predictor variables. If not FALSE, the value is passed as the white.adjust argument to Anova, which provides heteroscedasticity correction. |
detailed |
Logical. If TRUE, returns extra information (sums of squares columns, intercept row, etc.) in the ANOVA table. |
return_aov |
Logical. If TRUE, computes and returns an aov object corresponding to the requested ANOVA (useful for computing post-hoc contrasts). |
A list containing one or more components as returned by ez::ezANOVA().
ez_anova(data = selfesteem2_long, dv = score, wid = id, within = c(time, treatment), detailed = TRUE, return_aov = TRUE)
Ratings from a facial photo and actual faithfulness.
faithfulfaces
A data frame with 170 observations on the following 7 variables.
Rating of sexual dimorphism (masculinity for males, femininity for females)
Rating of attractiveness
Was the face subject unfaithful to a partner?
Rating of trustworthiness
Rating of faithfulness
Sex of face (female or male)
Sex of rater (female or male)
College students were asked to look at a photograph of an opposite-sex adult face and to rate the person, on a scale from 1 (low) to 10 (high), for attractiveness. They were also asked to rate trustworthiness, faithfulness, and sexual dimorphism (i.e., how masculine a male face is and how feminine a female face is). Overall, 68 students (34 males and 34 females) rated 170 faces (88 men and 82 women).
This data set was taken from the Stats2Data R package. From the description in that package, the original is based on
G. Rhodes et al. (2012), "Women can judge sexual unfaithfulness from
unfamiliar men's faces," Biology Letters, November 2012. All of the 68
raters were heterosexual Caucasians, as were the 170 persons who were
rated. (We have deleted 3 subjects with missing values and 16 subjects who
were over age 35.)
For each value of a categorical variable, show the binary code used in a regression model to represent its value.
This is a wrapper to the fastDummies::dummy_cols() function.
get_dummy_code(Df, variable)
Df |
A data frame |
variable |
A categorical variable (e.g. character vector or factor). |
A data frame whose rows provide the dummy code for each distinct value of variable.
get_dummy_code(PlantGrowth, group)
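For comparison, the treatment (dummy) coding that R's regression functions use by default for a factor can be inspected with base R alone. This sketch relies on stats::model.matrix() rather than fastDummies, so the column names and layout are those of model.matrix() and are not necessarily what get_dummy_code() returns.

```r
# Design matrix for a model with `group` as the only predictor.
design <- stats::model.matrix(~ group, data = PlantGrowth)

# Keep one row per distinct coding and label it with the group it belongs to.
codes <- unique(design)
first_rows <- as.integer(rownames(codes))
rownames(codes) <- as.character(PlantGrowth$group[first_rows])
codes
```

The reference level, ctrl, is coded with zeros on both indicator columns, and each of the other levels has a one on its own indicator.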
This is a wrapper to the typical ggplot based histogram, i.e., using geom_histogram. A continuous variable, x, is required as an input. Optionally, a by categorical variable can be provided.
histogram( x, data, by = NULL, position = "stack", facet = NULL, facet_type = "wrap", bins = 10, alpha = 1, xlab = NULL, ylab = NULL )
x |
The numeric variable that is to be histogrammed. |
data |
A data frame with at least one numeric variable (the |
by |
A categorical variable by which to group the |
position |
If the |
facet |
A character string or character vector. If provided, we
|
facet_type |
By default, this takes the value of |
bins |
The number of bins to use in the histogram. |
alpha |
The transparency for the filled histogram bars. This is
probably only required when using |
xlab |
The label of the x-axis (defaults to the |
ylab |
The label of the y-axis (defaults to the |
A ggplot2::ggplot object, which may be modified with further ggplot2 commands.
histogram(x = age, data = schizophrenia, by = gender, bins = 20)
histogram(x = age, data = schizophrenia, by = gender, position = 'identity', bins = 20, alpha = 0.7)
histogram(x = age, data = schizophrenia, by = gender, position = 'dodge', bins = 20)
histogram(x = weight, bins = 20, data = ansur, facet = height_tercile)
histogram(x = weight, bins = 20, data = ansur, facet = c(height_tercile, age_tercile), facet_type = 'grid')
Make an interaction line plot.
interaction_line_plot(y, x, by, data, ylim = NULL, xlab = NULL, ylab = NULL)
y |
A continuous variable to be plotted along the y-axis |
x |
A continuous variable to be plotted along the x-axis |
by |
A categorical variable by which we split the data and create one line plot for each resulting group |
data |
A data frame with the |
ylim |
A vector of limits for the y-axis |
xlab |
The label of the x-axis (defaults to the |
ylab |
The label of the y-axis (defaults to the |
A ggplot2::ggplot object, which may be modified with further ggplot2 commands.
interaction_line_plot(y = score, x = time, by = treatment, data = selfesteem2_long, ylim = c(70, 100))
interaction_line_plot(y = score, x = time, by = treatment, data = selfesteem2_long, xlab = 'measurement time', ylab = 'self esteem score', ylim = c(70, 100))
Contains the job satisfaction score organized by gender and education level.
This data set was taken from the datarium R package.
data("jobsatisfaction")
A data frame with 58 rows and 3 columns.
data(jobsatisfaction)
jobsatisfaction
A wrapper to stats::t.test() with paired = TRUE.
paired_t_test(y1, y2, data, ...)
y1 |
A numeric vector of observations |
y2 |
A numeric vector of observations, where each value of y2 is assumed to be paired, such as by repeated measures, with the corresponding value of y1. |
data |
A data frame with |
... |
Additional arguments passed to |
A list with class "htest" as returned by stats::t.test().
paired_t_test(y1, y2, data = pairedsleep)
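A paired t-test is equivalent to a one-sample t-test on the within-pair differences, which the following base R sketch checks using the built-in datasets::sleep data. The vector names drug1 and drug2 are illustrative, and the sketch assumes that the rows of sleep are ordered by patient within each group, as they are in the standard dataset.

```r
# Increase in hours of sleep for each drug, one value per patient.
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Paired t-test, as performed by stats::t.test() with paired = TRUE.
paired_result <- stats::t.test(drug1, drug2, paired = TRUE)

# The same test expressed as a one-sample t-test on the differences.
difference_result <- stats::t.test(drug1 - drug2)

paired_result$statistic
difference_result$statistic
```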
Data which show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.
pairedsleep
A data frame with 10 observations on the following 3 variables.
The patient ID.
The increase in hours, relative to control, for drug 1.
The increase in hours, relative to control, for drug 2.
This data is a transformed version of datasets::sleep.
This is a wrapper to the GGally based pairs plot of a list of variables displayed as scatterplots for pairs of continuous variables, density functions in the diagonal, and boxplots for pairs of continuous and categorical variables. Optionally, a by categorical variable can be provided.
pairs_plot(variables, data, by = NULL)
variables |
A vector of variable names |
data |
The data frame. |
by |
An optional variable, usually categorical (factor or character), by which the data are grouped and coloured. |
A GGally::ggpairs plot.
# A simple pairs plot
pairs_plot(variables = c("sex_dimorph", "attractive"), data = faithfulfaces)
# A pairs plot with grouping variable
pairs_plot(variables = c("sex_dimorph", "attractive"), by = face_sex, data = faithfulfaces)
This is a wrapper to the pairwise.t.test function. The p-value adjustment is "bonferroni" by default. Other possible values are "holm", "hochberg", "hommel", "BH", "BY", "fdr", "none". See stats::p.adjust().
pairwise_t_test(formula, data, p_adj = "bonferroni")
formula |
A two sided formula with one variable on either side, e.g. y ~
x, where the left hand side, dependent, variable is a numeric variable in
|
data |
A data frame that contains the dependent and independent variables. |
p_adj |
The p-value adjustment method (see Description). |
An object of class pairwise.htest as returned by stats::pairwise.t.test().
data_df <- dplyr::mutate(vizverb, IV = interaction(task, response))
pairwise_t_test(time ~ IV, data = data_df)
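The effect of the p-value adjustment can be seen by calling stats::pairwise.t.test() directly with and without a correction. The sketch below uses the built-in PlantGrowth data as an illustrative stand-in, and the object names adjusted and unadjusted are assumptions for this example.

```r
# All pairwise comparisons between the three PlantGrowth groups,
# with Bonferroni-adjusted p-values.
adjusted <- stats::pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                                   p.adjust.method = "bonferroni")

# The same comparisons with no adjustment, for reference.
unadjusted <- stats::pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                                     p.adjust.method = "none")

adjusted$p.value
unadjusted$p.value
```

Because the Bonferroni method multiplies each p-value by the number of comparisons (capping the result at 1), the adjusted values are never smaller than the unadjusted ones.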
Replace specified values with new values.
re_code(x, from, to)
x |
A vector, including column of data frame |
from |
The set of old values to be replaced by new ones |
to |
The set of new values to replace the old ones |
A vector that is the input vector but with old values replaced by new ones.
# Replace any occurrence of 1 and 2 with 101 and 201, respectively
x <- c(1, 2, 3, 4, 5, 1, 2)
re_code(x, from = c(1, 2), to = c(101, 201))
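One way such a replacement can be done in base R is with match(), as in the sketch below. The helper name recode_values is hypothetical and this is not necessarily how re_code() is implemented; it only illustrates the from/to mapping.

```r
# Hypothetical helper: replace each value found in `from`
# with the value at the same position in `to`.
recode_values <- function(x, from, to) {
  stopifnot(length(from) == length(to))
  idx <- match(x, from)      # position in `from`, or NA if not found
  found <- !is.na(idx)
  replace(x, found, to[idx[found]])
}

x <- c(1, 2, 3, 4, 5, 1, 2)
recoded <- recode_values(x, from = c(1, 2), to = c(101, 201))
recoded
```

Values that do not appear in `from` are left unchanged.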
Remove the first row of a data frame assuming that row was essentially a second (and redundant) header row in the original raw data file. After that row is removed, the data frame is reparsed to reinfer the data-types of each column.
remove_double_header(data_df)
data_df |
A data frame where it is assumed that the first row provides redundant header information and so it needs to be removed. |
Some software, including Qualtrics (survey software) and Gorilla (behavioural experiment software), sometimes export their data where the first two rows are both essentially headers, i.e., column labels. These two rows are not identical and often the second is redundant and so needs to be skipped. Data import functions like read_csv, and many others, do not let you skip the second row if the first row is not skipped. On the other hand, it is easy to read in all the data as per usual and then use, for example, slice, to remove the second row in the original. For example, slice(data_df, -1) will remove the first row in the data frame named data_df, which would be the second row of the original data file (assuming, as is common, that the first row of the original was used as the header to create the column names).
Although removing one row is easy to accomplish using basic tools in R, the bigger problem is that when the data was originally imported, all columns were probably parsed as character vectors. This is because the header information in the second row of the original data, which is usually parsed as strings, forces the parser in a function like read_csv to parse each whole column as a character vector. After that second header row is removed, all the columns still remain character vectors even though they could be numeric, logical, etc. It is possible to use, for example, mutate and across to recode these columns, but that is not always possible with one simple command.
An alternative approach is, after the header row is removed, to reparse all the columns to infer their data types and then automatically recode them. This is what is done in this function. The parser that is used is the one used by readr.
Note that this reparsing is no more, and no less, foolproof than what happens whenever we use, for example, read_csv to import data without explicitly specifying the data type of each column, which is commonly done. Given this, it is wise to check the new data types to make sure that there are no errors.
A new data frame where the data types of all columns were re-inferred after the first row was removed.
double_headered_csv <- '
a,b,c
x,x,x
1,2024/12/27,TRUE
2,2024/12/17,TRUE
3,2024/12/27,FALSE
'
readr::read_csv(double_headered_csv) |> remove_double_header()
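The drop-then-reparse idea can be approximated in base R with utils::type.convert(), as sketched below. Note the assumptions: this sketch uses base R's type inference rather than the readr parser that remove_double_header() relies on, so the inferred types may differ (for instance, base R will not infer dates), and the small CSV text is made up for illustration.

```r
# A made-up file whose second line is a redundant header row.
csv_text <- "a,b,c\nx,x,x\n1,2.5,TRUE\n2,3.5,TRUE\n3,4.5,FALSE"

# Because of the extra header row, every column is read as character.
raw <- utils::read.csv(text = csv_text, stringsAsFactors = FALSE)
sapply(raw, class)

# Drop the first data row, then re-infer each column's type.
without_header <- raw[-1, , drop = FALSE]
reparsed <- utils::type.convert(without_header, as.is = TRUE)
rownames(reparsed) <- NULL
sapply(reparsed, class)
```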
This function will rename a selection of columns as, for example, var_1, var_2, var_3 ... var_10, where the prefix, var in this example, is arbitrary.
rename_with_seq(data_df, col_selector, prefix = "var")
data_df |
A data frame |
col_selector |
A tidy selector, e.g. |
prefix |
The prefix for the sequence, e.g. 'drug' to produce names like
|
If we had, for example, a data frame where columns were the names of drugs and we wanted to rename these columns something like drug_1, drug_2, ..., this would be easy to do with rename if there were just a few columns to rename. When there are more than just a few, individual renaming is somewhat tedious and error prone. We can use rename_with to do this in one operation. However, the code for doing so is not very simple and would require some proficiency in R and tidyverse. This function is essentially just a wrapper to a rename_with function to allow the renaming to be done in one simple command.
A data frame with renamed columns
data_df <- readr::read_csv('
subject, age, gender, Aripiprazole, Clozapine, Olanzapine, Quetiapine
A, 27, F, 20, 10, 40, 25
B, 23, M, 21, 21, 35, 27
')
rename_with_seq(data_df, col_selector = Aripiprazole:Quetiapine, prefix = 'drug')
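The renaming itself amounts to replacing the selected column names with a prefix followed by a running index. A base R sketch of that idea, without tidy selection, is shown below; the small data frame and the name drug_cols are made up for illustration, and the exact name format produced by rename_with_seq() is assumed rather than confirmed.

```r
# A small made-up data frame with one column per drug.
data_df <- data.frame(
  subject = c("A", "B"),
  Aripiprazole = c(20, 21),
  Clozapine = c(10, 21),
  Olanzapine = c(40, 35)
)

# Columns to rename, and their positions in the data frame.
drug_cols <- c("Aripiprazole", "Clozapine", "Olanzapine")
positions <- match(drug_cols, names(data_df))

# Replace the selected names with prefix_1, prefix_2, ...
names(data_df)[positions] <- paste0("drug_", seq_along(drug_cols))
names(data_df)
```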
This function is a wrapper around the typical ggplot command to create two dimensional scatterplots, i.e. using geom_point. It provides the option of colouring points by a third variable, one that is usually, though not necessarily, categorical. Also, it provides the option of placing the line of best fit on the scatterplot. If points are coloured by a categorical variable, then a different line of best fit for each value of the categorical variable is provided.
scatterplot( x, y, data, by = NULL, best_fit_line = FALSE, xlab = NULL, ylab = NULL )
x |
A numeric variable in |
y |
A numeric variable in |
data |
A data frame with the |
by |
An optional variable, usually categorical (factor or character), by which the points in the scatterplot are grouped and coloured. |
best_fit_line |
A logical variable indicating if the line of best fit should be shown or not. |
xlab |
The label of the x-axis (defaults to the |
ylab |
The label of the y-axis (defaults to the |
A ggplot2::ggplot object, which may be modified with further ggplot2 commands.
scatterplot(x = attractive, y = trustworthy, data = faithfulfaces)
scatterplot(x = attractive, y = trustworthy, data = faithfulfaces, xlab = 'attractiveness', ylab = 'trustworthiness')
scatterplot(x = attractive, y = trustworthy, data = faithfulfaces, by = face_sex)
scatterplot(x = trustworthy, y = faithful, data = faithfulfaces, by = face_sex, best_fit_line = TRUE)
Make a scatterplot matrix
scatterplot_matrix(.data, ..., .by = NULL, .bins = 10)
.data |
A data frame |
... |
A comma separated list of tidyselections of columns. This can be as simple as a set of column names. |
.by |
An optional categorical variable by which to group and colour the points. |
.bins |
The number of bins in the histograms on diagonal of matrix. |
A GGally::ggpairs plot.
data_df <- test_psychometrics %>%
  total_scores(x = starts_with('x_'), y = starts_with('y_'), z = starts_with('z_'))
scatterplot_matrix(data_df, x, y, z)
Data on sex differences in the age of onset of schizophrenia.
schizophrenia
A data frame with 251 observations on the following 2 variables.
Age at the time of diagnosis.
A categorical variable with values female and male.
A sex difference in the age of onset of schizophrenia was noted by Kraepelin (1919). Subsequently epidemiological studies of the disorder have consistently shown an earlier onset in men than in women. One model that has been suggested to explain this observed difference is known as the subtype model which postulates two types of schizophrenia, one characterised by early onset, typical symptoms and poor premorbid competence, and the other by late onset, atypical symptoms, and good premorbid competence. The early onset type is assumed to be largely a disorder of men and the late onset largely a disorder of women.
This data set was taken from the HSAUR R package. From the description in that package, the original is E. Kraepelin (1919), Dementia Praecox and Paraphrenia. Livingstone, Edinburgh.
The dataset contains the self-esteem scores of 10 individuals at three time points during a specific diet, collected to determine whether their self-esteem improved.
One-way repeated measures ANOVA can be performed in order to determine the effect of time on the self-esteem score.
This data set was taken from the datarium R package.
data("selfesteem")
A data frame with 10 rows and 4 columns.
data(selfesteem)
selfesteem
Data are the self esteem score of 12 individuals enrolled in 2 successive short-term trials (4 weeks) - control (placebo) and special diet trials.
The self esteem score was recorded at three time points: at the beginning (t1), midway (t2) and at the end (t3) of the trials.
The same 12 participants are enrolled in the two different trials with enough time between trials.
Two-way repeated measures ANOVA can be performed in order to determine whether there is interaction between time and treatment on the self esteem score.
This data set was taken from the
datarium
R
package.
data("selfesteem2")
A data frame with 24 rows and 5 columns.
data(selfesteem2)
selfesteem2
Data are the self-esteem scores of 12 individuals enrolled in 2 successive short-term (4-week) trials: a control (placebo) trial and a special diet trial.
The self-esteem score was recorded at three time points: at the beginning (t1), midway (t2), and at the end (t3) of the trials.
The same 12 participants were enrolled in both trials, with enough time between trials.
Two-way repeated measures ANOVA can be performed to determine whether there is an interaction between time and treatment on the self-esteem score.
This data set was converted from the `selfesteem2` data taken from the datarium R package.
data("selfesteem2_long")
A data frame with 72 rows and 4 columns.
Unique ID of the person
Binary variable indicating the treatment condition: `Diet` or `ctr`.
A categorical variable indicating the time of measurement: beginning (`t1`), midway (`t2`), and at the end (`t3`)
Self-esteem score
data(selfesteem2_long)
selfesteem2_long
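The description says this data set was converted from the wide `selfesteem2` data. One way such a wide-to-long conversion could be done is sketched below with base R's `stats::reshape()`. The data are simulated, and the column names (`id`, `treatment`, `t1` to `t3`, `score`, `time`) are assumptions made for illustration; the package's own conversion may have been done differently.

```r
# Simulated wide data: 12 people x 2 treatments = 24 rows, 5 columns.
# Values and column names are illustrative assumptions only.
set.seed(2)
wide <- data.frame(
  id = factor(rep(1:12, times = 2)),
  treatment = factor(rep(c("ctr", "Diet"), each = 12)),
  t1 = rnorm(24, 88, 5),
  t2 = rnorm(24, 86, 5),
  t3 = rnorm(24, 84, 5)
)

# Stack the three time columns into one score column,
# with a time column recording which time point each score came from.
long <- reshape(
  wide,
  direction = "long",
  varying = c("t1", "t2", "t3"),
  v.names = "score",
  timevar = "time",
  times = c("t1", "t2", "t3"),
  idvar = c("id", "treatment")
)
rownames(long) <- NULL

dim(wide)
dim(long)
```

Each of the 24 wide rows contributes three long rows, one per time point, which is consistent with the row counts given for the two data sets.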
This function is a wrapper around `stats::shapiro.test()`.
It implements the Shapiro-Wilk test that tests the null hypothesis that a sample of values is a sample from a normal distribution.
This function can be applied to single vectors or groups of vectors.
shapiro_test(y, by = NULL, data)
y |
A numeric variable whose normality is being tested. |
by |
An optional grouping variable |
data |
A data frame containing |
A tibble data frame with one row for each value of the `by` variable, or one row overall if there is no `by` variable. For the `y` variable whose normality is being tested, the Shapiro-Wilk statistic and the corresponding p-value are returned for each subset of values corresponding to the values of the `by` variable, or for all values if there is no `by` variable.
shapiro_test(faithful, data = faithfulfaces)
shapiro_test(faithful, by = face_sex, data = faithfulfaces)
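To make the grouped behaviour concrete, here is a minimal base R sketch of what applying `stats::shapiro.test()` overall and then within each group looks like. It uses the built-in `ToothGrowth` data and is not the package's implementation; the output column names (`statistic`, `p_value`) are assumptions.

```r
# Shapiro-Wilk test on all values of one numeric variable
overall <- shapiro.test(ToothGrowth$len)

# The same test applied separately within each level of a grouping variable
groups <- split(ToothGrowth$len, ToothGrowth$supp)
by_group <- do.call(rbind, lapply(names(groups), function(g) {
  res <- shapiro.test(groups[[g]])
  data.frame(
    supp = g,
    statistic = unname(res$statistic),
    p_value = res$p.value
  )
}))

overall
by_group
```

The result has one row per group, mirroring the return value described above.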
Most descriptive statistic functions like `base::sum()`, `base::mean()`, `stats::median()`, etc., do not skip `NA` values when computing the results and so always return `NA` if there is at least one `NA` in the input vector. The `NA` values can always be skipped by setting the `na.rm` argument to `TRUE`. While this is usually simple to do, in some cases, such as when a function is being passed to another function, setting `na.rm = TRUE` in that function requires creating a new anonymous function. The functions here, which all end in `_xna`, are wrappers to common statistics functions, but with `na.rm = TRUE`.
sum_xna(...)
mean_xna(...)
median_xna(...)
iqr_xna(...)
sd_xna(...)
var_xna(...)
... |
Arguments to a descriptive statistic function |
A numeric vector, usually with one element, that provides the result of a descriptive statistics function applied to a vector after the `NA` values have been removed.
`mean_xna()`: The arithmetic mean for vectors with missing values.
`median_xna()`: The median for vectors with missing values.
`iqr_xna()`: The interquartile range for vectors with missing values.
`sd_xna()`: The standard deviation for vectors with missing values.
`var_xna()`: The variance for vectors with missing values.
set.seed(10101)
# Make a vector of random numbers
x <- runif(10, min = 10, max = 20)
# Concatenate with a NA value
x1 <- c(NA, x)
sum(x)
sum(x1) # Will be NA
sum_xna(x1) # Will be same as sum(x)
stopifnot(sum_xna(x1) == sum(x))
stopifnot(mean_xna(x1) == mean(x))
stopifnot(median_xna(x1) == median(x))
stopifnot(iqr_xna(x1) == IQR(x))
stopifnot(sd_xna(x1) == sd(x))
stopifnot(var_xna(x1) == var(x))
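The motivating case in the description, passing a function to another function, can be sketched as follows. `mean_skip_na()` is a hypothetical stand-in written for this illustration; it only assumes that a `_xna` wrapper forwards its arguments with `na.rm = TRUE`, and the package's own definitions may differ.

```r
# Hypothetical stand-in for a wrapper such as mean_xna()
mean_skip_na <- function(...) mean(..., na.rm = TRUE)

scores <- data.frame(a = c(1, NA, 3), b = c(2, 4, NA))

# Without a wrapper, an anonymous function is needed to set na.rm = TRUE
via_anonymous <- sapply(scores, function(x) mean(x, na.rm = TRUE))

# With a wrapper, the function can simply be passed by name
via_wrapper <- sapply(scores, mean_skip_na)

via_anonymous
via_wrapper
```

Both calls give the same result; the wrapper only removes the need to write the anonymous function.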
A wrapper to `stats::t.test()` with `var.equal = TRUE`.
t_test(formula, data)
formula |
A two sided formula with one variable on either side, e.g. y ~
x, where the left hand side, dependent, variable is a numeric variable in
|
data |
A data frame that contains the dependent and independent variables. |
A list with class "htest" as returned by `stats::t.test()`.
t_test(trustworthy ~ face_sex, data = faithfulfaces)
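For context, the effect of `var.equal = TRUE` can be seen by calling `stats::t.test()` directly. The sketch below uses the built-in `sleep` data rather than `faithfulfaces`, and compares the pooled-variance test with R's default Welch test.

```r
# sleep is a built-in data set with a two-level grouping factor (10 per group)
pooled <- t.test(extra ~ group, data = sleep, var.equal = TRUE)

# The default in stats::t.test() is var.equal = FALSE (Welch's test)
welch <- t.test(extra ~ group, data = sleep)

# Pooled-variance degrees of freedom are n1 + n2 - 2;
# Welch's approximation is generally not a whole number
pooled$parameter
welch$parameter
```

The two tests give the same difference in means but can differ in degrees of freedom and p-value, which is why the choice of `var.equal` matters.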
Typical psychometrics raw data files have multiple psychometric variables (scales), each with multiple constituent items.
In this data set, there are three psychometric variables, each with 10 constituent items.
The variables can be labelled `x`, `y`, and `z`.
The constituent items of `x`, `y` and `z` are `x_1, x_2 ... x_10`, `y_1, y_2 ... y_10`, `z_1, z_2 ... z_10`, respectively.
data('test_psychometrics')
A data frame with 44 rows and 30 columns
data(test_psychometrics)
test_psychometrics
This function formats specified numeric columns in a data frame to a fixed number of decimal places.
to_fixed_digits(data, ..., .digits = 3)
data |
A data frame or tibble containing the columns to format. |
... |
< |
.digits |
An integer specifying the number of decimal places to format to. Default is 3. |
Tibble data frames display numeric values to a certain number of significant figures, determined by the `pillar.sigfig` option. Sometimes it may be useful or necessary to see values to a fixed number of digits. This can be accomplished with `num`. This function is a convenience function that applies `num` to all, or a specified subset, of the numeric vectors in a tibble.
A data frame with the selected numeric columns formatted to the specified number of decimal places.
# Format all numeric columns to 3 decimal places
mtcars_df <- tibble::as_tibble(mtcars)
to_fixed_digits(mtcars_df)
# Format columns mpg to qsec to 3 decimal places
to_fixed_digits(mtcars_df, mpg:qsec)
# Format specific columns to 2 decimal places
to_fixed_digits(mtcars_df, mpg, hp, .digits = 2)
Calculate the total scores from sets of scores
total_scores(.data, ..., .method = "mean", .append = FALSE, .drop = FALSE)
.data |
A data frame with columns to be summed or averaged over. |
... |
A comma separated set of named tidy selectors, each of which selects a set of columns to which to apply the totalling function. |
.method |
The method used to calculate the total. Must be one of "mean", "sum", or "sum_like". The "mean" is the arithmetic mean, skipping missing values. The "sum" is the sum, skipping missing values. The "sum_like" is the arithmetic mean, again skipping missing values, multiplied by the number of elements, including missing values. |
.append |
Logical. If FALSE, only the totals are returned. If TRUE, the totals are appended as new columns to the original data frame. |
.drop |
Logical. If .append is TRUE and .drop is TRUE, then the variables being aggregated over are not returned. |
A new data frame with columns representing the total scores.
# Calculate the mean of all items beginning with `x_` and separately all items beginning with `y_`
total_scores(test_psychometrics, x = starts_with('x_'), y = starts_with('y_'))
# Calculate the sum of all items beginning with `z_` and separately all items beginning with `x_`
total_scores(test_psychometrics, .method = 'sum', z = starts_with('z_'), x = starts_with('x_'))
# Calculate the mean of all items from `x_1` to `y_10`
total_scores(test_psychometrics, xy = x_1:y_10)
# Calculate the mean of all items beginning with `x_` and separately all items beginning with `y_`,
# but append these means to the original, after dropping the variables that
# are aggregated over
total_scores(test_psychometrics, x = starts_with('x_'), y = starts_with('y_'), .append = TRUE, .drop = TRUE)
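The difference between the three `.method` options is easiest to see on a small example with missing values. The sketch below computes the quantities as they are described in the argument table, using base R on a made-up three-item scale; it is an illustration of the definitions, not the package's implementation.

```r
# Made-up item scores for three respondents; NA marks a missing response
items <- data.frame(
  x_1 = c(1, 2, NA),
  x_2 = c(3, NA, NA),
  x_3 = c(5, 6, 9)
)
m <- as.matrix(items)

# "mean": arithmetic mean of each row, skipping missing values
row_mean <- rowMeans(m, na.rm = TRUE)

# "sum": sum of each row, skipping missing values
row_sum <- rowSums(m, na.rm = TRUE)

# "sum_like": the mean (skipping missing values) multiplied by the
# number of items, counting the missing ones too
row_sum_like <- row_mean * ncol(m)

data.frame(row_mean, row_sum, row_sum_like)
```

For the first respondent, who has no missing items, `sum` and `sum_like` agree. For the others, `sum_like` scales the total up to what it would be if the missing items had scored at that respondent's average.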
This function is a wrapper around a typical `ggplot`-based box-and-whisker plot, i.e. using `geom_boxplot`, which implements the Tukey variant of the box-and-whisker plot. The `y` variable is the outcome variable whose distribution is represented by the box-and-whisker plot. If the `x` variable is missing, then a single box-and-whisker plot using all values of `y` is shown. If an `x` variable is used, it is treated as the independent variable, and one box-and-whisker plot is provided for each set of `y` values that corresponds to each unique value of `x`. For this reason, `x` is usually a categorical variable. If `x` is a continuous numeric variable, it ideally should have relatively few unique values, so that each value of `x` corresponds to a sufficiently large set of `y` values.
tukeyboxplot(
  y,
  x,
  data,
  by = NULL,
  jitter = FALSE,
  box_width = 1/3,
  jitter_width = 1/5,
  xlab = NULL,
  ylab = NULL
)
y |
The outcome variable |
x |
The optional independent/predictor/grouping variable |
data |
The data frame with the |
by |
An optional variable, usually categorical (factor or character), by which the points in the box-and-whisker plots are grouped and coloured. |
jitter |
A logical variable, defaulting to |
box_width |
The width of box in each box-and-whisker plot. The default
used, |
jitter_width |
The width of the jitter relative to box width. For
example, set |
xlab |
The label of the x-axis (defaults to the |
ylab |
The label of the y-axis (defaults to the |
A `ggplot2::ggplot` object, which may be modified with further `ggplot2` commands.
# A single box-and-whisker plot
tukeyboxplot(y = time, data = vizverb)
# One box-and-whisker plot for each value of a categorical variable
tukeyboxplot(y = time, x = task, data = vizverb)
# Box-and-whisker plots with jitters
tukeyboxplot(y = time, x = task, data = vizverb, jitter = TRUE)
# `tukeyboxplot` can be used with a continuous numeric variable too
tukeyboxplot(y = len, x = dose, data = ToothGrowth)
tukeyboxplot(y = len, x = dose, data = ToothGrowth, by = supp, jitter = TRUE, box_width = 0.5, jitter_width = 1)
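For readers unfamiliar with the Tukey variant, the quantities such a plot displays can be computed by hand. The sketch below follows the usual convention in which whiskers reach the most extreme observations lying within 1.5 times the interquartile range of the box; it uses base R's default quantile definition, so values may differ slightly from other implementations that compute the hinges differently.

```r
y <- ToothGrowth$len

# The box: lower quartile, median, upper quartile
q <- quantile(y, c(0.25, 0.5, 0.75))
iqr <- q[3] - q[1]

# Fences at 1.5 * IQR beyond the edges of the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whisker_low <- min(y[y >= lower_fence])
whisker_high <- max(y[y <= upper_fence])

# Observations beyond the fences are drawn as individual points
outliers <- y[y < lower_fence | y > upper_fence]

c(whisker_low = whisker_low, q, whisker_high = whisker_high)
outliers
```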
An experiment studying the interaction between visual versus verbal perception and visual versus verbal responses.
vizverb
A data frame with 80 observations on the following 5 variables.
Subject identifying number (`s1` to `s20`)
Describe a diagram (`visual`) or a sentence (`verbal`)
Point response (`visual`) or say response (`verbal`)
Response time (in seconds)
Subjects carried out two kinds of tasks. One task was visual (describing a diagram), and the other was classed as verbal (reading and describing a sentence). They reported the results either by pointing (a "visual" response) or by speaking (a "verbal" response). Time to complete each task was recorded in seconds.
This data set was taken from the Stats2Data R package. From the description in that package, the original data appear to have been collected in a Mount Holyoke College psychology class based replication of an experiment by Brooks, L. R. (1968) "Spatial and verbal components of the act of recall," Canadian J. Psych. V 22, pp. 349 - 368.