Package 'psyntur'

Title: Helper Tools for Teaching Statistical Data Analysis
Description: Provides functions and data-sets that are helpful for teaching statistics and data analysis. It was originally designed for use when teaching students in the Psychology Department at Nottingham Trent University.
Authors: Mark Andrews [aut, cre], Jens Roeser [aut]
Maintainer: Mark Andrews <[email protected]>
License: GPL-3
Version: 0.1.1
Built: 2024-11-13 09:19:18 UTC
Source: https://github.com/mark-andrews/psyntur

Help Index


Anthropometric data from US Army Personnel

Description

Data on the height, weight, and handedness of men and women of different ages and different races.

Usage

ansur

Format

A data frame with 6068 observations on 9 variables.

subjectid

Unique ID of the person

gender

Binary variable indicating the subject's sex: male or female.

height

Height in centimeters.

weight

Weight in kilograms.

handedness

Categorical variable indicating if the person is left handed, right handed, or both.

age

Age in years

race

Race, with categories like white, black, hispanic.

height_tercile

The tercile of the person's height.

age_tercile

The tercile of the person's age.

Source

This data is a transformed version of data sets obtained from the Anthropometric Survey of US Army Personnel (ANSUR 2 or ANSUR II).


Cohen's d and Hedges' g effect size

Description

This is a wrapper to the effsize::cohen.d() function.

Usage

cohen_d(...)

Arguments

...

A comma separated list of arguments. See effsize::cohen.d().

Value

A list of class effsize as returned by effsize::cohen.d().

Examples

cohen_d(weight ~ gender, data = ansur)
cohen_d(age ~ gender, data = schizophrenia)
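
As a rough illustration of the quantity being reported, the sketch below computes the textbook pooled-standard-deviation form of Cohen's d, and a common approximation to Hedges' g, in base R. The helper name `cohens_d_pooled` and the use of the built-in `ToothGrowth` data are assumptions made for illustration; this is not the code used by effsize::cohen.d(), whose defaults may differ.

```r
# Cohen's d for two independent groups, using the pooled standard deviation.
# `cohens_d_pooled` is a hypothetical helper, not part of psyntur or effsize.
cohens_d_pooled <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_var <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

# Built-in data: tooth growth under two supplement types.
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]
d <- cohens_d_pooled(oj, vc)

# Hedges' g shrinks d slightly to correct small-sample bias
# (a widely used approximation to the exact correction factor).
n_total <- length(oj) + length(vc)
g <- d * (1 - 3 / (4 * n_total - 9))

print(c(cohens_d = d, hedges_g = g))
```

The correction factor is always below 1, so g is slightly smaller in magnitude than d, and the two converge as the sample size grows.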

Test for Correlation Between Paired Samples

Description

This function is a wrapper around stats::cor.test(). By default, it performs Pearson's correlation test, which tests the null hypothesis that two paired samples of values are unrelated. This function must be applied to two numeric vectors.

Usage

cor_test(x, y, method = "pearson", data)

Arguments

x

A numeric variable.

y

A numeric variable.

method

A character string indicating which correlation coefficient is to be used: "pearson", "kendall", or "spearman". Default method is "pearson".

data

A data frame containing the y and x variables

Value

A tibble data frame with the correlation statistic, and the corresponding p-value.

Examples

cor_test(y = sex_dimorph, x = attractive, data = faithfulfaces)
cor_test(y = sex_dimorph, x = attractive, method = "spearman", data = faithfulfaces)
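
For readers curious about what the wrapper is summarising, here is a minimal base R sketch that calls stats::cor.test() directly and collects the main results in a one-row data frame. The built-in `mtcars` data and the column names `cor`, `t`, `df`, and `p_value` are assumptions for illustration; the exact columns returned by cor_test() may differ.

```r
# Pearson correlation test between two numeric vectors.
result <- stats::cor.test(mtcars$mpg, mtcars$wt, method = "pearson")

# Collect the main pieces of the "htest" object in a one-row data frame.
# The column names here are illustrative only.
summary_row <- data.frame(
  cor = unname(result$estimate),    # sample correlation coefficient
  t = unname(result$statistic),     # t statistic for H0: correlation = 0
  df = unname(result$parameter),    # degrees of freedom, n - 2
  p_value = result$p.value
)

print(summary_row)
```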

Test for Correlation Between Paired Samples for 2 or More Variables

Description

This function is a wrapper around stats::cor.test(). By default, it performs Pearson's correlation test, which tests the null hypothesis that paired samples of values are unrelated, for each pair among two or more variables. This function can be applied to two or more numeric variables in the provided data.

Usage

cor_test_multi(
  .data,
  ...,
  .pvalues = FALSE,
  .ci = FALSE,
  .as_matrix = TRUE,
  .omit_redundancies = FALSE,
  .method = "pearson"
)

Arguments

.data

A data frame.

...

Variables for which the correlation coefficient should be returned. If no variable name is provided, correlations will be returned for all numeric variables in .data.

.pvalues

Logical. If FALSE (default), p-values will be omitted from the output. If TRUE, p-values will be included in the output.

.ci

Logical. If FALSE (default), 95% confidence interval bounds will be omitted from the output. If TRUE, 95% confidence interval bounds will be included in the output.

.as_matrix

Logical. If TRUE (default), results will be returned as a matrix. If FALSE, results will be returned as a tibble.

.omit_redundancies

Logical. If FALSE (default), all n^2 correlations will be included in the output. If TRUE, only unique correlations will be returned (x ~ y but not y ~ x) and the correlation of a variable with itself will be omitted.

.method

A character string indicating which correlation coefficient is to be used: "pearson", "kendall", or "spearman". Default method is "pearson".

Value

By default a matrix with correlation coefficients. Output format and included statistics can be changed in the argument settings.

Examples

# Calculate the correlations between all numeric variables in the `faithfulfaces` data.
cor_test_multi(faithfulfaces)
# Calculate the correlations between the 1st, 2nd and 4th variable.
cor_test_multi(faithfulfaces, c(1,2,4))
# Calculate the correlations between `sex_dimorph`, `attractive`, and `trustworthy`.
cor_test_multi(faithfulfaces, sex_dimorph, attractive, trustworthy)
# Calculate all correlations and return p-values and 95% confidence intervals.
cor_test_multi(faithfulfaces, .pvalues = TRUE, .ci = TRUE)
# Calculate all correlations with p-values and 95% confidence intervals and 
# return results as table with only unique pairs of the off-diagonal correlations.
cor_test_multi(faithfulfaces, .pvalues = TRUE, .ci = TRUE, .as_matrix = FALSE, 
.omit_redundancies = TRUE)
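
The difference between the full matrix of n^2 correlations and the table of unique off-diagonal pairs can be sketched in base R as follows. This uses the built-in `mtcars` data and hypothetical column names (`x`, `y`, `cor`); it illustrates the idea behind .as_matrix and .omit_redundancies and is not the package's implementation.

```r
vars <- mtcars[, c("mpg", "wt", "hp")]

# Full symmetric matrix: all n^2 correlations, with 1s on the diagonal.
cor_matrix <- cor(vars, method = "pearson")

# Keep only the upper triangle: each unordered pair once, no self-correlations.
pairs <- which(upper.tri(cor_matrix), arr.ind = TRUE)
unique_pairs <- data.frame(
  x = rownames(cor_matrix)[pairs[, "row"]],
  y = colnames(cor_matrix)[pairs[, "col"]],
  cor = cor_matrix[pairs]
)

print(cor_matrix)
print(unique_pairs)
```

With n variables, the matrix holds n^2 entries while the table of unique pairs holds n(n - 1)/2 rows, here 9 versus 3.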

Calculate Cronbach's alpha for sets of psychometric scale items

Description

This function calculates the Cronbach alpha for one or more sets of psychometric scale items. Each item is a variable in a data frame. Each set of items is defined by a tidy selection of a set of items.

Usage

cronbach(.data, ..., .ci = 0.95)

Arguments

.data

A data frame with columns that are psychometric items.

...

A set of comma separated tidy selectors that selects sets of columns from .data. For each set of columns, the Cronbach's alpha is computed.

.ci

The value of the confidence interval to calculate.

Value

A data frame whose rows are psychometric scales and for each scale, we have the Cronbach's alpha, and the lower and upper bound of the confidence interval on alpha.

Examples

# Return the Cronbach alpha and 95% ci for two scales.
 # The first scale, named `x`, is identified by all items beginning with `x_`.
 # The second scale, named `y`, is identified by the consecutive items from `y_1` to `y_10`.
 cronbach(test_psychometrics,
          x = starts_with('x'),
          y = y_1:y_10)
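
As a reminder of the statistic being computed, the following base R sketch evaluates the standard formula for Cronbach's alpha, alpha = k/(k - 1) * (1 - sum of item variances / variance of total score), for a single scale. The helper name `cronbach_alpha` and the small hand-made data set are assumptions for illustration; the sketch does not compute the confidence interval that cronbach() reports.

```r
# Cronbach's alpha for one scale whose items are the columns of `items`.
# `cronbach_alpha` is a hypothetical helper, not the psyntur function.
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                          # number of items in the scale
  item_variances <- apply(items, 2, var)    # variance of each item
  total_variance <- var(rowSums(items))     # variance of the total score
  (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
}

# A small made-up scale with three items answered by five respondents.
scale_x <- data.frame(
  x_1 = c(1, 2, 3, 4, 5),
  x_2 = c(2, 2, 3, 5, 5),
  x_3 = c(1, 3, 3, 4, 4)
)
alpha_x <- cronbach_alpha(scale_x)
print(alpha_x)
```

When all items are identical, the item variances sum to exactly 1/k of the total-score variance and alpha equals 1; weaker agreement between items pulls alpha down.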

A density plot

Description

This is a wrapper to the typical ggplot based density plot, i.e., using geom_density. A continuous variable, x, is required as an input. Optionally, a by categorical variable can be provided.

Usage

densityplot(
  x,
  data,
  by = NULL,
  position = "stack",
  facet = NULL,
  facet_type = "wrap",
  alpha = 1,
  xlab = NULL,
  ylab = NULL
)

Arguments

x

The numeric variable that is to be density plotted.

data

A data frame with at least one numeric variable (the x variable).

by

A categorical variable by which to group the x values. If provided there will be one density plot for each set of x values grouped by the values of the by variable.

position

If the by variable is provided, these multiple density plots can be positioned either stacked (position = 'stack') or superimposed (position = 'identity').

facet

A character string or character vector. If provided, we facet_wrap (by default) the density plot by the variables. This is equivalent to the facet_wrap(variables) in ggplot2.

facet_type

By default, this takes the value of wrap, and facet leads to a facet wrap. If facet_type is grid, then facet gives us a facet_grid.

alpha

The transparency to use for the filled density plots. This is probably only required when using position = 'identity'.

xlab

The label of the x-axis (defaults to the x variable name).

ylab

The label of the y-axis (defaults to the y variable name).

Value

A ggplot2::ggplot object, which may be modified with further ggplot2 commands.

Examples

densityplot(x = age, data = schizophrenia, by = gender)

Calculate descriptive statistics

Description

This function is a lightweight wrapper to dplyr's summarize function. It can be used to calculate any descriptive or summary statistic for any variable in the data set. Optionally, a by grouping variable can be used, and then the summary statistics are calculated for each subgroup defined by the different values of the by variable.

Usage

describe(data, by = NULL, ...)

Arguments

data

A data frame

by

A grouping variable. If included, the data will be grouped by the values of the by variable before the summary statistics are applied.

...

Arguments of functions applied to variables, e.g. avg = mean(x).

Value

A tibble data frame with each row providing descriptive statistics for selected variables for each value of the grouping by variable.

Examples

describe(faithfulfaces, avg = mean(faithful), stdev = sd(faithful))
describe(faithfulfaces, by = face_sex, avg = mean(faithful), stdev = sd(faithful))
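
To make the grouping behaviour concrete, here is a rough base R equivalent of computing named summary statistics overall and then per group. It uses the built-in `mtcars` data, with `cyl` standing in for the by variable; it is a sketch of the idea only, since describe() itself delegates to dplyr's summarize.

```r
# Overall summary: one row, one column per named statistic.
overall <- data.frame(avg = mean(mtcars$mpg), stdev = sd(mtcars$mpg))

# Grouped summary: split the data by the grouping variable, summarise each
# subgroup, then stack the one-row results.
groups <- split(mtcars, mtcars$cyl)
by_cyl <- do.call(rbind, lapply(groups, function(g) {
  data.frame(cyl = g$cyl[1], avg = mean(g$mpg), stdev = sd(g$mpg))
}))

print(overall)
print(by_cyl)
```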

Apply multiple descriptive functions to multiple variables

Description

This function is a wrapper to dplyr's summarize used with the across function. For each variable in a set of variables, calculate each summary statistic from a list of summary statistic functions. Optionally, group the variables by a grouping variable, and then calculate the statistics. Optionally, the tibble that is returned by default, which is in a wide format, can be pivoted to a long format.

Usage

describe_across(data, variables, functions, by = NULL, pivot = FALSE)

Arguments

data

A data frame

variables

A vector of variables in data

functions

A list of summary statistic functions. If it is a named list, which is recommended, the names of the functions will be used to make the column names of the returned data frame.

by

A grouping variable. If included, the data will be grouped by the values of the by variable before the summary statistics are applied.

pivot

A logical variable indicating if the wide format data frame that is returned by default should be pivoted to a long format.

Value

A tibble data frame. If pivot = F, which is the default, the data frame contains one row per value of the by variable, or just one row overall if there is no by variable. If pivot = T, there will be k + 1 columns if there is no by variable, or k + 2 columns if there is a by variable, where k is the number of functions.

Examples

describe_across(faithfulfaces, 
                variables = c(trustworthy, faithful), 
                functions = list(avg = mean, stdev = sd),
                pivot = TRUE)
describe_across(faithfulfaces, 
                variables = c(trustworthy, faithful), 
                functions = list(avg = mean, stdev = sd), 
                by = face_sex)
describe_across(faithfulfaces, 
                variables = c(trustworthy, faithful), 
                functions = list(avg = mean, stdev = sd), 
                by = face_sex,
                pivot = TRUE)
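
The wide and long shapes described under Value can be sketched in base R as below, without a by variable. The built-in `mtcars` data and the `variable_function` naming of the wide columns (for example `mpg_avg`) are assumptions for illustration; the column names produced by describe_across() may differ.

```r
vars <- c("mpg", "wt")
funs <- list(avg = mean, stdev = sd)

# Long shape: one row per variable and one column per function,
# giving k + 1 columns for k functions.
stats_matrix <- sapply(funs, function(f) sapply(mtcars[vars], f))
long <- data.frame(variable = vars, stats_matrix, row.names = NULL)

# Wide shape: a single row with one column per variable/function pair.
wide_values <- unlist(lapply(vars, function(v) {
  sapply(funs, function(f) f(mtcars[[v]]))
}))
names(wide_values) <- as.vector(
  outer(names(funs), vars, function(f, v) paste(v, f, sep = "_"))
)
wide <- as.data.frame(as.list(wide_values))

print(long)
print(wide)
```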

Drop rows if all values on selected columns are missing

Description

Remove a row if all values on selected columns, or by default, on all columns, are missing, i.e. have values of NA or NaN.

Usage

drop_if_all_na(data, ...)

Arguments

data

A data frame

...

<tidy-select> Columns to inspect for missing values. If empty, all columns are used.

Details

The drop_na function will remove any row if it has any NA in selected columns. By default, it will remove the row if there is any NA or NaN in any column. This drop_if_all_na function is similar but removes the row only if all values in the selected columns are NA or NaN. As with drop_na, by default it will use all columns. In other words, by default, drop_if_all_na removes any row if all values on that row are NA or NaN.

Value

A data frame, possibly with some rows dropped.

Examples

data_df <- data.frame(x = c(1, 2, NA, NA), y = c(2, NA, 5, NA))

drop_if_all_na(data_df)
drop_if_all_na(data_df, x)
drop_if_all_na(data_df, y)
drop_if_all_na(data_df, x, y)
drop_if_all_na(data_df, x:y)
drop_if_all_na(data_df, starts_with('x'), ends_with('y'))
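
The contrast between "any value missing" and "all values missing" can be sketched in base R as follows. The helper name `drop_if_all_missing` and its character-vector `cols` argument are assumptions for illustration; the real function accepts tidy selectors instead.

```r
# Drop a row only when every selected column is NA or NaN in that row.
# `drop_if_all_missing` is a hypothetical base R stand-in.
drop_if_all_missing <- function(data, cols = names(data)) {
  all_missing <- rowSums(!is.na(data[cols])) == 0
  data[!all_missing, , drop = FALSE]
}

data_df <- data.frame(x = c(1, 2, NA, NA), y = c(2, NA, 5, NA))

all_cols <- drop_if_all_missing(data_df)        # drops only row 4
only_x <- drop_if_all_missing(data_df, "x")     # drops rows 3 and 4

# For comparison, the "any NA" rule keeps only fully observed rows (row 1).
any_na <- data_df[stats::complete.cases(data_df), , drop = FALSE]

print(all_cols)
print(only_x)
print(any_na)
```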

Analysis of variance

Description

This is a wrapper to the ez::ezANOVA() function.

Usage

ez_anova(
  data,
  dv,
  wid,
  within = NULL,
  within_full = NULL,
  within_covariates = NULL,
  between = NULL,
  between_covariates = NULL,
  observed = NULL,
  diff = NULL,
  reverse_diff = FALSE,
  type = 2,
  white.adjust = FALSE,
  detailed = FALSE,
  return_aov = FALSE
)

Arguments

data

Data frame containing the data to be analyzed.

dv

Name of the column in data that contains the dependent variable. Values in this column must be numeric.

wid

Name of the column in data that contains the variable specifying the case/Ss identifier. This should be a unique value per case/Ss.

within

Names of columns in data that contain predictor variables that are manipulated (or observed) within-Ss.

within_full

Same as within, but intended to specify the full within-Ss design in cases where the data have not already been collapsed to means per condition specified by within and when within only specifies a subset of the full design.

within_covariates

Names of columns in data that contain predictor variables that are manipulated (or observed) within-Ss and are to serve as covariates in the analysis.

between

Names of columns in data that contain predictor variables that are manipulated (or observed) between-Ss.

between_covariates

Names of columns in data that contain predictor variables that are manipulated (or observed) between-Ss and are to serve as covariates in the analysis.

observed

Names of columns in data that are already specified in either within or between that contain predictor variables that are observed variables (i.e. not manipulated).

diff

Names of any variables to collapse to a difference score. If a single value, may be specified by name alone; if multiple values, must be specified as a .() list.

reverse_diff

Logical. If TRUE, triggers reversal of the difference collapse requested by diff. Take care with variables with more than 2 levels.

type

Numeric value (either 1, 2 or 3) specifying the Sums of Squares type to employ when data are unbalanced (e.g. when group sizes differ).

white.adjust

Only affects behaviour if the design contains only between-Ss predictor variables. If not FALSE, the value is passed as the white.adjust argument to Anova, which provides heteroscedasticity correction.

detailed

Logical. If TRUE, returns extra information (sums of squares columns, intercept row, etc.) in the ANOVA table.

return_aov

Logical. If TRUE, computes and returns an aov object corresponding to the requested ANOVA (useful for computing post-hoc contrasts).

Value

A list containing one or more components as returned by ez::ezANOVA().

Examples

ez_anova(data = selfesteem2_long,
            dv = score,
            wid = id,
            within = c(time, treatment),
            detailed = TRUE,
            return_aov = TRUE)

Faithfulness from a Photo?

Description

Ratings from a facial photo and actual faithfulness.

Usage

faithfulfaces

Format

A data frame with 170 observations on the following 7 variables.

sex_dimorph

Rating of sexual dimorphism (masculinity for males, femininity for females)

attractive

Rating of attractiveness

cheater

Was the face subject unfaithful to a partner?

trustworthy

Rating of trustworthiness

faithful

Rating of faithfulness

face_sex

Sex of face (female or male)

rater_sex

Sex of rater (female or male)

Details

College students were asked to look at a photograph of an opposite-sex adult face and to rate the person, on a scale from 1 (low) to 10 (high), for attractiveness. They were also asked to rate trustworthiness, faithfulness, and sexual dimorphism (i.e., how masculine a male face is and how feminine a female face is). Overall, 68 students (34 males and 34 females) rated 170 faces (88 men and 82 women).

Source

This data set was taken from the Stats2Data R package. From the description in that package, the original is based on G. Rhodes et al. (2012), "Women can judge sexual unfaithfulness from unfamiliar men's faces," Biology Letters, November 2012. All of the 68 raters were heterosexual Caucasians, as were the 170 persons who were rated. (We have deleted 3 subjects with missing values and 16 subjects who were over age 35.)


Show the dummy code of a categorical variable

Description

For each value of a categorical variable, show the binary code used in a regression model to represent its value. This is a wrapper to the fastDummies::dummy_cols() function.

Usage

get_dummy_code(Df, variable)

Arguments

Df

A data frame

variable

A categorical variable (e.g. character vector or factor).

Value

A data frame whose rows provide the dummy code for each distinct value of variable.

Examples

get_dummy_code(PlantGrowth, group)
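
For comparison, the treatment (dummy) coding that R's model-fitting functions apply to a factor by default can be inspected with base R's model.matrix(). The sketch below is an illustration of that coding and is not how get_dummy_code() is implemented; the column names `grouptrt1` and `grouptrt2` are those generated by model.matrix().

```r
# Design matrix for a model with `group` as the only predictor.
design <- model.matrix(~ group, data = PlantGrowth)

# Pair each level of `group` with its dummy columns (dropping the intercept)
# and keep one row per distinct level.
dummy_code <- unique(
  data.frame(group = PlantGrowth$group, design[, -1, drop = FALSE])
)
rownames(dummy_code) <- NULL

print(dummy_code)
```

The reference level, `ctrl`, is coded as 0 on both dummy variables, and each of the other two levels is coded as 1 on exactly one of them.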

A histogram

Description

This is a wrapper to the typical ggplot based histogram, i.e., using geom_histogram. A continuous variable, x, is required as an input. Optionally, a by categorical variable can be provided.

Usage

histogram(
  x,
  data,
  by = NULL,
  position = "stack",
  facet = NULL,
  facet_type = "wrap",
  bins = 10,
  alpha = 1,
  xlab = NULL,
  ylab = NULL
)

Arguments

x

The numeric variable that is to be histogrammed.

data

A data frame with at least one numeric variable (the x variable).

by

A categorical variable by which to group the x values. If provided there will be one histogram for each set of x values grouped by the values of the by variable.

position

If the by variable is provided, there are three ways these multiple histograms can be positioned: stacked (position = 'stack'), side by side (position = 'dodge'), superimposed (position = 'identity').

facet

A character string or character vector. If provided, we facet_wrap (by default) the histogram by the variables. This is equivalent to the facet_wrap(variables) in ggplot2.

facet_type

By default, this takes the value of wrap, and facet leads to a facet wrap. If facet_type is grid, then facet gives us a facet_grid.

bins

The number of bins to use in the histogram.

alpha

The transparency to use for the filled histogram bars. This is probably only required when using position = 'identity'.

xlab

The label of the x-axis (defaults to the x variable name).

ylab

The label of the y-axis (defaults to the y variable name).

Value

A ggplot2::ggplot object, which may be modified with further ggplot2 commands.

Examples

histogram(x = age, data = schizophrenia, by = gender, bins = 20)
histogram(x = age, data = schizophrenia, by = gender, position = 'identity', bins = 20, alpha = 0.7)
histogram(x = age, data = schizophrenia, by = gender, position = 'dodge', bins = 20)
histogram(x = weight, bins = 20, data = ansur, facet = height_tercile)
histogram(x = weight, bins = 20, data = ansur, 
          facet = c(height_tercile, age_tercile), facet_type = 'grid')

Make an interaction line plot

Description

Make an interaction line plot

Usage

interaction_line_plot(y, x, by, data, ylim = NULL, xlab = NULL, ylab = NULL)

Arguments

y

A continuous variable to be plotted along the y-axis

x

A continuous variable to be plotted along the x-axis

by

A categorical variable by which we split the data and create one line plot for each resulting group

data

A data frame with the x, y, by variables

ylim

A vector of limits for the y-axis

xlab

The label of the x-axis (defaults to the x variable name).

ylab

The label of the y-axis (defaults to the y variable name).

Value

A ggplot2::ggplot object, which may be modified with further ggplot2 commands.

Examples

interaction_line_plot(y = score, x = time, by = treatment, 
                      data = selfesteem2_long, ylim = c(70, 100))
interaction_line_plot(y = score, x = time, by = treatment, 
                      data = selfesteem2_long, 
                      xlab = 'measurement time',
                      ylab = 'self esteem score',
                      ylim = c(70, 100))

Job Satisfaction Data for Two-Way ANOVA

Description

Contains the job satisfaction score organized by gender and education level. This data set was taken from the datarium R package.

Usage

data("jobsatisfaction")

Format

A data frame with 58 rows and 3 columns.

Examples

data(jobsatisfaction)
jobsatisfaction

Paired samples t-test

Description

A wrapper to stats::t.test() with paired = TRUE.

Usage

paired_t_test(y1, y2, data, ...)

Arguments

y1

A numeric vector of observations

y2

A numeric vector of observations, where each value of y2 is assumed to be paired, such as by repeated measures, with the corresponding value of y1.

data

A data frame with y1 and y2 as variables.

...

Additional arguments passed to stats::t.test().

Value

A list with class "htest" as returned by stats::t.test().

Examples

paired_t_test(y1, y2, data = pairedsleep)
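
A paired samples t-test is equivalent to a one-sample t-test on the pairwise differences, which the base R sketch below illustrates using datasets::sleep directly. The way `y1` and `y2` are extracted here is an assumption for illustration and may not match exactly how the pairedsleep data were constructed.

```r
# Extra hours of sleep for the same 10 patients under each drug.
y1 <- sleep$extra[sleep$group == 1]
y2 <- sleep$extra[sleep$group == 2]

# Paired test on the two vectors ...
paired <- t.test(y1, y2, paired = TRUE)

# ... gives the same result as a one-sample test of the differences against 0.
one_sample <- t.test(y1 - y2, mu = 0)

print(paired$statistic)
print(one_sample$statistic)
```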

Paired sleep data

Description

Data which show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.

Usage

pairedsleep

Format

A data frame with 10 observations on the following 3 variables.

ID

The patient ID.

y1

The increase in hours, relative to control, for drug 1.

y2

The increase in hours, relative to control, for drug 2.

Source

This data is a transformed version of datasets::sleep.


A pairs plot

Description

This is a wrapper to the GGally based pairs plot of a list of variables displayed as scatterplots for pairs of continuous variables, density functions in the diagonal, and boxplots for pairs of continuous and categorical variables. Optionally, a by categorical variable can be provided.

Usage

pairs_plot(variables, data, by = NULL)

Arguments

variables

A vector of variable names

data

The data frame.

by

An optional variable, usually categorical (factor or character), by which the data are grouped and coloured.

Value

A GGally::ggpairs plot.

Examples

# A simple pairs plot
pairs_plot(variables = c("sex_dimorph", "attractive"),
data = faithfulfaces)
# A pairs plot with grouping variable
pairs_plot(variables = c("sex_dimorph", "attractive"),
by = face_sex,
data = faithfulfaces)

Pairwise t-test

Description

This is a wrapper to the pairwise.t.test function. The p-value adjustment is "bonferroni" by default. Other possible values are "holm", "hochberg", "hommel", "BH", "BY", "fdr", "none". See stats::p.adjust().

Usage

pairwise_t_test(formula, data, p_adj = "bonferroni")

Arguments

formula

A two sided formula with one variable on either side, e.g. y ~ x, where the left hand side, dependent, variable is a numeric variable in data and the right hand side, independent, variable is a categorical or factor variable in data.

data

A data frame that contains the dependent and independent variables.

p_adj

The p-value adjustment method (see Description).

Value

An object of class pairwise.htest as returned by stats::pairwise.t.test().

Examples

data_df <- dplyr::mutate(vizverb, IV = interaction(task, response))
pairwise_t_test(time ~ IV, data = data_df)
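
To show what the p-value adjustment does, the sketch below applies stats::p.adjust() to three made-up raw p-values and then runs stats::pairwise.t.test() on the built-in `PlantGrowth` data. The raw p-values are invented purely for illustration.

```r
# Three hypothetical unadjusted p-values from three pairwise comparisons.
raw_p <- c(0.01, 0.04, 0.5)

# Bonferroni multiplies each p-value by the number of tests, capped at 1.
bonf <- p.adjust(raw_p, method = "bonferroni")   # 0.03 0.12 1.00

# Holm is a step-down variant that is never more conservative than Bonferroni.
holm <- p.adjust(raw_p, method = "holm")         # 0.03 0.08 0.50

# The underlying base R function, applied to a built-in data set.
tests <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                         p.adjust.method = "bonferroni")
print(tests$p.value)
```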

Recode specified values by new values

Description

Recode specified values by new values

Usage

re_code(x, from, to)

Arguments

x

A vector, including a column of a data frame

from

The set of old values to be replaced by new ones

to

The set of new values to replace the old ones

Value

A vector that is the input vector but with old values replaced by new ones.

Examples

# Replace any occurrence of 1 and 2 with 101 and 201, respectively
x <- c(1, 2, 3, 4, 5, 1, 2)
re_code(x, from = c(1, 2), to = c(101, 201))
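
One way to think about this kind of recoding is as a lookup with match(), sketched below in base R. The helper name `recode_values` is hypothetical, and this is an illustration of the idea and not the package's implementation.

```r
# Replace each value of `x` found in `from` by the value at the same
# position in `to`; values not listed in `from` are left unchanged.
recode_values <- function(x, from, to) {
  stopifnot(length(from) == length(to))
  position <- match(x, from)      # index into `from`, or NA if not listed
  hit <- !is.na(position)
  x[hit] <- to[position[hit]]
  x
}

x <- c(1, 2, 3, 4, 5, 1, 2)
recoded <- recode_values(x, from = c(1, 2), to = c(101, 201))
print(recoded)   # 101 201 3 4 5 101 201
```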

Remove an additional header row from a data frame

Description

Remove the first row of a data frame assuming that row was essentially a second (and redundant) header row in the original raw data file. After that row is removed, the data frame is reparsed to reinfer the data-types of each column.

Usage

remove_double_header(data_df)

Arguments

data_df

A data frame where it is assumed that the first row provides redundant header information and so it needs to be removed.

Details

Some software, including Qualtrics (survey software) and Gorilla (behavioural experiment software), sometimes export their data where the first two rows are both essentially headers, i.e., column labels. These two rows are not identical and often the second is redundant and so needs to be skipped. Data import functions like read_csv, and many others, do not let you skip the second row if the first row is not skipped. On the other hand, it is easy to read in all the data as per usual and then use, for example, slice, to remove the second row in the original. For example, slice(data_df, -1) will remove the first row in the data frame named data_df, which would be the second row of the original data file (assuming, as is common, that the first row of the original was used as the header to create the column names).

Although removing one row is easy to accomplish using basic tools in R, the bigger problem is that when the data was originally imported, the import function probably parsed all columns as character vectors. This is because the presence of header information in the second row of the original data, which is usually parsed as strings, forced the parser in a function like read_csv to parse the whole column as a character vector. After that second header row is removed, all the columns still remain as character vectors even though they could be numeric, logical, etc. It is possible to use, for example, mutate and across to recode these columns, but that is not always possible with one simple command.

An alternative approach is, after the header row is removed, to reparse all the columns to infer their data types and then automatically recode them. This is what is done in this function. The parser that is used is the one used by readr.

Note that this reparsing is no more, and no less, foolproof than what happens whenever we use, for example, read_csv to import data without explicitly specifying the data type for each column, which is commonly done. Given this, it is wise to check the new data types to make sure that there are no errors.

Value

A new data frame where the data types of all columns were re-inferred after the first row was removed.

Examples

double_headered_csv <- '
a,b,c
x,x,x
1,2024/12/27,TRUE
2,2024/12/17,TRUE
3,2024/12/27,FALSE
'
readr::read_csv(double_headered_csv) |>
  remove_double_header()
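
The two steps described in Details, dropping the redundant header row and then re-inferring column types, can be sketched in base R as follows. This sketch uses utils::type.convert() as a stand-in for the readr parser, so the inferred types may differ from what remove_double_header() returns (for example, base R leaves the date column as character here).

```r
raw <- "a,b,c
x,x,x
1,2024/12/27,TRUE
2,2024/12/17,TRUE
3,2024/12/27,FALSE"

# Step 0: the second header row forces every column to be read as character.
imported <- read.csv(text = raw, colClasses = "character")

# Step 1: drop the first data row, i.e. the redundant second header.
clean <- imported[-1, , drop = FALSE]
rownames(clean) <- NULL

# Step 2: re-infer the type of each column from its remaining values.
clean[] <- lapply(clean, utils::type.convert, as.is = TRUE)

str(clean)
```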

Rename selected columns as a sequence

Description

This function will rename a selection of columns as, for example, var_1, var_2, var_3 ... var_10, where the prefix, var in this example, is arbitrary.

Usage

rename_with_seq(data_df, col_selector, prefix = "var")

Arguments

data_df

A data frame

col_selector

A tidy selector, e.g. contains('foo'), ends_with('bar').

prefix

The prefix for the sequence, e.g. 'drug' to produce names like drug_1, drug_2 etc.

Details

If we had, for example, a data frame where columns were the names of drugs and we wanted to rename these columns something like drug_1, drug_2, ..., this would be easy to do with rename if there were just a few columns to rename. When there are more than just a few, individual renaming is somewhat tedious and error prone. We can use rename_with to do this in one operation. However, the code for doing so is not very simple and would require some proficiency in R and tidyverse. This function is essentially just a wrapper to a rename_with function to allow the renaming to be done in one simple command.

Value

A data frame with renamed columns

Examples

data_df <- readr::read_csv('
subject, age, gender, Aripiprazole, Clozapine, Olanzapine, Quetiapine
A, 27, F, 20, 10, 40, 25
B, 23, M, 21, 21, 35, 27
')

rename_with_seq(data_df, col_selector = Aripiprazole:Quetiapine, prefix = 'drug')
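
The renaming itself amounts to replacing the selected names with a prefix followed by a running index, as in this base R sketch. The helper name `rename_as_sequence` and its character-vector `cols` argument are assumptions for illustration; the real function takes a tidy selector.

```r
# Rename the columns listed in `cols` as prefix_1, prefix_2, ...
# `rename_as_sequence` is a hypothetical base R stand-in.
rename_as_sequence <- function(data, cols, prefix = "var") {
  position <- match(cols, names(data))
  stopifnot(!anyNA(position))     # every requested column must exist
  names(data)[position] <- paste0(prefix, "_", seq_along(position))
  data
}

drugs_df <- data.frame(
  subject = c("A", "B"),
  age = c(27, 23),
  Aripiprazole = c(20, 21),
  Clozapine = c(10, 21),
  Olanzapine = c(40, 35)
)

renamed <- rename_as_sequence(
  drugs_df,
  cols = c("Aripiprazole", "Clozapine", "Olanzapine"),
  prefix = "drug"
)
print(names(renamed))   # "subject" "age" "drug_1" "drug_2" "drug_3"
```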

A two dimensional scatterplot

Description

This function is a wrapper around the typical ggplot command to create two dimensional scatterplots, i.e. using geom_point. It provides the option of colouring points by a third variable, one that is usually, though not necessarily, categorical. Also, it provides the option of placing the line of best fit on the scatterplot. If points are coloured by a categorical variable, then a different line of best fit for each value of the categorical variable is provided.

Usage

scatterplot(
  x,
  y,
  data,
  by = NULL,
  best_fit_line = FALSE,
  xlab = NULL,
  ylab = NULL
)

Arguments

x

A numeric variable in data. Its values are plotted on the x axis.

y

A numeric variable in data. Its values are plotted on the y axis.

data

A data frame with the x and y variables.

by

An optional variable, usually categorical (factor or character), by which the points in the scatterplot are grouped and coloured.

best_fit_line

A logical variable indicating if the line of best fit should be shown or not.

xlab

The label of the x-axis (defaults to the x variable name).

ylab

The label of the y-axis (defaults to the y variable name).

Value

A ggplot2::ggplot object, which may be modified with further ggplot2 commands.

Examples

scatterplot(x = attractive, y = trustworthy, data = faithfulfaces)
scatterplot(x = attractive, y = trustworthy, data = faithfulfaces,
            xlab = 'attractiveness', ylab = 'trustworthiness')
scatterplot(x = attractive, y = trustworthy, data = faithfulfaces,
            by = face_sex)
scatterplot(x = trustworthy, y = faithful, data = faithfulfaces,
            by = face_sex, best_fit_line = TRUE)

Make a scatterplot matrix

Description

Make a scatterplot matrix

Usage

scatterplot_matrix(.data, ..., .by = NULL, .bins = 10)

Arguments

.data

A data frame

...

A comma separated list of tidyselections of columns. This can be as simple as a set of column names.

.by

An optional categorical variable by which to group and colour the points.

.bins

The number of bins in the histograms on diagonal of matrix.

Value

A GGally::ggpairs plot.

Examples

data_df <- test_psychometrics %>%
              total_scores(x = starts_with('x_'), 
                           y = starts_with('y_'), 
                           z = starts_with('z_'))
scatterplot_matrix(data_df, x, y, z)

Age of Onset of Schizophrenia Data

Description

Data on sex differences in the age of onset of schizophrenia.

Usage

schizophrenia

Format

A data frame with 251 observations on the following 2 variables.

age

Age at the time of diagnosis.

gender

A categorical variable with values female and male

Details

A sex difference in the age of onset of schizophrenia was noted by Kraepelin (1919). Subsequently, epidemiological studies of the disorder have consistently shown an earlier onset in men than in women. One model that has been suggested to explain this observed difference is known as the subtype model, which postulates two types of schizophrenia, one characterised by early onset, typical symptoms and poor premorbid competence, and the other by late onset, atypical symptoms, and good premorbid competence. The early onset type is assumed to be largely a disorder of men and the late onset largely a disorder of women.

Source

This data set was taken from the HSAUR R package. From the description in that package, the original is E. Kraepelin (1919), Dementia Praecox and Paraphrenia. Livingstone, Edinburgh.


Self-Esteem Score Data for One-way Repeated Measures ANOVA

Description

The dataset contains 10 individuals' self-esteem scores at three time points during a specific diet to determine whether their self-esteem improved.

One-way repeated measures ANOVA can be performed in order to determine the effect of time on the self-esteem score.

This data set was taken from the datarium R package.

Usage

data("selfesteem")

Format

A data frame with 10 rows and 4 columns.

Examples

data(selfesteem)
selfesteem

Self Esteem Score Data for Two-way Repeated Measures ANOVA

Description

Data are the self esteem score of 12 individuals enrolled in 2 successive short-term trials (4 weeks) - control (placebo) and special diet trials.

The self esteem score was recorded at three time points: at the beginning (t1), midway (t2) and at the end (t3) of the trials.

The same 12 participants are enrolled in the two different trials with enough time between trials.

Two-way repeated measures ANOVA can be performed in order to determine whether there is interaction between time and treatment on the self esteem score.

This data set was taken from the datarium R package.

Usage

data("selfesteem2")

Format

A data frame with 24 rows and 5 columns.

Examples

data(selfesteem2)
selfesteem2

Self Esteem Score Data for Two-way Repeated Measures ANOVA: Long format

Description

Data are the self-esteem scores of 12 individuals enrolled in 2 successive short-term trials (4 weeks): a control (placebo) trial and a special diet trial.

The self esteem score was recorded at three time points: at the beginning (t1), midway (t2) and at the end (t3) of the trials.

The same 12 participants are enrolled in the two different trials with enough time between trials.

Two-way repeated measures ANOVA can be performed in order to determine whether there is interaction between time and treatment on the self esteem score.

This data set was converted from the selfesteem2 data taken from the datarium R package.

Usage

data("selfesteem2_long")

Format

A data frame with 72 rows and 4 columns.

id

Unique ID of the person

treatment

Binary variable indicating the treatment condition: Diet or ctr.

time

A categorical variable indicating the time of measurement: beginning (t1), midway (t2) and at the end (t3)

score

Self-esteem score

Examples

data(selfesteem2_long)
selfesteem2_long
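
The conversion from a wide layout (one column per time point) to the long layout described above can be sketched in base R as follows. The toy values and the object names wide and long are made up for illustration, and this is not necessarily how selfesteem2_long was actually produced.

```r
# Toy wide-format data: two people, two treatments, three time points (made-up values)
wide <- data.frame(
  id = rep(1:2, times = 2),
  treatment = rep(c("ctr", "Diet"), each = 2),
  t1 = c(83, 97, 86, 88),
  t2 = c(77, 95, 87, 83),
  t3 = c(69, 88, 87, 81)
)

# Stack the three time columns into a single score column,
# recording which column each score came from in `time`
time_cols <- c("t1", "t2", "t3")
long <- do.call(rbind, lapply(time_cols, function(tc) {
  data.frame(
    id = wide$id,
    treatment = wide$treatment,
    time = tc,
    score = wide[[tc]]
  )
}))
long
```

Each wide row contributes three long rows, which is consistent with 24 wide rows becoming 72 long rows in the real data.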

Shapiro-Wilk normality test

Description

This function is a wrapper around stats::shapiro.test(). It implements the Shapiro-Wilk test, which tests the null hypothesis that a sample of values is a sample from a normal distribution. This function can be applied to single vectors or groups of vectors.

Usage

shapiro_test(y, by = NULL, data)

Arguments

y

A numeric variable whose normality is being tested.

by

An optional grouping variable

data

A data frame containing y and the by variable

Value

A tibble data frame with one row for each value of the by variable, or one row overall if there is no by variable. Each row gives the Shapiro-Wilk statistic and the corresponding p-value for the subset of y values in that group, or for all y values if there is no by variable.

Examples

shapiro_test(faithful, data = faithfulfaces)
shapiro_test(faithful, by = face_sex, data = faithfulfaces)
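
To make the grouped behaviour concrete, the sketch below does roughly the same thing by hand with stats::shapiro.test() and the built-in ToothGrowth data. It is only an approximation of the idea: the column names statistic and p_value are assumptions, and the actual function returns a tibble whose column names may differ.

```r
# Split the outcome by the grouping variable (ToothGrowth ships with base R)
groups <- split(ToothGrowth$len, ToothGrowth$supp)

# Run the Shapiro-Wilk test on each group and collect one row per group
results <- do.call(rbind, lapply(names(groups), function(g) {
  res <- stats::shapiro.test(groups[[g]])
  data.frame(
    supp = g,
    statistic = unname(res$statistic),  # the W statistic
    p_value = res$p.value
  )
}))
results
```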

Descriptive statistics for variables with missing values

Description

Most descriptive statistic functions, like base::sum(), base::mean(), stats::median(), etc., do not skip NA values when computing their results and so always return NA if there is at least one NA in the input vector. The NA values can always be skipped by setting the na.rm argument to TRUE. While this is usually simple to do, in some cases, such as when a function is being passed to another function, setting na.rm = TRUE in that function requires creating a new anonymous function. The functions here, which all end in ⁠_xna⁠, are wrappers to common statistics functions, but with na.rm = TRUE.

Usage

sum_xna(...)

mean_xna(...)

median_xna(...)

iqr_xna(...)

sd_xna(...)

var_xna(...)

Arguments

...

Arguments to a descriptive statistic function

Value

A numeric vector, usually with one element, that provides the result of a descriptive statistics function applied to a vector after the NA values have been removed.

Functions

  • mean_xna(): The arithmetic mean for vectors with missing values.

  • median_xna(): The median for vectors with missing values.

  • iqr_xna(): The interquartile range for vectors with missing values.

  • sd_xna(): The standard deviation for vectors with missing values.

  • var_xna(): The variance for vectors with missing values.

Examples

set.seed(10101)
# Make a vector of random numbers
x <- runif(10, min = 10, max = 20)
# Concatenate with a NA value
x1 <- c(NA, x)
sum(x)
sum(x1) # Will be NA
sum_xna(x1) # Will be same as sum(x)
stopifnot(sum_xna(x1) == sum(x))
stopifnot(mean_xna(x1) == mean(x))
stopifnot(median_xna(x1) == median(x))
stopifnot(iqr_xna(x1) == IQR(x))
stopifnot(sd_xna(x1) == sd(x))
stopifnot(var_xna(x1) == var(x))
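
The case described above, where a statistic function is passed to another function, can be illustrated with base R alone. The helper mean_skip_na below is a hypothetical stand-in written for this sketch; it is not claimed to be the package's implementation of mean_xna().

```r
# Hypothetical helper, defined here only for illustration
mean_skip_na <- function(...) mean(..., na.rm = TRUE)

# airquality (base R) has NA values in the Ozone and Solar.R columns
sapply(airquality, mean)  # Ozone and Solar.R come back as NA

# Without a wrapper, an anonymous function is needed to set na.rm
with_anonymous <- sapply(airquality, function(x) mean(x, na.rm = TRUE))

# With a wrapper, the function can be passed by name
with_wrapper <- sapply(airquality, mean_skip_na)

stopifnot(identical(with_anonymous, with_wrapper))
```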

Independent samples t-test

Description

A wrapper to stats::t.test() with var.equal = TRUE.

Usage

t_test(formula, data)

Arguments

formula

A two-sided formula with one variable on either side, e.g. y ~ x, where the left-hand side (dependent) variable is a numeric variable in data and the right-hand side (independent) variable is a categorical or factor variable in data that has only two distinct values.

data

A data frame that contains the dependent and independent variables.

Value

A list with class "htest" as returned by stats::t.test().

Examples

t_test(trustworthy ~ face_sex, data = faithfulfaces)
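
Since the function is described as stats::t.test() with var.equal = TRUE, the underlying call can be sketched directly in base R. The built-in ToothGrowth data are used here only so that the snippet runs without the package.

```r
# Pooled-variance (Student) independent samples t-test
result <- stats::t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)

# For comparison, the stats::t.test() default is the Welch test (var.equal = FALSE)
welch <- stats::t.test(len ~ supp, data = ToothGrowth)

result$parameter  # degrees of freedom: n1 + n2 - 2 = 30 + 30 - 2 = 58
welch$parameter   # Welch degrees of freedom, no larger than the pooled value
```

The two tests give the same t statistic here because the group sizes are equal; they differ in the degrees of freedom and therefore in the p-value.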

Psychometrics raw data for testing or demo purposes

Description

Typical psychometrics raw data files have multiple psychometric variables (scales), each with multiple constituent items. In this data set, there are three psychometric variables, each with 10 constituent items. The variables can be labelled x, y, and z. The constituent items of x, y and z are ⁠x_1, x_2 ... x_10⁠, ⁠y_1, y_2 ... y_10⁠, ⁠z_1, z_2 ... z_10⁠, respectively.

Usage

data('test_psychometrics')

Format

A data frame with 44 rows and 30 columns

Examples

data(test_psychometrics)
test_psychometrics

Format Numeric Columns to Fixed Digits

Description

This function formats specified numeric columns in a data frame to a fixed number of decimal places.

Usage

to_fixed_digits(data, ..., .digits = 3)

Arguments

data

A data frame or tibble containing the columns to format.

...

<tidy-select> Columns to apply the fixed digit formatting to. If no columns are specified, all numeric columns are selected.

.digits

An integer specifying the number of decimal places to format to. Default is 3.

Details

Tibble data frames display numeric values to a certain number of significant figures, determined by the pillar.sigfig option. Sometimes it may be useful or necessary to see values to a fixed number of digits. This can be accomplished with num. This function is a convenience function that applies num to all, or a specified subset, of the numeric vectors in a tibble.

Value

A data frame with the selected numeric columns formatted to the specified number of decimal places.

Examples

# Format all numeric columns to 3 decimal places
mtcars_df <- tibble::as_tibble(mtcars)
to_fixed_digits(mtcars_df)

# Format columns mpg to qsec to 3 decimal places
to_fixed_digits(mtcars_df, mpg:qsec)

# Format specific columns to 2 decimal places
to_fixed_digits(mtcars_df, mpg, hp, .digits = 2)
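
For comparison, fixed decimal places can also be obtained in base R with formatC(). This is a different technique from the one described in Details: it converts the columns to character strings, whereas num keeps them numeric and only changes how they are displayed. The object names below are made up for the sketch.

```r
# First three rows of two numeric columns from the built-in mtcars data
df <- head(mtcars[, c("mpg", "wt")], 3)

# Format every numeric column to exactly 3 decimal places (result is character)
formatted <- df
numeric_cols <- vapply(formatted, is.numeric, logical(1))
formatted[numeric_cols] <- lapply(
  formatted[numeric_cols], formatC, format = "f", digits = 3
)
formatted
```

Because the formatted columns are no longer numeric, this approach only suits final display, not further computation.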

Calculate the total scores from sets of scores

Description

Calculate the total scores from sets of scores

Usage

total_scores(.data, ..., .method = "mean", .append = FALSE, .drop = FALSE)

Arguments

.data

A data frame with columns to be summed or averaged over.

...

A comma separated set of named tidy selectors, each of which selects a set of columns to which to apply the totalling function.

.method

The method used to calculate the total. Must be one of "mean", "sum", or "sum_like". The "mean" is the arithmetic mean, skipping missing values. The "sum" is the sum, skipping missing values. The "sum_like" is the arithmetic mean, again skipping missing values, multiplied by the number of elements, including missing values.

.append

Logical. If FALSE, just the totals are returned. If TRUE, the totals are appended as new columns to the original data frame.

.drop

Logical. If .append is TRUE and .drop is TRUE, then the variables being aggregated over are not returned.

Value

A new data frame with columns representing the total scores.

Examples

# Calculate the mean of all items beginning with `x_` and separately all items beginning with `y_`
total_scores(test_psychometrics, x = starts_with('x_'), y = starts_with('y_'))
# Calculate the sum of all items beginning with `z_` and separately all items beginning with `x_`
total_scores(test_psychometrics, .method = 'sum', z = starts_with('z_'), x = starts_with('x_'))
# Calculate the mean of all items from `x_1` to `y_10`
total_scores(test_psychometrics, xy = x_1:y_10)
# Calculate the mean of all items beginning with `x_` and separately all items beginning with `y_`,
# but append these means to the original, after dropping the variables that
# are aggregated over
total_scores(test_psychometrics, x = starts_with('x_'), y = starts_with('y_'), .append = TRUE, .drop = TRUE)
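
The difference between the three .method options, as they are described above, can be worked through by hand in base R. The toy items below are invented for illustration, and the code mirrors the written definitions rather than the function's internals.

```r
# Toy item scores for two respondents; the second has one missing item
items <- data.frame(x_1 = c(1, 2), x_2 = c(3, NA), x_3 = c(5, 4))

# "mean": arithmetic mean, skipping missing values
row_mean <- rowMeans(items, na.rm = TRUE)

# "sum": sum, skipping missing values
row_sum <- rowSums(items, na.rm = TRUE)

# "sum_like": mean (skipping missing values) times the number of items,
# counting the missing ones
row_sum_like <- rowMeans(items, na.rm = TRUE) * ncol(items)

data.frame(mean = row_mean, sum = row_sum, sum_like = row_sum_like)
```

For the second respondent the sum is 6 but the sum_like value is 9, because the missing item is in effect replaced by that person's mean of 3.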

A Tukey box-and-whisker plot

Description

This function is a wrapper around a typical ggplot based box-and-whisker plot, i.e. using geom_boxplot, which implements the Tukey variant of the box-and-whisker plot. The y variable is the outcome variable whose distribution is represented by the box-and-whisker plot. If the x variable is missing, then a single box-and-whisker plot using all values of y is shown. If an x variable is used, it is used as the independent variable, and one box-and-whisker plot is provided for each set of y values that correspond to each unique value of x. For this reason, x is usually a categorical variable. If x is a continuous numeric variable, it ideally should have relatively few unique values, so that each value of x corresponds to a sufficiently large set of y values.

Usage

tukeyboxplot(
  y,
  x,
  data,
  by = NULL,
  jitter = FALSE,
  box_width = 1/3,
  jitter_width = 1/5,
  xlab = NULL,
  ylab = NULL
)

Arguments

y

The outcome variable

x

The optional independent/predictor/grouping variable

data

The data frame with the y and (optionally) x values.

by

An optional variable, usually categorical (factor or character), by which the points in the box-and-whisker plots are grouped and coloured.

jitter

A logical variable, defaulting to FALSE, that indicates if all points in each box-and-whisker plot should be shown as jittered points.

box_width

The width of the box in each box-and-whisker plot. The default used, box_width = 1/3, means that boxes will be relatively narrow.

jitter_width

The width of the jitter relative to box width. For example, set jitter_width = 1 if you want the jitter to be as wide as the box.

xlab

The label of the x-axis (defaults to the x variable name).

ylab

The label of the y-axis (defaults to the y variable name).

Value

A ggplot2::ggplot object, which may be modified with further ggplot2 commands.

Examples

# A single box-and-whisker plot
tukeyboxplot(y = time, data = vizverb)
# One box-and-whisker plot for each value of a categorical variable
tukeyboxplot(y = time, x = task, data = vizverb)
# Box-and-whisker plots with jitters
tukeyboxplot(y = time, x = task, data = vizverb,  jitter = TRUE)
# `tukeyboxplot` can be used with a continuous numeric variable too
tukeyboxplot(y = len, x = dose, data = ToothGrowth)
tukeyboxplot(y = len, x = dose, data = ToothGrowth,
             by = supp, jitter = TRUE, box_width = 0.5, jitter_width = 1)
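
As background on what the Tukey variant displays, the numbers behind such a plot can be computed with base R's boxplot.stats(): whiskers extend to the most extreme values within 1.5 times the box height of the box edges, and anything beyond is flagged as an outlier. This is a sketch with a made-up vector; ggplot2 may compute the box edges slightly differently from boxplot.stats().

```r
# A small made-up sample with one extreme value
y <- c(1:9, 50)

# coef = 1.5 is the conventional Tukey whisker rule
s <- boxplot.stats(y, coef = 1.5)

s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # values beyond the whiskers, shown as individual points
```

Here the hinges are 3 and 8, so the upper fence is 8 + 1.5 * 5 = 15.5; the upper whisker therefore stops at 9 and the value 50 is treated as an outlier.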

Visual versus Verbal Perception and Responses

Description

An experiment studying the interaction between visual versus verbal perception and visual versus verbal responses.

Usage

vizverb

Format

A data frame with 80 observations on the following 5 variables.

subject

Subject identifying number (s1 to s20)

task

Describe a diagram (visual) or a sentence (verbal)

response

Point response (visual) or say response (verbal)

time

Response time (in seconds)

Details

Subjects carried out two kinds of tasks. One task was visual (describing a diagram), and the other was classed as verbal (reading and describing a sentence). They reported the results either by pointing (a "visual" response) or by speaking (a "verbal" response). Time to complete each task was recorded in seconds.

Source

This data set was taken from the Stats2Data R package. From the description in that package, the original data appear to have been collected in a Mount Holyoke College psychology class based replication of an experiment by Brooks, L., R. (1968) "Spatial and verbal components of the act of recall," Canadian J. Psych. V 22, pp. 349 - 368.