Mixture model prediction and annotations

Statistics stat_distrmix_line() and stat_distrmix_eq() fit a Normal mixture model. While stat_distrmix_line() adds prediction lines, stat_distrmix_eq() adds textual labels to a plot.

Usage

stat_distrmix_eq(
  mapping = NULL,
  data = NULL,
  geom = "text_npc",
  position = "identity",
  ...,
  orientation = NA,
  method = "normalmixEM",
  method.args = list(),
  n.min = 10L * k,
  level = 0.95,
  k = 2,
  free.mean = TRUE,
  free.sd = TRUE,
  se = FALSE,
  fit.seed = NA,
  fm.values = TRUE,
  components = NULL,
  eq.with.lhs = TRUE,
  eq.digits = 2,
  label.x = "left",
  label.y = "top",
  hstep = 0,
  vstep = NULL,
  output.type = NULL,
  na.rm = FALSE,
  parse = NULL,
  show.legend = NA,
  inherit.aes = TRUE
)

stat_distrmix_line(
  mapping = NULL,
  data = NULL,
  geom = "line",
  position = "identity",
  ...,
  orientation = NA,
  method = "normalmixEM",
  se = NULL,
  fit.seed = NA,
  fm.values = FALSE,
  n = min(100 + 50 * k, 300),
  fullrange = TRUE,
  level = 0.95,
  method.args = list(),
  k = 2,
  free.mean = TRUE,
  free.sd = TRUE,
  components = "all",
  n.min = 10L * k,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
)

Arguments

mapping: The aesthetic mapping, usually constructed with aes(). Only needs to be set at the layer level if you are overriding the plot defaults.
data: A layer specific dataset, only needed if you want to override the plot defaults.
geom: The geometric object to use display the data
position: The position adjustment to use for overlapping points on this layer.
...: other arguments passed on to layer. This can include aesthetics whose values you want to set, not map. See layer for more details.
orientation: character Either "x" or "y" controlling the aesthetic to which the density model is fit. With the default orientation = NA the orientation used is based on the mapping and nearly always correct.
method: function or character If character, "normalmixEM" or the name of a model fit function are accepted, possibly followed by the fit function's method argument separated by a colon. The function must return a model fit object of class "mixEM".
method.args: named list with additional arguments. Not data or weights which are always passed through aesthetic mappings.
n.min: integer Minimum number of distinct values in the variable for fitting to the attempted. The default depends on k.
level: Level of confidence interval to use (0.95 by default).
k: integer Number of mixture components to fit.
free.mean, free.sd: logical If TRUE, allow the fitted mean and/or fitted sd to vary among the component Normal distributions.
se: logical If TRUE standard errors for the parameter estimates are returned in addition to the parameter estimates.
fit.seed: RNG seed argument passed to set.seed(). Defaults to NA, indicating that set.seed() should not be called.
fm.values: logical Add parameter estimates and their standard errors to the returned values (`FALSE` by default.)
components: character One of "all", "sum", or "members" select which densities are returned.
eq.with.lhs: If character the string is pasted to the front of the equation label before parsing or a logical (see note).
eq.digits: integer Number of digits after the decimal point to use for parameters in labels. If Inf, use exponential notation with three decimal places.
label.x, label.y: numeric with range 0..1 "normalized parent coordinates" (npc units) or character if using geom_text_npc() or geom_label_npc(). If using geom_text() or geom_label() numeric in native data units. If too short they will be recycled.
hstep, vstep: numeric in npc units, the horizontal and vertical step used between labels for different groups.
output.type: character One of "expression", "text", "markdown", "marquee", "latex", "latex.eqn", "latex.deqn" or "numeric".
na.rm: a logical indicating whether NA values should be stripped before the computation proceeds.
parse: logical Passed to the geom. If TRUE, the labels will be parsed into expressions and displayed as described in plotmath. Default is TRUE if output.type = "expression" and FALSE otherwise.
show.legend: logical. Should this layer be included in the legends? NA, the default, includes if any aesthetics are mapped. FALSE never includes, and TRUE always includes.
inherit.aes: If FALSE, overrides the default aesthetics, rather than combining with them. This is most useful for helper functions that define both data and aesthetics and shouldn't inherit behaviour from the default plot specification, e.g. borders.
n: Number of points at which to predict with the fitted model.
fullrange: logical Should the fit prediction span the full range of the plot, or just the range of the explanatory variable?

Value

The value returned by the statistic is a data frame, with n rows of predicted density for each component of the mixture plus their sum and the corresponding vector of x values. Optionally it will also include additional values related to the model fit.

The value returned by stat_distrmix_line() is a data frame, with n rows of predicted density for each component of the mixture plus their sum and the corresponding vector of x values.

The value returned by stat_distrmix_eq() is a data frame, with one row of estimates for each group of data in the plot.

Both statistics optionally also return additional values related to the model fit.

Details

stat_distrmix_line() is similar to stat_density but in addition to fitting a single distribution it can fit a mixture of two or more Normal distributions, using an approach related to clustering. Defaults are consistent between stat_distrmix_line() and stat_distrmix_eq(). stat_distrmix_eq() can be used to add matched textual annotations.

If k >= 2 a mixture of Normals model is fitted with normalmixEM(), while if k == 1 a single Normal distribution is fitted with function fitdistr(). Only for k == 1 the SE values are exact estimates.

Parameter fit.seed if not NA is used in a call to set.seed() immediately before calling the model fit function. As the fitting procedure makes use of the (pseudo-)random number generator (RNG), convergence can depend on it, and in such cases setting fit.seed to the same value in stat_distrmix_line() and in stat_distrmix_eq() can ensure consistency, and more generally, reproducibility.

The minimum number of observations with distinct values in the explanatory variable can be set through parameter n.min. The default depends on k, the number of components in the mix. Model fits with too few observations are unreliable, thus, using larger values of n.min than the default is wise.

Variables computed by `stat_distrmix_line()`

Some of the returned variables depend on the orientation.

x: the n values for the quantiles
component: A factor indexing the components and/or their sum

If fm.values = TRUE is passed then columns with diagnosis and parameters estimates are added, with the same value in each row within a group:

n: numeric the number of x values
.size: numeric the number of density values
fm.class: character the most derived class of the fitted model object
fm.method: character the method, as given by the ft field of the fitted model objects

This provides a simple and robust approach to achieve effects like colouring or hiding annotations by group depending on the outcome of model fitting.

Variables computed by `stat_distrmix_eq()`

Some of the variables depend on the orientation:

y: the location of text labels
eq.label: character string for equations
n.label: character string for number of observations
method.label: character string for model fit method
lambda: numeric the estimate of the contribution of the component of the mixture towards the joint density
mu: numeric the estimate of the mean
sigma: numeric the estimate of the standard deviation
component: A factor indexing the components of the mixture and/or their sum

If SE = TRUE is passed then columns with standard errors for the parameter estimates:

mu.se: numeric the estimate of the mean
sigma.se: numeric the estimate of the standard deviation

If fm.values = TRUE the same additional columns are returned as by stat_distrmix_eq(). This is wasteful of storage space as values are stored in multiple copies and, thus, disabled by default. However, it provides a simple and robust approach to achieve effects like colouring or hiding of the model fit line by group depending on the outcome of model fitting.

Output types

The formatting of character strings to be displayed in plots are marked as mathematical equations. Depending on the geom used, the mark-up needs to be encoded differently, or in some cases mark-up not applied.

"expression": The labels are encoded as character strings to be parsed into R's plotmath expressions.
"LaTeX", "TeX", "tikz", "latex": The labels are encoded as 'LaTeX' maths equations, without the "fences" for switching in math mode.
"latex.eqn": Same as "latex" but enclosed in single $, i.e., as in-line maths.
"latex.deqn": Same as "latex" but enclosed in double $$, i.e., as display maths.
"markdown": The labels are encoded as character strings using markdown syntax, with some embedded HTML.
"marquee": The labels are encoded as character strings using markdown syntax, with 'marquee' supported spans.
"text": The labels are plain ASCII character strings.
"numeric": No labels are generated. This value is accepted by the statistics, but not by the label formatting functions.
NULL: The value used depends on the argument passed to geom.

If geom = "latex" (package 'xdvir') the output type used is "latex.eqn". If geom = "richtext" (package 'ggtext') or geom = "textbox" (package 'ggtext') the output type used is "markdown". If geom = "marquee" (package 'marquee') the output type used is "marquee". For all other values of geom the default is "expression". Invalid values as argument trigger an error.

Position of labels

When data are grouped by mapping a factor to an aesthetic, e.g., colour, shape and/or linetype the model is fitted separately to each group, and for each group a whole set of labels is generated. If the argument passed to label.y is a vector of length 1, this value determines the position of the equation and/or other labels for the first group, and the positions of the labels for the remaining groups are generated by adding vspace based on the group number. If the argument passed to label.y is a vector of length > 1, it is used unchanged, possibly extended by recycling, ignoring vstep.

If the labels are rotated by 90 degrees then the automatic stepping is best based on hstep with vstep = 0. Similarly as described above, if label.x is a vector of length > 1, it is used unchanged, possibly extended by recycling, ignoring hstep.

When using facets and with a grouping that does not repeat in each panel, the automatic positioning in most cases will not be the desired one. Manual positioning using a vector of length > 1 for label.x and/or label.y is the currently available workaround.

Aesthetics

stat_distrmix_eq() understands the following aesthetics. Required aesthetics are displayed in bold and defaults are displayed for optional aesthetics:

•	`x` or `y`
•	`group`	→ `after_stat(component)`
•	`hjust`	→ `"inward"`
•	`label`	→ `after_stat(eq.label)`
•	`npcx`	→ `after_stat(npcx)`
•	`npcy`	→ `after_stat(npcy)`
•	`vjust`	→ `"inward"`

stat_distrmix_line() understands the following aesthetics. Required aesthetics are displayed in bold and defaults are displayed for optional aesthetics:

•	`x` or `y`
•	`group`	→ `after_stat(component)`
•	`weight`	→ `NULL`

Learn more about setting these aesthetics in vignette("ggplot2-specs").

Examples

ggplot(faithful, aes(x = waiting)) +
  stat_distrmix_line(components = "sum") +
  stat_distrmix_eq()


ggplot(faithful, aes(x = waiting)) +
  stat_distrmix_line(components = "sum") +
  stat_distrmix_eq(use_label("eq", "n", "method"))


ggplot(faithful, aes(x = waiting)) +
  stat_distrmix_line(components = "sum") +
  stat_distrmix_eq(geom = "label_npc")


ggplot(faithful, aes(x = waiting)) +
  stat_distrmix_line(components = "sum") +
  stat_distrmix_eq(geom = "text", label.x = "center", label.y = "bottom")


ggplot(faithful, aes(x = waiting)) +
  stat_distrmix_line(components = "sum") +
  stat_distrmix_eq(geom = "text", hjust = "inward")


ggplot(faithful, aes(x = waiting)) +
  stat_distrmix_line(components = "members") +
  stat_distrmix_eq(components = "members")


ggplot(faithful, aes(x = waiting)) +
  stat_distrmix_line(components = "members") +
  stat_distrmix_eq(components = "members", se = TRUE)


ggplot(faithful, aes(y = waiting)) +
  stat_distrmix_line(components = "sum") +
  stat_distrmix_eq(label.x = "right")


ggplot(faithful, aes(x = waiting)) +
 geom_histogram(aes(y = after_stat(density)), bins = 20) +
 stat_distrmix_line(aes(colour = after_stat(component),
                         fill = after_stat(component)),
                     geom = "area", linewidth = 1, alpha = 0.25) +
 stat_distrmix_eq(aes(colour = after_stat(component)))


ggplot(faithful, aes(x = waiting)) +
 stat_distrmix_line(aes(colour = after_stat(component),
                         fill = after_stat(component)),
                     geom = "area", linewidth = 1, alpha = 0.25,
                     components = "members") +
 stat_distrmix_eq(aes(colour = after_stat(component)),
                     components = "members")


ggplot(faithful, aes(x = waiting)) +
 stat_distrmix_line(geom = "area", linewidth = 1, alpha = 0.25,
                     colour = "black", outline.type = "upper",
                     components = "sum", se = FALSE) +
 stat_distrmix_eq(components = "sum")


# special case of no mixture
ggplot(subset(faithful, waiting > 66), aes(x = waiting)) +
  stat_distrmix_line(k = 1) +
  stat_distrmix_eq(k = 1)
#> With k = 1 one Normal distribution is fitted. Irrelevant parameters ignored!
#> With k = 1 one Normal distribution is fitted. Irrelevant parameters ignored!


ggplot(subset(faithful, waiting > 66), aes(x = waiting)) +
  stat_distrmix_line(k = 1) +
  stat_distrmix_eq(k = 1, se = TRUE)
#> With k = 1 one Normal distribution is fitted. Irrelevant parameters ignored!
#> With k = 1 one Normal distribution is fitted. Irrelevant parameters ignored!


# Inspecting the returned data using geom_debug_group()
gginnards.installed <- requireNamespace("gginnards", quietly = TRUE)

if (gginnards.installed)
  library(gginnards)

if (gginnards.installed)
  ggplot(faithful, aes(x = waiting)) +
    stat_distrmix_line(geom = "debug_group", components = "all")

#> [1] "PANEL 1; group(s) comp.1; 'draw_function()' input 'data' (head):"
#>           x component      density flipped_aes PANEL  group            y
#> 2  35.29541    comp.1 0.0001092383       FALSE     1 comp.1 0.0001092383
#> 5  35.61754    comp.1 0.0001306554       FALSE     1 comp.1 0.0001306554
#> 8  35.93967    comp.1 0.0001558017       FALSE     1 comp.1 0.0001558017
#> 11 36.26180    comp.1 0.0001852294       FALSE     1 comp.1 0.0001852294
#> 14 36.58392    comp.1 0.0002195534       FALSE     1 comp.1 0.0002195534
#> 17 36.90605    comp.1 0.0002594557       FALSE     1 comp.1 0.0002594557
#>    orientation
#> 2           NA
#> 5           NA
#> 8           NA
#> 11          NA
#> 14          NA
#> 17          NA
#> [1] "PANEL 1; group(s) comp.2; 'draw_function()' input 'data' (head):"
#>           x component      density flipped_aes PANEL  group            y
#> 3  35.29541    comp.2 9.599817e-15       FALSE     1 comp.2 9.599817e-15
#> 6  35.61754    comp.2 1.457556e-14       FALSE     1 comp.2 1.457556e-14
#> 9  35.93967    comp.2 2.206370e-14       FALSE     1 comp.2 2.206370e-14
#> 12 36.26180    comp.2 3.329836e-14       FALSE     1 comp.2 3.329836e-14
#> 15 36.58392    comp.2 5.010237e-14       FALSE     1 comp.2 5.010237e-14
#> 18 36.90605    comp.2 7.515967e-14       FALSE     1 comp.2 7.515967e-14
#>    orientation
#> 3           NA
#> 6           NA
#> 9           NA
#> 12          NA
#> 15          NA
#> 18          NA
#> [1] "PANEL 1; group(s) comp.sum; 'draw_function()' input 'data' (head):"
#>           x component      density flipped_aes PANEL    group            y
#> 1  35.29541  comp.sum 0.0001092383       FALSE     1 comp.sum 0.0001092383
#> 4  35.61754  comp.sum 0.0001306554       FALSE     1 comp.sum 0.0001306554
#> 7  35.93967  comp.sum 0.0001558017       FALSE     1 comp.sum 0.0001558017
#> 10 36.26180  comp.sum 0.0001852294       FALSE     1 comp.sum 0.0001852294
#> 13 36.58392  comp.sum 0.0002195534       FALSE     1 comp.sum 0.0002195534
#> 16 36.90605  comp.sum 0.0002594557       FALSE     1 comp.sum 0.0002594557
#>    orientation
#> 1           NA
#> 4           NA
#> 7           NA
#> 10          NA
#> 13          NA
#> 16          NA
    stat_distrmix_eq(geom = "debug_group", components = "all")
#> geom_debug_group: na.rm = FALSE
#> stat_distrmix_eq: method = normalmixEM, method.name = normalmixEM, se = FALSE, fit.seed = NA, level = 0.95, method.args = list(), k = 2, free.mean = TRUE, free.sd = TRUE, components = all, n.min = 20, eq.with.lhs = TRUE, eq.digits = 2, label.x = left, label.y = top, hstep = 0, vstep = 0.05, npc.used = FALSE, output.type = expression, na.rm = FALSE, orientation = NA, parse = TRUE
#> position_identity 

if (gginnards.installed)
  ggplot(faithful, aes(x = waiting)) +
    stat_distrmix_eq(geom = "debug_group", components = "sum")

#> [1] "PANEL 1; group(s) comp.sum; 'draw_function()' input 'data' (head):"
#>   lambda mu sigma  k converged   n fm.class   fm.method component
#> 1      1 NA    NA NA      TRUE 272    mixEM normalmixEM  comp.sum
#>                                                                                             eq.label
#> 1 DF~`=`~0.36 %*% italic(N)(mu*`=`*55, sigma*`=`*5.9) + 0.64 %*% italic(N)(mu*`=`*80, sigma*`=`*5.9)
#>     n.label          method.label npcx npcy PANEL    group
#> 1 n~`=`~272 "method: normalmixEM"   NA   NA     1 comp.sum
#>                                                                                                label
#> 1 DF~`=`~0.36 %*% italic(N)(mu*`=`*55, sigma*`=`*5.9) + 0.64 %*% italic(N)(mu*`=`*80, sigma*`=`*5.9)
#>      x    y orientation
#> 1 0.05 0.95          NA

if (gginnards.installed)
  ggplot(faithful, aes(x = waiting)) +
    stat_distrmix_eq(geom = "debug_group", components = "members")

#> [1] "PANEL 1; group(s) comp.1; 'draw_function()' input 'data' (head):"
#>      lambda       mu    sigma k converged   n fm.class   fm.method component
#> 1 0.3608869 54.61489 5.871242 2      TRUE 272    mixEM normalmixEM    comp.1
#>                                              eq.label   n.label
#> 1 DF~`=`~0.36 %*% italic(N)(mu*`=`*55, sigma*`=`*5.9) n~`=`~272
#>            method.label npcx    x npcy PANEL  group
#> 1 "method: normalmixEM"   NA 0.05   NA     1 comp.1
#>                                                 label    y orientation
#> 1 DF~`=`~0.36 %*% italic(N)(mu*`=`*55, sigma*`=`*5.9) 0.95          NA
#> [1] "PANEL 1; group(s) comp.2; 'draw_function()' input 'data' (head):"
#>      lambda       mu    sigma k converged   n fm.class   fm.method component
#> 2 0.6391131 80.09109 5.867718 2      TRUE 272    mixEM normalmixEM    comp.2
#>                                              eq.label   n.label
#> 2 DF~`=`~0.64 %*% italic(N)(mu*`=`*80, sigma*`=`*5.9) n~`=`~272
#>            method.label npcx    x npcy PANEL  group
#> 2 "method: normalmixEM"   NA 0.05   NA     1 comp.2
#>                                                 label   y orientation
#> 2 DF~`=`~0.64 %*% italic(N)(mu*`=`*80, sigma*`=`*5.9) 0.9          NA

if (gginnards.installed)
  ggplot(faithful, aes(x = waiting)) +
    stat_distrmix_eq(geom = "debug_group",
                      components = "members",
                      fm.values = TRUE)

#> [1] "PANEL 1; group(s) comp.1; 'draw_function()' input 'data' (head):"
#>      lambda       mu    sigma k converged   n fm.class   fm.method component
#> 1 0.3608873 54.61491 5.871254 2      TRUE 272    mixEM normalmixEM    comp.1
#>                                              eq.label   n.label
#> 1 DF~`=`~0.36 %*% italic(N)(mu*`=`*55, sigma*`=`*5.9) n~`=`~272
#>            method.label npcx    x npcy PANEL  group
#> 1 "method: normalmixEM"   NA 0.05   NA     1 comp.1
#>                                                 label    y orientation
#> 1 DF~`=`~0.36 %*% italic(N)(mu*`=`*55, sigma*`=`*5.9) 0.95          NA
#> [1] "PANEL 1; group(s) comp.2; 'draw_function()' input 'data' (head):"
#>      lambda      mu    sigma k converged   n fm.class   fm.method component
#> 2 0.6391127 80.0911 5.867709 2      TRUE 272    mixEM normalmixEM    comp.2
#>                                              eq.label   n.label
#> 2 DF~`=`~0.64 %*% italic(N)(mu*`=`*80, sigma*`=`*5.9) n~`=`~272
#>            method.label npcx    x npcy PANEL  group
#> 2 "method: normalmixEM"   NA 0.05   NA     1 comp.2
#>                                                 label   y orientation
#> 2 DF~`=`~0.64 %*% italic(N)(mu*`=`*80, sigma*`=`*5.9) 0.9          NA