| Title: | A Basic Set of Functions for Compositional Data Analysis |
|---|---|
| Description: | A minimum set of functions to perform compositional data analysis using the log-ratio approach introduced by John Aitchison (1982). Main functions have been implemented in c++ for better performance. |
| Authors: | Marc Comas-Cufí [aut, cre] (ORCID: <https://orcid.org/0000-0001-9759-0622>) |
| Maintainer: | Marc Comas-Cufí <[email protected]> |
| License: | GPL |
| Version: | 1.0.7 |
| Built: | 2026-05-12 14:26:09 UTC |
| Source: | https://github.com/mcomas/coda.base |
The 'alimentation' data set contains the percentage composition of food consumption in 25 European countries during the 1980s. The food categories are:
'RM': red meat (pork, veal, beef),
'WM': white meat (chicken),
'E': eggs,
'M': milk,
'F': fish,
'C': cereals,
'S': starch (potatoes),
'N': nuts,
'FV': fruits and vegetables.
The data set also contains categorical variables indicating whether the country belongs to the North or South/Mediterranean group, and whether it is an Eastern or Western European country.
alimentationalimentation
An object of class data.frame with 25 rows and 13 columns.
Construct the transformation matrix associated with additive log-ratio (alr) coordinates.
alr_basis(dim, denominator = NULL, numerator = NULL)alr_basis(dim, denominator = NULL, numerator = NULL)
dim |
Number of parts. It can be a single integer, a matrix or data frame, or a character vector of part names. |
denominator |
Part used as denominator. By default, the last part is used. |
numerator |
Parts used as numerators. By default, all parts except the denominator are used, preserving their original order. |
A matrix defining the alr coordinate system.
Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall, London.
alr_basis(5) alr_basis(5, 3) alr_basis(5, 3, c(1, 5, 2, 4))alr_basis(5) alr_basis(5, 3) alr_basis(5, 3, c(1, 5, 2, 4))
The 'arctic_lake' data set records the three-part composition
of 39 sediment samples collected at different water
depths in an Arctic lake.
arctic_lakearctic_lake
An object of class data.frame with 39 rows and 5 columns.
In humans, the main blood group systems are the ABO system, the Rh system, and the MN system. The MN blood system is related to proteins of the red blood cell plasma membrane. Its inheritance pattern is autosomal with codominance, meaning that the heterozygous phenotype is distinct from both homozygous phenotypes.
The three phenotypes are M, N, and MN. Their frequencies vary across populations. Under the Hardy-Weinberg principle, allele and genotype frequencies remain constant across generations in the absence of evolutionary forces, implying that
where and are the genotype frequencies of the
homozygotes and is the genotype frequency of heterozygotes.
blood_mnblood_mn
An object of class data.frame with 49 rows and 5 columns.
The 'bmi_activity' data set records the proportion of daily time spent in sleep ('sleep'), sedentary behaviour ('sedent'), light physical activity ('Lpa'), moderate physical activity ('Mpa'), and vigorous physical activity ('Vpa') for 393 children. The standardized body mass index ('zBMI') of each child is also included.
This data set was used in the example of Dumuid et al. (2019) to examine the expected differences in zBMI associated with reallocations of daily time between sleep, sedentary behaviour, and physical activity. Because the original data are confidential, 'bmi_activity' contains simulated data that mimic the main features of the original study.
bmi_activitybmi_activity
An object of class data.frame with 393 rows and 8 columns.
Dumuid, D., Pedisic, Z., Stanford, T. E., Martín-Fernández, J. A., Hron, K., Maher, C., Lewis, L. K., & Olds, T. S. (2019). The Compositional Isotemporal Substitution Model: a Method for Estimating Changes in a Health Outcome for Reallocation of Time between Sleep, Sedentary Behaviour, and Physical Activity. Statistical Methods in Medical Research, 28(3), 846–857.
Construct an ilr basis rotated according to canonical correlations between a compositional response data set and an explanatory data set.
cc_basis(Y, X)cc_basis(Y, X)
Y |
A compositional data set. |
X |
An explanatory data set. |
A matrix whose columns define a canonical-correlation-oriented ilr basis.
Construct the default isometric log-ratio basis used in CoDaPack.
cdp_basis(dim)cdp_basis(dim)
dim |
Number of parts. It can be a single integer, a matrix or data frame, or a character vector of part names. |
A matrix with rows and columns containing the
CoDaPack default ilr basis.
cdp_basis(5) cdp_basis(c("a", "b", "c", "d"))cdp_basis(5) cdp_basis(c("a", "b", "c", "d"))
Compute the default binary partition used in CoDaPack's software
cdp_partition(ncomp)cdp_partition(ncomp)
ncomp |
number of parts |
matrix
cdp_partition(4)cdp_partition(4)
Generic function to calculate the center of a compositional dataset
center(X, zero.rm = FALSE, na.rm = FALSE)center(X, zero.rm = FALSE, na.rm = FALSE)
X |
compositional dataset |
zero.rm |
a logical value indicating whether zero values should be stripped before the computation proceeds. |
na.rm |
a logical value indicating whether NA values should be stripped before the computation proceeds. |
X = matrix(exp(rnorm(5*100)), nrow=100, ncol=5) g = rep(c('a','b','c','d'), 25) center(X) (by_g <- by(X, g, center)) center(t(simplify2array(by_g)))X = matrix(exp(rnorm(5*100)), nrow=100, ncol=5) g = rep(c('a','b','c','d'), 25) center(X) (by_g <- by(X, g, center)) center(t(simplify2array(by_g)))
Applies the closure operation to a numeric vector, matrix or data frame so
that each composition sums to a prescribed constant k.
closure(X, k = 1)closure(X, k = 1)
X |
A numeric vector, matrix, data frame, or an object coercible to one of these. For matrices and data frames, rows are interpreted as compositions. |
k |
A numeric vector of length 1 or length |
If X is:
a vector, the returned vector sums to k;
a matrix or data frame, closure is applied row-wise, and each row
sums to the corresponding value of k.
The argument k may be:
a single positive number, recycled to all rows;
a numeric vector of length nrow(X), specifying a different
closure constant for each row.
For a composition with positive sum,
the closure to constant is
This function requires all entries of X to be finite and
non-negative, and every row sum (or the vector sum) must be strictly
positive.
If X is a vector, a numeric vector of the same length.
If X is a matrix, a numeric matrix with the same dimensions,
dimnames, and row-wise sums equal to k.
If X is a data frame, a data frame with the same row and column names,
and row-wise sums equal to k.
closure(c(2, 3, 5)) closure(c(2, 3, 5), k = 100) X <- matrix(c(1, 1, 2, 2, 3, 5), nrow = 2, byrow = TRUE) closure(X) closure(X, k = c(1, 100)) df <- data.frame(a = c(1, 2), b = c(1, 3), c = c(2, 5)) closure(df, k = 10)closure(c(2, 3, 5)) closure(c(2, 3, 5), k = 100) X <- matrix(c(1, 1, 2, 2, 3, 5), nrow = 2, byrow = TRUE) closure(X) closure(X, k = c(1, 100)) df <- data.frame(a = c(1, 2), b = c(1, 3), c = c(2, 5)) closure(df, k = 10)
Construct the transformation matrix associated with centered log-ratio (clr) coordinates.
clr_basis(dim)clr_basis(dim)
dim |
Number of parts. It can be a single integer, a matrix or data frame, or a character vector of part names. |
CLR coordinates are linearly dependent and lie in the
dimensional clr-plane.
A square matrix defining the clr coordinate system.
Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall, London.
B <- clr_basis(5) clr_coordinates <- coordinates(c(1, 2, 3, 4, 5), B) sum(clr_coordinates) < 1e-15B <- clr_basis(5) clr_coordinates <- coordinates(c(1, 2, 3, 4, 5), B) sum(clr_coordinates) < 1e-15
Performs imputation of missing values and/or values below the detection limit in compositional data using an EM algorithm assuming normality on the simplex.
coda_replacement( X, DL = NULL, dl_prop = 0.65, eps = 1e-04, parameters = FALSE, debug = FALSE, maxit = 500 )coda_replacement( X, DL = NULL, dl_prop = 0.65, eps = 1e-04, parameters = FALSE, debug = FALSE, maxit = 500 )
X |
A compositional data set: numeric matrix or data frame where rows represent observations and columns represent parts. |
DL |
An optional matrix or vector of detection limits. If 'NULL', the minimum non-zero value in each column of 'X' is used. |
dl_prop |
A numeric value between 0 and 1 used for initialization in the EM algorithm. |
eps |
Convergence tolerance. |
parameters |
Logical; if 'TRUE', return additional estimated parameters. |
debug |
Logical; if 'TRUE', print the log-likelihood at each iteration. |
maxit |
Maximum number of iterations |
If 'parameters = FALSE', the imputed object with the same format as 'X' ('matrix' or 'data.frame', preserving data-frame subclasses when possible) and preserving original names. If 'parameters = TRUE', a list with the estimated clr mean, clr covariance, and imputed clr coordinates.
X <- matrix(c( 0.00, 0.30, 0.70, 0.20, NA, 0.80, 0.40, 0.60, 0.00, 0.25, 0.25, 0.50, 0.10, 0.30, 0.60 ), ncol = 3, byrow = TRUE) colnames(X) <- c("sand", "silt", "clay") DL <- c(0.05, 0.05, 0.05) X_imp <- coda_replacement(X, DL = DL, maxit = 20) X_imp set.seed(10) X <- composition(matrix(rnorm(3*10), ncol = 3)) X[sample(c(TRUE, FALSE), 4*10, replace = TRUE, c(1, 3))] <- NA params <- coda_replacement(X, parameters = TRUE, debug = TRUE) names(params) params$clr_mu composition(params$clr_h)X <- matrix(c( 0.00, 0.30, 0.70, 0.20, NA, 0.80, 0.40, 0.60, 0.00, 0.25, 0.25, 0.50, 0.10, 0.30, 0.60 ), ncol = 3, byrow = TRUE) colnames(X) <- c("sand", "silt", "clay") DL <- c(0.05, 0.05, 0.05) X_imp <- coda_replacement(X, DL = DL, maxit = 20) X_imp set.seed(10) X <- composition(matrix(rnorm(3*10), ncol = 3)) X[sample(c(TRUE, FALSE), 4*10, replace = TRUE, c(1, 3))] <- NA params <- coda_replacement(X, parameters = TRUE, debug = TRUE) names(params) params$clr_mu composition(params$clr_h)
A minimum set of functions to perform compositional data analysis using the log-ratio approach introduced by John Aitchison (1982) <https://www.jstor.org/stable/2345821>. Main functions have been implemented in c++ for better performance.
Marc Comas-Cufí
Useful links:
Reconstruct a composition from coordinates with respect to a given basis.
composition(H, basis = "ilr") comp(H, basis = "ilr")composition(H, basis = "ilr") comp(H, basis = "ilr")
H |
Coordinates of a composition. It can be a numeric matrix, a data frame, or a numeric vector. |
basis |
Basis used to interpret the coordinates. Either a character string naming a predefined basis or a matrix. |
A composition corresponding to the given coordinates.
coordinates, ilr_basis, alr_basis,
clr_basis, sbp_basis
Compute orthonormal ilr bases adapted to row-wise conditioning patterns.
conditional_obasis(X, scheme = c("zero", "zero_na"))conditional_obasis(X, scheme = c("zero", "zero_na"))
X |
A numeric matrix or data frame with one observation or conditioning pattern per row and one part per column. |
scheme |
Character string indicating the conditioning scheme. Possible values are '"zero"' and '"zero_na"'. Default is '"zero"'. |
Each row of 'X' defines one conditioning pattern on the parts of a composition. According to 'scheme', the parts are split into ordered blocks:
'"zero"': parts equal to '0' and parts with strictly positive values,
'"zero_na"': missing values ('NA'), zeros, and strictly positive values.
For each row, the function constructs an orthonormal basis of the clr-plane preserving the block structure induced by the selected scheme.
Under 'scheme = "zero"', if a row contains 'nz' zeros, then:
the first 'nz - 1' coordinates describe the internal log-ratio structure of the zero block,
the coordinate 'nz' describes the balance between the zero block and the positive block,
the remaining coordinates describe the internal log-ratio structure of the positive block.
Under 'scheme = "zero_na"', the blocks are ordered as:
missing values ('NA'),
zeros,
strictly positive values.
In this case:
the first coordinates describe the internal structure of the 'NA' block,
the next coordinate contrasts the 'NA' block with the positive block,
the following coordinates describe the internal structure of the zero block,
the next coordinate contrasts the zero block with the positive block,
the remaining coordinates describe the internal structure of the positive block.
A three-dimensional array of dimension '(D - 1, D, nrow(X))', where 'D' is the number of parts. Each slice contains one orthonormal ilr basis.
C <- rbind( c(0, 0, 1, 1, 0), c(0, 1, 0, 1, 0) ) conditional_obasis(C) X <- rbind( c(1, NA, 0, 2), c(NA, 3, 0, 4), c(1, 2, 3, 4) ) conditional_obasis(X, scheme = "zero_na")C <- rbind( c(0, 0, 1, 1, 0), c(0, 1, 0, 1, 0) ) conditional_obasis(C) X <- rbind( c(1, NA, 0, 2), c(NA, 3, 0, 4), c(1, 2, 3, 4) ) conditional_obasis(X, scheme = "zero_na")
Compute coordinates of a composition or a compositional data set with respect to a given log-ratio basis.
The 'basis' argument can be either:
a character string identifying a predefined coordinate system, or
a matrix whose columns define a system of log-contrasts.
The predefined options are:
'"ilr"': isometric log-ratio coordinates,
'"olr"': orthonormal log-ratio coordinates,
'"clr"': centered log-ratio coordinates,
'"alr"': additive log-ratio coordinates,
'"pw"': pairwise log-ratios,
'"pc"': principal component log-ratio coordinates,
'"pb"': principal balance coordinates,
'"cdp"': CoDaPack default balances.
coordinates(X, basis = "ilr") coord(..., basis = "ilr") alr_c(X) clr_c(X) ilr_c(X) olr_c(X)coordinates(X, basis = "ilr") coord(..., basis = "ilr") alr_c(X) clr_c(X) ilr_c(X) olr_c(X)
X |
A compositional data set. It can be a numeric matrix, a data frame, or a numeric vector. |
basis |
Basis used to compute the coordinates. Either a character string naming a predefined basis or a matrix with log-ratio basis vectors in columns. |
... |
components of the composition |
Coordinates of 'X' with respect to the given 'basis'. The returned object has the same general type as the input when possible.
ilr_basis, alr_basis, clr_basis,
sbp_basis, composition
coordinates(1:5) B <- ilr_basis(5) coordinates(1:5, B) X <- rbind(1:5, 2:6) coordinates(X, "clr")coordinates(1:5) B <- ilr_basis(5) coordinates(1:5, B) X <- rbind(1:5, 2:6) coordinates(X, "clr")
Compute a distance matrix for compositional data, including the Aitchison
distance as an extension of dist.
dist(x, method = "euclidean", ...)dist(x, method = "euclidean", ...)
x |
A data matrix whose rows are compositions. |
method |
The distance measure to be used. This must be one of
|
... |
Additional arguments passed to |
An object of class "dist".
X <- exp(matrix(rnorm(10 * 50), ncol = 50, nrow = 10)) (d <- dist_coda(X, method = "aitchison")) plot(hclust(d)) # In contrast to Euclidean distance dist(rbind(c(1, 1, 1), c(100, 100, 100)), method = "euc") # Using Aitchison distance, only relative information is of importance dist_coda(rbind(c(1, 1, 1), c(100, 100, 100)), method = "ait")X <- exp(matrix(rnorm(10 * 50), ncol = 50, nrow = 10)) (d <- dist_coda(X, method = "aitchison")) plot(hclust(d)) # In contrast to Euclidean distance dist(rbind(c(1, 1, 1), c(100, 100, 100)), method = "euc") # Using Aitchison distance, only relative information is of importance dist_coda(rbind(c(1, 1, 1), c(100, 100, 100)), method = "ait")
Compute a distance matrix for compositional data using selected CoDa distances.
dist_coda(x, method = "aitchison", ...)dist_coda(x, method = "aitchison", ...)
x |
A data matrix whose rows are compositions. |
method |
The distance measure to be used. This must be one of
|
... |
Additional arguments. |
An object of class "dist".
Saperas-Riera, J.; Mateu-Figueras, G.; Martín-Fernández, J.A. (2024). Lp-Norm for Compositional Data: Exploring the CoDa L1-Norm in Penalised Regression. Mathematics, 12(9), 1388. doi:10.3390/math12091388.
set.seed(1) X <- exp(matrix(rnorm(10 * 5), ncol = 5, nrow = 10)) dist_coda(X, method = "aitchison") dist_coda(X, method = "L2") dist_coda(X, method = "L1") dist_coda(X, method = "L1-pw") dist_coda(X, method = "L1-clr")set.seed(1) X <- exp(matrix(rnorm(10 * 5), ncol = 5, nrow = 10)) dist_coda(X, method = "aitchison") dist_coda(X, method = "L2") dist_coda(X, method = "L1") dist_coda(X, method = "L1-pw") dist_coda(X, method = "L1-clr")
According to the three-sector theory, employment shifts from the primary sector (raw material extraction), to the secondary sector (industry, energy, and construction), and then to the tertiary sector (services) as economies develop. The 'eurostat_employment' data set contains EUROSTAT data on employment, aggregated for both sexes and all ages, distributed by economic activity in 2008 for 29 EUROSTAT member countries.
A related variable is the logarithm of gross domestic product per person in EUR at current prices ('logGDP'). For exploratory purposes, it is also categorised as a binary variable indicating values above or below the median ('Binary GDP').
The employment composition has 11 parts:
Primary sector
Manufacturing
Energy
Construction
Trade repair transport
Hotels restaurants
Financial intermediation
Real estate
Educ admin defense soc sec
Health social work
Other services
eurostat_employmenteurostat_employment
An object of class data.frame with 29 rows and 17 columns.
The 'foraminiferals' data set (Aitchison, 1986) is a classical example of paleocological compositional data. It contains the composition of four fossil types (Neogloboquadrina atlantica, Neogloboquadrina pachyderma, Globorotalia obesa, and Globigerinoides triloba) at 30 different depths.
Because the data contain rounded zeros, zero-replacement techniques are typically required before analysis. A natural goal is then to study the association between fossil composition and depth.
foraminiferalsforaminiferals
An object of class data.frame with 30 rows and 5 columns.
Simulate compositional data and optionally introduce structural zeros (interpreted as values below a detection limit) and missing values.
The function first generates a compositional data set 'X0', then creates a modified version 'X' by:
replacing values below 'dl_par' by zero, if 'zeros = TRUE',
introducing missing values at random, if 'missings = TRUE'.
A matrix of detection limits 'DL' is also returned. It contains 'dl_par' in the positions that were censored to zero, and '0' elsewhere.
gen_coda_with_zeros_and_missings( n, d, missings = TRUE, zeros = TRUE, dl_par = 0.05, na_p = 0.15 )gen_coda_with_zeros_and_missings( n, d, missings = TRUE, zeros = TRUE, dl_par = 0.05, na_p = 0.15 )
n |
Number of observations. |
d |
Dimension of the latent coordinate space used to generate the compositions. |
missings |
Logical; if 'TRUE', introduce missing values at random. |
zeros |
Logical; if 'TRUE', replace values below 'dl_par' by zero. |
dl_par |
Detection-limit threshold used to generate zeros. |
na_p |
Probability that any entry is replaced by 'NA' when 'missings = TRUE'. |
Compositions are generated from multivariate normal coordinates and mapped to the simplex through 'composition()'. The eigenvector rotation is included to induce a non-trivial covariance structure in the generated coordinates.
Missing values are introduced completely at random, independently for each cell, with probability 'na_p'.
A list with three components:
The generated compositional data set with simulated zeros and/or missing values.
A matrix of detection limits, with 'dl_par' in censored positions and '0' elsewhere.
The original simulated compositional data set before introducing zeros or missing values.
set.seed(123) sim <- gen_coda_with_zeros_and_missings(100, 4) str(sim) summary(sim$X0) summary(sim$X) table(sim$X == 0, useNA = "ifany")set.seed(123) sim <- gen_coda_with_zeros_and_missings(100, 4) str(sim) summary(sim$X0) summary(sim$X) table(sim$X == 0, useNA = "ifany")
Generic function for the (trimmed) geometric mean.
gmean(x, zero.rm = FALSE, trim = 0, na.rm = FALSE)gmean(x, zero.rm = FALSE, trim = 0, na.rm = FALSE)
x |
A nonnegative vector. |
zero.rm |
a logical value indicating whether zero values should be stripped before the computation proceeds. |
trim |
the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint. |
na.rm |
a logical value indicating whether NA values should be stripped before the computation proceeds. |
The 'house_expend' data set, obtained from Eurostat, records the composition of mean household consumption expenditure across 12 expenditure categories in 27 European Union countries. Some values are rounded zeros.
In addition, the data set contains gross domestic product values for 2005 ('GDP05') and 2014 ('GDP14'). A relevant analysis is the relationship between expenditure compositions and GDP.
house_expendhouse_expend
An object of class data.frame with 27 rows and 15 columns.
In a sample survey of single persons living alone in rented accommodation, twenty men and twenty women were randomly selected and asked to record their expenditure over one month in the following four mutually exclusive and exhaustive commodity groups:
'Hous': housing, including fuel and light,
'Food': foodstuffs, including alcohol and tobacco,
'Serv': services, including transport and vehicles,
'Other': other goods, including clothing, footwear, and durable goods.
household_budgethousehold_budget
An object of class data.frame with 40 rows and 6 columns.
Construct an isometric log-ratio (ilr) basis for a composition with
parts. The ilr basis is an orthonormal basis of the clr-plane and
provides coordinates. The same basis is sometimes referred to as
an orthonormal log-ratio (olr) basis.
ilr_basis(dim, type = "default") olr_basis(dim, type = "default")ilr_basis(dim, type = "default") olr_basis(dim, type = "default")
dim |
Number of parts. It can be:
|
type |
Type of ilr basis to construct. Available options are:
|
For 'type = "default"', the function returns the standard Helmert-type ilr basis. Alternative constructions are available through 'type = "pivot"' and 'type = "cdp"'.
The default basis vectors are:
A matrix with rows and columns representing an
orthonormal log-ratio basis.
Egozcue, J. J., Pawlowsky-Glahn, V., Mateu-Figueras, G., & Barceló-Vidal, C. (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3), 279–300.
ilr_basis(5) ilr_basis(alimentation[, 1:9]) ilr_basis(c("a", "b", "c", "d"), type = "pivot")ilr_basis(5) ilr_basis(alimentation[, 1:9]) ilr_basis(c("a", "b", "c", "d"), type = "pivot")
The 'kilauea_iki' data set contains the chemical composition of volcanic rocks sampled from the lava lake at Kilauea Iki (Hawaii). The data represent major oxide concentrations in fractional form.
kilauea_ikikilauea_iki
A data frame with 17 observations and 11 variables:
Silicon dioxide
Titanium dioxide
Aluminium oxide
Ferric oxide
Ferrous oxide
Manganese oxide
Magnesium oxide
Calcium oxide
Sodium oxide
Potassium oxide
Phosphorus pentoxide
The variability in oxide concentrations is attributed to magnesian olivine fractionation from a single magmatic mass, as suggested by Richter and Moore (1966).
Richter, D. H., & Moore, J. G. (1966). Petrology of Kilauea Iki lava lake, Hawaii. Geological Survey Professional Paper 537-B.
The 'mammals_milk' data set contains the percentages of five constituents of
the milk of 24 mammals:
,
where 'W' is water, 'P' is protein, 'F' is fat, 'L' is lactose, and 'A' is
ash.
mammals_milkmammals_milk
An object of class data.frame with 24 rows and 6 columns.
In an attempt to improve the quality of cow milk, milk from thirty cows was assessed before and after a controlled dietary and hormonal regime over eight weeks. A control group of thirty cows kept under the usual regime was also included.
The 'milk_cows' data set provides the complete before/after milk composition data for the sixty cows, with the proportions of protein ('pr'), milk fat ('mf'), carbohydrate ('ch'), calcium ('Ca'), sodium ('Na'), and potassium ('K').
milk_cowsmilk_cows
An object of class tbl_df (inherits from tbl, data.frame) with 116 rows and 10 columns.
The 'montana' data set contains 229 samples of the concentration (in ppm) of
five minor elements in coal ashes from the Fort
Union formation (Montana, USA), in the Powder River Basin.
The five measured elements form a fully observed subcomposition of a much
larger chemical composition. Since the data are given in parts per million and
all concentrations were measured, a residual component could in principle be
added to close the compositions to .
montanamontana
An object of class data.frame with 229 rows and 6 columns.
Construct the system of all pairwise log-ratios between parts.
pairwise_basis(dim)pairwise_basis(dim)
dim |
Number of parts. It can be a single integer, a matrix or data frame, or a character vector of part names. |
A matrix, or a sparse matrix for large dimensions, whose columns represent all pairwise log-ratio generators.
The 'parliament2017' data set contains the results of the 2017 Catalan Parliament election aggregated by region.
parliament2017parliament2017
A data frame with 42 rows and 9 variables:
Region
Votes for the Ciutadans party
Votes for the Junts per Catalunya party
Votes for the Esquerra Republicana de Catalunya party
Votes for the Partit Socialista de Catalunya party
Votes for the Catalunya Sí que es Pot party
Votes for the Candidatura d'Unitat Popular party
Votes for the Partit Popular party
Votes for other parties
Idescat, statistics on Catalan Parliament elections.
Builds a single grouped constrained principal balance from the first principal component of the grouped composition.
partial_pb_constrained(X, lI = NULL, constrained.criterion = "variance")partial_pb_constrained(X, lI = NULL, constrained.criterion = "variance")
X |
A numeric matrix with strictly positive finite entries. Rows are observations and columns are compositional parts. |
lI |
A list defining a partition of a subset of the columns of
|
constrained.criterion |
Criterion used to choose the constrained
balance. Either |
A list with the following elements:
dimDimension of the grouped problem, equal to
length(lI) - 1.
lIThe input grouping structure.
varianceVariance criterion of the selected grouped balance.
balance_rawInteger vector in describing
the selected grouped split.
balanceThe corresponding one-column balance basis.
constrained.criterionCriterion used to construct the balance.
Finds the grouped balance with maximum variance among all assignments whose
number of active groups is between min_parts and max_parts.
partial_pb_exact( X, lI = NULL, min_parts = 2, max_parts = NULL, method = "restricted" )partial_pb_exact( X, lI = NULL, min_parts = 2, max_parts = NULL, method = "restricted" )
X |
A numeric matrix with strictly positive finite entries. Rows are observations and columns are compositional parts. |
lI |
A list defining a partition of a subset of the columns of
|
min_parts |
Integer. Minimum number of active groups. |
max_parts |
Integer or |
method |
Exhaustive search method. Currently only |
The search enumerates only supports whose size is between
min_parts and max_parts. For each support, signs are generated
in binary Gray-code order, fixing the first active group on the left side to
avoid evaluating both a balance and its sign reversal.
A list with the following elements:
dimDimension of the grouped problem, equal to
length(lI) - 1.
lIThe input grouping structure.
varianceVariance criterion of the best grouped balance.
balance_rawInteger vector in describing
the best grouped split.
balanceThe corresponding one-column balance basis.
min_partsMinimum number of active groups.
max_partsMaximum number of active groups.
Finds a single grouped balance by tabu search over a partition of selected
parts. The search is carried out on groups of parts defined by lI,
using configurable neighbourhood moves.
partial_pb_tabu_search( X, lI = NULL, min_parts = 2, max_parts = NULL, iter = 100, tabu_size = length(lI), ini = NULL, remove_active = TRUE, add_left = TRUE, add_right = TRUE, flip_side = FALSE, swap_zero = FALSE, swap_sides = FALSE, debug = FALSE, constrained.criterion = "variance" )partial_pb_tabu_search( X, lI = NULL, min_parts = 2, max_parts = NULL, iter = 100, tabu_size = length(lI), ini = NULL, remove_active = TRUE, add_left = TRUE, add_right = TRUE, flip_side = FALSE, swap_zero = FALSE, swap_sides = FALSE, debug = FALSE, constrained.criterion = "variance" )
X |
A numeric matrix with strictly positive finite entries. Rows are observations and columns are compositional parts. |
lI |
A list defining a partition of a subset of the columns of
|
min_parts |
Integer. Minimum number of active groups. |
max_parts |
Integer or |
iter |
Integer. Maximum number of tabu search iterations. |
tabu_size |
Integer. Maximum size of the tabu list. |
ini |
Initial grouped split. If |
remove_active |
Logical. Allow moves from |
add_left |
Logical. Allow moves from |
add_right |
Logical. Allow moves from |
flip_side |
Logical. Allow direct moves from |
swap_zero |
Logical. Allow swaps between one active group and one inactive group, preserving the active side. |
swap_sides |
Logical. Allow swaps between one left group and one right group. |
debug |
Logical. If |
constrained.criterion |
Criterion used to initialise the constrained
balance when |
When ini = NULL, the constrained grouped balance is adjusted greedily
so that the initial solution has exactly max_parts active groups.
A list with the selected balance, its variance criterion, the search
path, and a neighbourhoods element recording the active
neighbourhood types.
Construct a basis of principal balances for a compositional data set.
pb_basis( X, method, constrained.criterion = "variance", cluster.method = "ward.D2", ordering = TRUE, ... )pb_basis( X, method, constrained.criterion = "variance", cluster.method = "ward.D2", ordering = TRUE, ... )
X |
Compositional data set. |
method |
Method used to construct the principal balances. One of '"exact"', '"exact2"', '"constrained"', or '"cluster"'. |
constrained.criterion |
Criterion used by the constrained method. Either '"variance"' (default) or '"angle"'. |
cluster.method |
Linkage criterion passed to
|
ordering |
Logical; if 'TRUE', reorder balances by decreasing explained variance. |
... |
Additional arguments passed to |
Several methods are available:
'"exact"': exact computation of principal balances,
'"exact2"': exact computation using incremental Gray-code updates,
'"constrained"': constrained approximation based on a target criterion,
'"cluster"': approximation based on hierarchical clustering.
A matrix whose columns are principal balances.
Martín-Fernández, J. A., Pawlowsky-Glahn, V., Egozcue, J. J., & Tolosana-Delgado, R. (2018). Advances in Principal Balances for Compositional Data. Mathematical Geosciences, 50, 273–298.
set.seed(1) X <- matrix(exp(rnorm(5 * 100)), nrow = 100, ncol = 5) v1 <- apply(coordinates(X, "pc"), 2, var) v2 <- apply(coordinates(X, pb_basis(X, method = "exact")), 2, var) v3 <- apply(coordinates(X, pb_basis(X, method = "constrained")), 2, var) v4 <- apply(coordinates(X, pb_basis(X, method = "cluster")), 2, var) barplot( rbind(v1, v2, v3, v4), beside = TRUE, ylim = c(0, 2), legend = c( "Principal Components", "PB (Exact method)", "PB (Constrained)", "PB (Ward approximation)" ), names = paste0("Comp.", 1:4), args.legend = list(cex = 0.8), ylab = "Variance" )set.seed(1) X <- matrix(exp(rnorm(5 * 100)), nrow = 100, ncol = 5) v1 <- apply(coordinates(X, "pc"), 2, var) v2 <- apply(coordinates(X, pb_basis(X, method = "exact")), 2, var) v3 <- apply(coordinates(X, pb_basis(X, method = "constrained")), 2, var) v4 <- apply(coordinates(X, pb_basis(X, method = "cluster")), 2, var) barplot( rbind(v1, v2, v3, v4), beside = TRUE, ylim = c(0, 2), legend = c( "Principal Components", "PB (Exact method)", "PB (Constrained)", "PB (Ward approximation)" ), names = paste0("Comp.", 1:4), args.legend = list(cex = 0.8), ylab = "Variance" )
Builds a sequential binary partition (SBP) by repeatedly applying grouped
tabu search to select balances over the current sets of parts. At each step,
the best candidate split is retained and the remaining candidate subproblems
are explored until an SBP with balances is obtained.
pb_tabu_search(X, iter = 100, debug = FALSE)pb_tabu_search(X, iter = 100, debug = FALSE)
X |
A numeric matrix with strictly positive finite entries. Rows are observations and columns are compositional parts. |
iter |
Integer. Maximum number of tabu search iterations used in each partial search. |
debug |
Logical. If |
This function provides a heuristic approximation to a principal balance basis. The first balance is searched on the full set of parts, and the subsequent balances are obtained by recursively refining the best currently available split.
All partial searches are initialized with the constrained principal balance of the corresponding grouped composition.
The procedure starts from the trivial grouping where each part forms its own singleton group. After each partial tabu search, up to three candidate subproblems may be generated from the selected solution:
the split between active and inactive groups,
the left active branch,
the right active branch.
All generated candidates are stored, and at each stage the candidate with the largest variance criterion is selected for inclusion in the SBP and for further refinement.
This is a heuristic search strategy and does not guarantee a globally optimal SBP.
An integer matrix representing a sequential binary partition. Rows
correspond to the original parts of X and columns correspond to
balances. Entries are in . The returned matrix has attribute
"max_steps", giving the largest iteration index at which a best
partial solution was found among all partial searches performed.
partial_pb_tabu_search,
sbp_basis
set.seed(1) X <- matrix(rexp(500), ncol = 5) SBP <- pb_tabu_search(X, iter = 30) SBP attr(SBP, "max_steps")set.seed(1) X <- matrix(rexp(500), ncol = 5) SBP <- pb_tabu_search(X, iter = 30) SBP attr(SBP, "max_steps")
Construct an ilr basis rotated according to the principal components of the log-ratio coordinates of a compositional data set.
pc_basis(X)pc_basis(X)
X |
Compositional data set. |
A matrix whose columns define a principal-component-oriented ilr basis.
The perturbation operation combines two compositions by component-wise multiplication and then applies closure to ensure the result remains a valid composition.
perturbation(X, Y)perturbation(X, Y)
X |
A numeric vector, matrix or data.frame containing compositions. |
Y |
A numeric vector, matrix or data.frame with the same number of
parts as |
Perturbation is the analogue of addition in the simplex. Each part of
X is multiplied by the corresponding part of Y, and the result
is closed with closure so that each composition has constant
sum.
An object with the same format as X containing the perturbed
compositions, except that vector X with matrix or data.frame
Y returns the same rectangular format as Y.
x <- c(a = 1, b = 2, c = 3) y <- c(a = 1, b = 1, c = 2) perturbation(x, y) X <- rbind( c(1, 2, 3), c(4, 5, 6) ) perturbation(X, c(1, 1, 2)) perturbation(c(1, 1, 2), X)x <- c(a = 1, b = 2, c = 3) y <- c(a = 1, b = 1, c = 2) perturbation(x, y) X <- rbind( c(1, 2, 3), c(4, 5, 6) ) perturbation(X, c(1, 1, 2)) perturbation(c(1, 1, 2), X)
The 'petrafm' data set contains 100 classified volcanic rock samples from Ontario (Canada). The three-part composition is
Rocks from the calc-alkaline magma series (25 samples) can be distinguished from those of the tholeiitic magma series (75 samples) using an AFM diagram.
petrafmpetrafm
An object of class data.frame with 100 rows and 4 columns.
Plot a balance with node labels under horizontal branches
plot_balance( B, data = NULL, main = "Balance dendrogram", summary_fun = NULL, cex_node = 0.9, offset_node = 0.05, ... )plot_balance( B, data = NULL, main = "Balance dendrogram", summary_fun = NULL, cex_node = 0.9, offset_node = 0.05, ... )
B |
Balance basis matrix |
data |
Optional compositional data used to compute balance summaries |
main |
Plot title |
summary_fun |
Optional function applied to each balance coordinate vector. It must take a numeric vector and return a character string. |
cex_node |
Character expansion for node labels |
offset_node |
Vertical offset below the horizontal branch, relative to max height |
... |
Further arguments passed to plot |
Invisibly returns a data.frame with node coordinates and labels
X = waste[,5:9] B = pb_basis(X, method = 'exact') plot_balance(B) plot_balance(B, data = X, summary_fun = function(x){ q = quantile(x, probs = c(0.25, 0.5, 0.75)) sprintf("%0.2f [%0.2f-%0.2f]", q[2], q[1], q[3]) })X = waste[,5:9] B = pb_basis(X, method = 'exact') plot_balance(B) plot_balance(B, data = X, summary_fun = function(x){ q = quantile(x, probs = c(0.25, 0.5, 0.75)) sprintf("%0.2f [%0.2f-%0.2f]", q[2], q[1], q[3]) })
The 'pollen' data set contains 30 fossil pollen samples from three different
locations (recorded in variable 'group'). The measured composition is the
three-part composition .
pollenpollen
An object of class data.frame with 30 rows and 4 columns.
The 'pottery' data set contains the chemical composition of 45 specimens of Romano-British pottery. The measurements were obtained by atomic absorption spectrophotometry and include nine oxides: Al2O3, Fe2O3, MgO, CaO, Na2O, K2O, TiO2, MnO, and BaO.
The specimens come from five different kiln sites.
potterypottery
An object of class data.frame with 45 rows and 11 columns.
The powering operation raises each part of a composition to a scalar exponent and then applies closure to re-normalize the result as a composition.
powering(X, alpha)powering(X, alpha)
X |
A numeric vector, matrix or data.frame containing compositions. |
alpha |
A numeric scalar or vector. If |
Powering is the analogue of scalar multiplication in the simplex. Each part
is raised to alpha, and the result is closed with closure.
When alpha has one value per row, each composition is powered by its
corresponding value. When it has one value per part, each part receives its
corresponding exponent. For vector X and vector alpha, each row
of the result is X powered by the corresponding element of
alpha.
An object with the same format as X containing the powered
compositions, except that vector X with vector alpha returns
a matrix.
x <- c(a = 1, b = 2, c = 3) powering(x, 2) powering(x, c(1, 2)) X <- rbind( c(1, 2, 3), c(4, 5, 6) ) powering(X, c(1, 2))x <- c(a = 1, b = 2, c = 3) powering(x, 2) powering(x, c(1, 2)) X <- rbind( c(1, 2, 3), c(4, 5, 6) ) powering(X, c(1, 2))
Simulates a random composition whose coordinate system is constructed from
a sequential binary partition induced by a given first balance. The supplied
balance is completed to a full orthonormal basis using
sbp_basis with fill = TRUE.
random_composition_with_fixed_pb(principal_balance, n = 100, sd1 = 5)random_composition_with_fixed_pb(principal_balance, n = 100, sd1 = 5)
principal_balance |
An integer or numeric vector in
|
n |
Integer. Number of observations to generate. |
sd1 |
Numeric value used to scale the first latent coordinate before rotating the simulated coordinates. |
Standard normal latent coordinates are first generated in dimension
, where is the number of parts. Their sample covariance
matrix is then diagonalized, and the associated eigenvectors are used to
rotate the latent coordinates before mapping them back to the simplex using
the basis induced by principal_balance.
This function is mainly intended for examples, simulation studies, and experiments where a specific first balance structure is desired.
A composition matrix with n rows and
length(principal_balance) columns.
Import data from a codapack workspace
read_cdp(fname)read_cdp(fname)
fname |
cdp file name |
Construct a balance basis from a sequential binary partition (SBP) or from a more general collection of balances.
sbp_basis(sbp, data = NULL, fill = FALSE, silent = FALSE)sbp_basis(sbp, data = NULL, fill = FALSE, silent = FALSE)
sbp |
A list of formulas or a matrix describing balances. |
data |
Optional compositional data set used to extract part names when 'sbp' is given as a list of formulas. If 'data = NULL', part names are inferred from the formulas themselves. |
fill |
Logical; if 'TRUE', complete the supplied balances to obtain a full basis. |
silent |
Logical; if 'FALSE', report whether the resulting balances form a basis, and whether they are orthogonal or orthonormal. |
The argument 'sbp' can be specified in two ways:
as a list of formulas, where each formula defines the numerator and the denominator groups of a balance,
as a matrix with one column per balance and one row per part. Positive entries indicate parts in the numerator, negative entries indicate parts in the denominator, and zeros indicate unused parts.
A matrix whose columns are balances.
X <- data.frame( a = 1:2, b = 2:3, c = 4:5, d = 5:6, e = 10:11, f = 100:101, g = 1:2 ) # Sequential SBP construction sbp_basis(list( b1 = a ~ b + c + d + e + f + g, b2 = b ~ c + d + e + f + g, b3 = c ~ d + e + f + g, b4 = d ~ e + f + g, b5 = e ~ f + g, b6 = f ~ g ), data = X) # Chain construction sbp_basis(list( b1 = a ~ b, b2 = b1 ~ c, b3 = b2 ~ d, b4 = b3 ~ e, b5 = b4 ~ f, b6 = b5 ~ g ), data = X) # Formula parts can be inferred when data is not supplied sbp_basis(list( b1 = a ~ b, b2 = b1 ~ c )) # Non-orthogonal system of balances sbp_basis(list( b1 = a + b + c ~ e + f + g, b2 = d ~ a + b + c, b3 = d ~ e + g, b4 = a ~ e + b, b5 = b ~ f, b6 = c ~ g ), data = X) # Direct construction from a contrast matrix sbp_basis(cbind( c( 1, 1, -1, -1), c( 1, -1, 1, -1), c( 1, -1, -1, 1) ))X <- data.frame( a = 1:2, b = 2:3, c = 4:5, d = 5:6, e = 10:11, f = 100:101, g = 1:2 ) # Sequential SBP construction sbp_basis(list( b1 = a ~ b + c + d + e + f + g, b2 = b ~ c + d + e + f + g, b3 = c ~ d + e + f + g, b4 = d ~ e + f + g, b5 = e ~ f + g, b6 = f ~ g ), data = X) # Chain construction sbp_basis(list( b1 = a ~ b, b2 = b1 ~ c, b3 = b2 ~ d, b4 = b3 ~ e, b5 = b4 ~ f, b6 = b5 ~ g ), data = X) # Formula parts can be inferred when data is not supplied sbp_basis(list( b1 = a ~ b, b2 = b1 ~ c )) # Non-orthogonal system of balances sbp_basis(list( b1 = a + b + c ~ e + f + g, b2 = d ~ a + b + c, b3 = d ~ e + g, b4 = a ~ e + b, b5 = b ~ f, b6 = c ~ g ), data = X) # Direct construction from a contrast matrix sbp_basis(cbind( c( 1, 1, -1, -1), c( 1, -1, 1, -1), c( 1, -1, -1, 1) ))
The 'serprot' data set records the percentages of four serum proteins from blood samples of 30 patients. Fourteen patients have one disease and sixteen have another.
The four-part compositions are formed by
.
serprotserprot
An object of class data.frame with 36 rows and 7 columns.
The 'statistician_time' data set records the daily time budget of an academic statistician across 20 working days. The six activities are teaching ('T'), consultation ('C'), administration ('A'), research ('R'), other wakeful activities ('O'), and sleep ('S').
These activities may also be grouped into work ('T', 'C', 'A', 'R') and leisure ('O', 'S'). The data allow investigation of the relationship between detailed time-allocation patterns and the broader division between work and leisure.
statistician_timestatistician_time
An object of class data.frame with 20 rows and 7 columns.
Variation array is returned.
variation_array(X, include_means = FALSE, ml_covariance = FALSE)variation_array(X, include_means = FALSE, ml_covariance = FALSE)
X |
Compositional dataset |
include_means |
if TRUE logratio means are included in the lower-left triangle |
ml_covariance |
if TRUE Maximum-likelihood estimation of the covariance for the multivariate normal distribution is used (dividing the scatter matrix by n instead of n-1) |
variation array matrix
set.seed(1) X = matrix(exp(rnorm(5*100)), nrow=100, ncol=5) variation_array(X) variation_array(X, include_means = TRUE)set.seed(1) X = matrix(exp(rnorm(5*100)), nrow=100, ncol=5) variation_array(X) variation_array(X, include_means = TRUE)
The 'waste' data set studies the relationship between waste composition and floating population in Catalonia. The actual population of a municipality combines census population and floating population (tourists, seasonal visitors, temporary workers, and similar short-term residents), expressed as equivalent full-time residents.
The composition of urban solid waste is classified into five parts:
'x1': non-recyclable waste,
'x2': glass,
'x3': light containers,
'x4': paper and cardboard,
'x5': biodegradable waste.
Waste generation and composition are influenced by floating population, which makes waste composition a useful predictor of this difficult-to-measure demographic quantity.
wastewaste
An object of class data.frame with 215 rows and 10 columns.
Coenders, G., Martín-Fernández, J. A., & Ferrer-Rosell, B. (2017). When relative and absolute information matter: compositional predictor with a total in generalized linear models. Statistical Modelling, 17(6), 494–512.
The 'weibo_hotels' data set compares the use of Weibo (the Chinese equivalent of Facebook) in hospitality e-marketing between small and medium establishments and larger hotel businesses in China.
The 50 latest posts from the Weibo page of each hotel () were
content-analysed and coded into a four-part composition:
.
Hotels were also classified by size as large ('L') or small ('S').
weibo_hotelsweibo_hotels
An object of class data.frame with 10 rows and 5 columns.