Abstract
The provenance of siliclastic sediment may be traced using a wide variety of chemical,
mineralogical and isotopic proxies. These define three distinct data types: (1) compositional data
such as chemical concentrations; (2) point-counting data such as heavy mineral compositions;
and (3) distributional data such as zircon U-Pb age spectra. Each of these three data types
requires separate statistical treatment. Central to any such treatment is the ability to quantify
the ‘dissimilarity’ between two samples. For compositional data, this is best done using a logratio
distance. Point-counting data may be compared using the chi-square distance, which deals better
with missing components (zero values) than the logratio distance does. Finally, distributional data
can be compared using the Kolmogorov-Smirnov and related statistics.
For small datasets using a single provenance proxy, data interpretation can sometimes be done
by visual inspection of ternary diagrams or age spectra. But this no longer works for larger and
more complex datasets. This paper reviews a number of multivariate ordination techniques to
aid the interpretation of such studies. Multidimensional Scaling (MDS) is a generally applicable
method that displays the salient dissimilarities and differences between multiple samples as a
configuration of points in which similar samples plot close together and dissimilar samples plot far
apart. For compositional data, classical MDS analysis of logratio data is shown to be equivalent to
Principal Component Analysis (PCA). The resulting MDS configurations can be augmented with
compositional information as biplots. For point-counting data, classical MDS analysis of chi-square
distances is shown to be equivalent to Correspondence Analysis (CA). This technique also produces
biplots.
Thus, MDS provides a common platform to visualise and interpret all types of provenance data. Generalising the method to three-way dissimilarity tables provides an opportunity to combine several multi-method datasets together and thereby facilitate the interpretation of ‘Big Data’. This paper presents a set of tutorials using the statistical programming language R. It illustrates the theoretical underpinnings of compositional data analysis, PCA, MDS and other concepts using toy examples, before applying these methods to real datasets with the provenance package.
At its most basic level, sedimentary provenance analysis identifies the mineralogical, chemical or isotopic
composition of individual grains, or assemblages of multiple grains in siliclastic sediment. These properties
can then be used to group samples of similar affinity, and thereby trace the flow of sediment through
a sediment routing system (e.g., Morton, 1991; Weltje and von Eynatten, 2004; Gerdes and
Zeh, 2006; Rittner et al., 2016; Mazumder, 2017). Different levels of statistical complexity arise when
multiple samples are compared to each other, or when multiple provenance proxies are applied to multiple
samples.
Using a number of short tutorials, this paper will introduce several simple but effective exploratory data
analysis techniques that can help to make geological sense of ‘Big Data’ in a sedimentary provenance context.
The term ‘exploratory’ means that these techniques allow the user to explore the data independent of any
prior knowledge about the geological setting (Tukey, 1977; DuToit et al., 1986; Kenkel, 2006; Martinez
et al., 2017). It groups a number of graphical methods to visualise the data and reveal patterns of similarity
and differences between samples and variables. This paper will not introduce methods such as discriminant
analysis that formally assign samples to pre-defined provenance areas or petrotectonic settings
(Bhatia, 1983; Bhatia and Crook, 1986).
These notes do by no means pretend to give a comprehensive overview of exploratory data analysis. The
selection of methods presented herein is heavily biased towards techniques that are implemented in a software
package for (sedimentary) geology that was created by Vermeesch et al. (2016).
provenance is available free of charge at the Comprehensive R Archive Network (CRAN, http://cran. r-project.org/web/packages), on GitHub (http://github.com/pvermees/provenance), or via http://provenance.london-geochron.com. The package is written in the statistical programming language R, which is available for Windows, Mac OS-X and Linux/Unix. The easiest way to install the latest stable version of the package is to first install R from http://r-project.org and then type the following code at the command prompt (i.e. the ‘>’):
> install.packages('provenance')
Once installed, the package can be loaded by typing:
> library(provenance)
There are two ways to use provenance. The first of these is through a query-based user interface. To access this interface, type:
> provenance()
The main advantage of the query-based user interface is that it does not require any knowledge of R. Its main
disadvantage is the relative lack of flexibility and the difficulty to automate complex and/or repetitive tasks.
The second way to use provenance is via the R language itself. This is the quicker and more flexible option,
whose only downside is a steeper learning curve compared to the query-based interface. This tutorial will help
the reader to climb this learning curve whilst explaining the theoretical underpinnings of the methods that
are implemented in the package.
This text assumes that the reader has a basic understanding of the R programming language, although a
short tutorial is provided in the Appendix for readers who lack such prior knowledge. The paper also assumes
that the reader has some basic statistical knowledge. More specifically (s)he is expected to be familiar with
the normal distribution, and understand the meaning of the arithmetic mean, standard deviation and
confidence intervals. The normal distribution underpins much of ‘conventional’ statistics, but we will see
that it rarely applies to provenance data. This, in fact, is the main take-home message of this
paper.
There exist three fundamental types of provenance data:
Finally, Section 11 will consider the case where multiple compositional, point-counting and/or distributional datasets are combined. Procrustes analysis and 3-way multidimensional scaling are statistical techniques that aim to extract geologically meaningful trends from such ‘Big Data’ (Vermeesch and Garzanti, 2015).
Summary: This tutorial investigates the ratios of two sets of random numbers. It shows that the arithmetic
mean and confidence intervals of these synthetic data yield nonsensical results. These problems are solved by
a logarithmic transformation. This simple example has important implications because ratio data are common
in sedimentary provenance analysis, and are closely related to compositional data, which are introduced in
Section 3.
Many statistical operations assume normality. This includes averaging, the construction of confidence
intervals, regression, etc. Although Gaussian distributions are common, it would be unwise to assume
normality for all datasets. This paper makes the point that, more often than not, the normality assumption is
invalid in the context of sedimentary provenance analysis. Ignoring this non-normality can lead to
counter-intuitive and plainly wrong results.
To illustrate this point, we will now consider the simple case of ratio data, which are quite common in the Earth Sciences. Take, for example, the ratio of apatite to tourmaline in heavy mineral analysis, which has been used to indicate the duration of transport and storage prior to deposition (Morton and Hallsworth, 1999). In this part of the tutorial, we will investigate the statistics of ratio data using a synthetic example.
ns <- 100
A <- runif(ns)
B <- runif(ns)
Intuitively, given that A∕B = 1∕(B∕A) and B∕A = 1∕(A∕B), we would expect the same to be true for their means (A∕B) and (B∕A). However, when we define two new variables for the (inverse) of the (reciprocal) mean ratios:
AB.mean <- mean(A/B)
inv.BA.mean <- 1/mean(B/A)
then we find that AB.mean≠inv.BA.mean. So (A∕B)≠1∕(B∕A) and (B∕A)≠1∕(A∕B)! This is a counterintuitive and clearly wrong result.
AB.sd <- sd(A/B)
LL <- AB.mean - 2*AB.sd
UL <- AB.mean + 2*AB.sd
then we find that LL < 0, which is nonsensical since A and B are both strictly positive numbers and their ratio is therefore not allowed to take negative values either. Herein lies the root of the problem. The sampling distribution of A/B is positively skewed, whereas the normal distribution is symmetric with tails ranging from -∞ to +∞. Geologists frequently encounter strictly positive numbers. Time, for example, is a strictly positive quantity, expressed by geochronologists as ‘years before present’, where ‘present’ is equivalent to zero.
logAB <- log(A/B)
logBA <- log(B/A)
AB.gmean <- exp(mean(logAB))
inv.BA.gmean <- 1/exp(mean(logBA))
then we find that AB.gmean = inv.BA.gmean, which is a far more sensible result.
LL <- exp( mean(logAB) - 2*sd(logAB) )
UL <- exp( mean(logAB) + 2*sd(logAB) )
also produces strictly positive values, as expected.
Summary: Compositional data such as chemical concentrations suffer from the same problems as the
ratio data of Section 2. The tutorial uses a geochemical dataset of Al2O3 – (CaO+Na2O) –
K2O data to demonstrate that the ‘conventional’ arithmetic mean and confidence intervals are
inappropriate for data that can be constrained to a constant sum. A logratio transformation solves these
problems.
Like the ratios of the previous Section, the chemical compositions of rocks and minerals are also expressed as strictly positive numbers. They, however, do not span the entire range of positive values, but are restricted to a narrow subset of that space, ranging from 0 to 1 (if fractions are used) or from 0 to 100% (using percentage notation). The compositions are further restricted by a constant sum constraint:
for an n-component system. Consider, for example, a three-component system {x,y,z}, where x + y + z = 1. Such compositions can be plotted on ternary diagrams, which are very popular in geology. Well-known examples are the Q-F-L diagram of sedimentary petrography (Garzanti, 2019), the A-CN-K diagram in weathering studies (Nesbitt and Young, 1989), and the A-F-M, Q-A-P and Q-P-F diagrams of igneous petrology (Le Maitre et al., 2002). The very fact that it is possible to plot a ternary diagram on a two-dimensional sheet of paper already tells us that it really displays only two and not three dimensions worth of information. Treating the ternary data space as a regular Euclidean space with Gaussian statistics leads to wrong results, as illustrated by the following example.
ACNK <- read.csv('ACNK.csv',row.names=1,header=TRUE,check.names=FALSE)
where row.names=1 indicates that the sample names are contained in the first column; and the header=TRUE and check.names arguments indicate that the first column of the input table contains the column headers, one of which contains a special character (‘+’).
mu <- colMeans(ACNK)
sig <- apply(ACNK,MARGIN=2,FUN='sd')
and construct the 2-sigma confidence confidence bounds:
LL <- mu - 2*sig
UL <- mu + 2*sig
library(provenance)
Now plot the Al2O3, (CaO + Na2O) and K2O compositions on a ternary diagram alongside the arithmetic mean composition:
plot(ternary(ACNK),pch=20,labels=NA)
points(ternary(mu),pch=22,bg='blue')
where ternary(x) creates a ternary data ‘object’ from a variable x, and pch=20 and pch=22 produce filled circles and squares, respectively. Notice how the arithmetic mean plots outside the data cloud, and therefore fails to represent the compositional dataset (Figure 1).
source('helper.R')
ternary.polygon(LL,UL,col='blue')
Note that the polygon partly plots outside the ternary diagram, into physically impossible negative data space. This nonsensical result is diagnostic of the dangers of applying ‘normal’ statistics to compositional data. It is similar to the negative limits for the ratio data in Section 2.
A comprehensive solution to the compositional data conundrum was only found in the 1980s, by Scottish statistician John Aitchison (1986). It is closely related to the solution of the ratio averaging problem discussed in the previous section. The trick is to map the n-dimensional composition to an (n-1)-dimensional Euclidean space by means of a logratio transformation. For example, in the ternary case, we can map the compositional variables x, y and z to two transformed variables v and w:
| (1) |
After performing the statistical analysis of interest (e.g., calculating the mean or constructing a 95% confidence region) on the transformed data, the results can then be mapped back to compositional space with the inverse logratio transformation. For the ternary case:
| (2) |
This transformation is implemented in the provenance package. Let us use this feature to revisit the K-CN-A dataset, and add the logratio mean and 95% confidence region to the ternary diagram for comparison with the arithmetic mean and confidence polygon obtained before.
mug <- exp(colMeans(log(ACNK)))
points(ternary(mug),pch=22,bg='red')
This red square falls right inside the data cloud, an altogether more satisfying result than the arithmetic mean shown in blue (Figure 1).
ACNK2 <- read.compositional('ACNK.csv',check.names=FALSE)
Adding the 95% confidence contour using provenance’s ternary.ellipse function:
ternary.ellipse(ACNK,alpha=0.05)
creates a 95% confidence ellipse in logratio space, and maps this back to the ternary diagram. This results in a ‘boomerang’-shaped contour that tightly hugs the compositional data whilst staying inside the boundaries of the ternary diagram (Figure 1).
This Section (and Section 8) only touch the bare essentials of compositional data analysis. Further information about this active field of research can be found in Pawlowsky-Glahn et al. (2015). For additional R-recipes for compositional data analysis using the compositions package, the reader is referred to Van den Boogaart and Tolosana-Delgado (2013).
Summary: Point-counting data such as heavy mineral counts are underlain by compositional distributions.
But they are not amenable to the logratio transformations introduced in Section 3 because they commonly
contain zero values. Averages and confidence intervals for this type of data require hybrid statistical models
combining compositional and multinomial aspects.
The mineralogical composition of silicilastic sediment can be determined by tallying the occurrence of various
minerals in a representative sample of (200-400, say) grains (Van der Plas and Tobi, 1965; Weltje, 2002).
Such point-counting data are closely related to the compositional data that were discussed in the
previous section. However, there are some crucial differences between these two data classes
(Vermeesch, 2018b).
Point-counting data are associated with significant (counting) uncertainties, which are ignored by classical compositional data analysis. As a consequence, point-counting data often contain zero values, which are incompatible with the log-ratio transformation defined in Equation 1. Although ‘rounding zeros’ also occur in compositional data, where they can be removed by ‘imputation’ methods (Martín-Fernández et al., 2003; Bloemsma and Weltje, 2015), these methods are ill-suited for point-counting datasets in which zeros are the rule rather than the exception.
HM <- read.counts('HM.csv')
Galbraith (1988)’s radial plot is an effective way to visually assess the degree to which the random counting uncertainties account for the observed scatter of binary point-counting data. Applying this to the epidote/garnet-ratio of the heavy mineral data (Figure 2):
radialplot(HM,num='ep',den='gt')
Each circle on the resulting scatter plot represents a single sample in the HM dataset. Its epidote/garnet-ratio can be obtained by projecting the circle onto the circular scale. Thus, low and high ratios are found at negative and positive angles to the origin, respectively. The horizontal distance of each point from the origin is proportional to the total number of counts in each sample and, hence, to its precision. An (asymmetric) 95% confidence interval for the ep/gt-ratio of each sample can be obtained by projecting both ends of a 2-sigma confidence bar onto the circular scale.
Suppose that the data are underlain by a single true population and random counting uncertainties are the sole source of scatter. Let θ be the true but unknown proportion of the binary subpopulation that consists of the first mineral (epidote, say). Then (1 - θ) is the fraction of grains that belong to the second mineral (garnet). Further suppose that we have counted a representative sample of N grains from this population. Then the probability that this sample contains n grains of the first mineral and m = N - n grains of the second mineral follows a binomial distribution:
| (3) |
If multiple samples in a dataset are indeed underlain by the same fraction θ, then approximately 95% of the
samples should fit within a horizontal band of two standard errors drawn on either side of the origin. In this
case, θ can be estimated by pooling all the counts together and computing the proportion of the first mineral
as a fraction of the total number of grains counted (Vermeesch, 2018b).
However, the ep/gt-ratios in HM scatter significantly beyond the 2-sigma band (Figure 2.i). The data are therefore said to be overdispersed with respect to the counting uncertainties. This indicates the presence of true geological dispersion in the compositions that underly the point-counting data. The dispersion can be estimated by a random effects model with two parameters:
| (4) |
where β is a new variable that follows a normal distribution with mean μ and standard deviation σ, both of
which have geological significance.
The ‘central ratio’ is given by exp[] where is the maximum likelihood estimate for μ. This estimates the geometric mean (ep/gt-) ratio of the true underlying composition. The ‘dispersion’ () estimates the geological scatter (Galbraith, 1990; Vermeesch, 2018b). In the case of our heavy mineral dataset, the epidote-garnet subcomposition is 75% dispersed. This means that the coefficient of variation (standard deviation divided by geometric mean) of the true epidote/garnet-ratios is approximately 0.75.
tern <- ternary(HM,x='gt',y='ep',z='zr')
plot(tern,pch=1,labels=NA)
ternary.ellipse(tern,alpha=0.05)
> central(HM)
This produces a matrix with the proportions of each component; its standard error; the dispersion of the binary subcomposition formed by the component and the amalgamation of all remaining components; and the outcome of a chi-square test for homogeneity.
Summary: This Section investigates a 16-sample, 1547-grain dataset of detrital zircon U-Pb ages from
Namibia. It uses Kernel Density Estimation and Cumulative Age Distributions to visualise this dataset, and
introduces the Kolmogorov-Smirnov statistic as a means of quantifying the dissimilarity between
samples.
Compositional data such as the chemical concentrations of Sections 3 and 8 are characterised by the relative proportions of a number of discrete categories. A second class of provenance proxies are based on the sampling distribution of continuous variables such as zircon U-Pb ages (Fedo et al., 2003; Gehrels, 2011). These distributional data do not fit in the statistical framework of the (logistic) normal distribution.
DZ <- read.distributional("DZ.csv")
DZ now contains an object of class distributional containing the zircon U-Pb ages of 16 Namibian sand samples. To view the names of these samples:
> names(DZ)
| (5) |
where is the ‘kernel’ and bw is the ‘bandwidth’ (Silverman, 1986; Vermeesch, 2012). The kernel can be any unimodal and symmetric shape (such as a box or a triangle), but is most often taken to be Gaussian (where xi is the mean and bw the standard deviation). The bandwidth can either be set manually, or selected automatically based on the number of data points and the distance between them. provenance implements the automatic bandwidth selection algorithm of Botev et al. (2010) but a plethora of alternatives are available in the statistics literature. To plot all the samples as KDEs:
kdes <- KDEs(DZ)
plot(kdes,ncol=2)
where ncol specifies the number of columns over which the KDEs are divided.
| (6) |
where 1(*) = 1 if * is true and 1(*) = 0 if * is false. The main advantages of CADs over KDEs is that (i) they do not require any smoothing (i.e., there is no ‘bandwidth’ to choose), and (ii) they can superimpose multiple samples on the same plot. Plotting samples N1, N2 and N4 of the Namib dataset:
plot(DZ,snames=c('N1','N2','N4'))
we can see that (1) the CADs of samples N1 and N2 plot close together with steepest sections at 500 Ma and 1000 Ma, reflecting the prominence of those age components; (2) sample N4 is quite different from N1 and N2.
> N124 <- subset(DZ,select=c('N1','N2','N4'))
> diss(N124)
This shows that the KS-statistic between N1 and N2 is KS(N1,N2) = 0.18, whereas KS(N1,N4) = 0.44, and KS(N2,N4) = 0.35 (Figure 3). The KS statistic is a non-negative value that takes on values between zero (perfect overlap between two distributions) and one (no overlap between two distributions). It is symmetric because the KS statistic between any sample x and another sample y equals that between y and x. For example, KS(N1,N2) = 0.18 = KS(N2,N1). And finally, the KS-statistic obeys the triangle equality, which means that the dissimilarity between any two samples is always smaller than or equal to the sum of the dissimilarities between those two samples and a third. For example, KS(N1,N2) = 0.18 < KS(N1,N4) + KS(N2,N4) = 0.44 + 0.35 = 0.79. These three characteristics qualify the KS statistics as a metric, which makes it particularly suitable for Multidimensional Scaling (MDS) analysis (see Section 7). The KS statistic is just one of many dissimilarity measures for distributional data. However, not all these alternatives to the KS statistic fulfil the triangle inequality (Vermeesch, 2018a).
Summary: Principal Component Analysis is an exploratory data analysis method that takes a high dimensional dataset as input and produces a lower (typically two-) dimensional ‘projection’ as output. PCA is closely related to Multidimensional Scaling (MDS), compositional PCA, and Correspondence Analysis (CA), which are introduced in Sections 7–9. This tutorial introduces PCA using the simplest working example of three two-dimensional points. Nearly identical examples will be used in Sections 7–9.
| (7) |
Generating and plotting X in R:
X <- matrix(c(-1,3,4,7,2,3),nrow=3,ncol=2)
colnames(X) <- c('a','b')
plot(X)
yields a diagram in which two of the three data points plot close together while the third one plots further away.
| (8) |
where C is the centre (arithmetic mean) of the two data columns; S are the normalised scores; the diagonals of V correspond to the standard deviations of the two principal components; and D is a rotation matrix (the principal directions). S, V and D can be recombined to define two more matrices:
| (9) |
| (10) |
where P is a matrix of transformed coordinates (the principal components or scores) and L are the scaled eigenvectors or loadings. Figure 4.i shows X as numbers on a scatterplot, C as a yellow square, and 12,1C ± L as a cross. Thus, the first principal direction (running from the upper left to the lower right) has been stretched by a factor of (3.67∕0.71) = 5.2 w.r.t the second principal component, which runs perpendicular to it. The two principal components are shown separately as Figure 4.ii, and their relative contribution to the total variance of the data as Figure 4.iv. Figure 4 can be reproduced with the following R-code:
source('helper.R')
PCA2D(X)
pc <- prcomp(USArrests, scale=TRUE)
biplot(pc)
You will see that the loading vectors for Murder, Assault and Rape are all pointing in approximately the same direction (dominating the first principal component), perpendicular to UrbanPop (which dominates the second principal component). This tells us that crime and degree of urbanisation are not correlated in the United States.
Summary: Multidimensional Scaling (MDS) is a less restrictive superset of PCA. This tutorial uses a geographical example to demonstrate how MDS re-creates a map of Europe from a table of pairwise distances between European cities. Applying the same algorithm to the synthetic toy-example of Section 6 yields exactly the same output as PCA.
> eurodist
conf <- cmdscale(eurodist)
Set up an empty plot with a 1:1 aspect ratio, and then label the MDS configuration with the city names:
plot(conf,type='n',asp=1)
text(conf,labels=labels(eurodist))
Note that the map may be turned ‘upside down’. This reflects the rotation invariance of MDS configurations.
d <- dist(X)
which produces:
| (11) |
conf2 <- cmdscale(d)
Finally, plot the MDS configuration as a scatterplot of text labels:
plot(conf2,type='n')
text(conf2,labels=1:3)
Which is identical to the PCA configuration of Figure 4.iii apart from an arbitrary rotation or reflection.
library(MASS)
To compute and plot the non-metric MDS configuration:
conf3 <- isoMDS(eurodist)$points
plot(conf3,type='n',asp=1)
text(conf3,labels=labels(eurodist))
where conf3 is a list with two items: stress, which expresses the goodness-of-fit of the MDS
configuration; and points, which contains the configuration. The ‘$’ operator is used to access any of
these items.
Non-metric MDS is a less-restrictive superset of classical MDS and, hence, PCA, which opens this methodology up to non-Euclidean dissimilarity measures, such as the KS-distance introduced in Section 5 (Vermeesch, 2013).
Summary: PCA can be applied to compositional data after logratio transformation. This tutorial first applies
such compositional PCA to a three sample, three variable dataset that is mathematically equivalent to the
three sample two variable toy example of Section 6. Then, it applies the same method to a real dataset of
major element compositions from Namibia. This is first done using basic R and then again (and more
succinctly) using the provenance package.
Consider the following trivariate (a, b and c) dataset of three (1, 2 and 3) compositions that are constrained to a constant sum (ai + bi + ci = 100% for 1 ≤ i ≤ 3, Figure 5):
| (12) |
It would be wrong to apply conventional PCA to this dataset, because this would ignore the constant sum constraint. As was discussed in Section 6, PCA begins by ‘centering’ the data via the arithmetic mean. Section 3 showed that this yields incorrect results for compositional data. Although the additive logratio transformation (alr) of Equation 1 solves the closure problem, it is not suitable for PCA because the axes of alr-coordinate spaces are not orthogonal to each other. For example, the distance between the alr-coordinates of samples 2 and 3 is 1.74 if b is used as a common denominator (i.e., d[2,3]b = 1.74), but 2.46 when c is used as a common denominator (d[2,3]c = 2.46).
The fact that distances are not unequivocally defined in alr-space spells trouble for PCA. Recall the equivalence of PCA and classical MDS, which was discussed in Section 7. MDS is based on dissimilarity matrices, so if distances are not well defined then neither are the MDS configuration and, hence, the principal components. This issue can be solved by centred logratio transformation (clr):
| (13) |
where gi is the geometric mean of the ith sample:
gi = exp |
Unlike alr-coordinate axes, clr-coordinate axes are orthogonal and unequivocally define a set of well behaved distances. Applying the clr-transformation to the data of Equation 12 yields a new trivariate dataset:
| (14) |
where g stands for the geometric mean of each row. Note that each of the rows of Xc adds up to zero. Thanks to the symmetry of the clr-coordinates, the distances between the rows (which are also known as Aitchison distances) are well defined. Subjecting Equation 14 to the same matrix decompositions as Equation 8 yields:
| (15) |
so that
| (16) |
Note that, even though this yields three principal components instead two, the variance of the third component in matrix V is zero. Therefore, all the information is contained in the first two components. Also note that the first two principal components of the compositional dataset are identical to those of the PCA example shown in Section 6 (Equation 9). This is, of course, intentional.
# load the major element composition of Namib sand:
Major <- read.csv(file="Major.csv", header=TRUE,row.names=1)
# apply the centred logratio transformation:
cMajor <- log(Major) - rowMeans(log(Major)) %*% matrix(1,1,ncol(Major))
# perform PCA of the logratio transformed data:
pc <- prcomp(cMajor)
# plot the results of the PCA analysis:
biplot(pc)
library(provenance)
# tell R that Major.csv contains compositional data:
Major.comp <- read.compositional('Major.csv')
# perform the principal component analysis:
pc.comp <- PCA(Major.comp)
# create the biplot:
plot(pc.comp)
where the read.compositional function reads the .csv file into an object of class compositional, thus ensuring that logratio statistics are used in all provenance functions (such as PCA) that accept compositional data as input. Also note that the provenance package overloads the plot function to generate a compositional biplot when applied to the output of the PCA function.
Summary: Point-counting data can be analysed by MDS using the Chi-square distance. Correspondence
Analysis (CA) yields identical results whilst producing biplots akin to those obtained by PCA. This
tutorial first uses a simple three sample, three variable toy example that is (almost) identical to
those used in Sections 6-8, before applying CA to a real dataset of heavy mineral counts from
Namibia.
Consider the following three sets of trivariate point-counting data:
| (17) |
This dataset intentionally looks similar on a ternary diagram to the compositional dataset of Section 3. The
only difference is the presence of zeros, which preclude the use of logratio statistics. This problem can be
solved by replacing the zero values with small numbers, but this only works when their number is small
(Martín-Fernández et al., 2003; Bloemsma and Weltje, 2015). Correspondence Analysis (CA) is an
alternative approach that does not require such ‘imputation’.
CA is a dimension reduction technique that is similar in many ways to PCA (Greenacre, 1984; Vermeesch, 2018b). CA, like PCA, is a special case of MDS. Whereas ordinary PCA uses the Euclidean distance, and compositional data can be compared using the Aitchison distance, point-counting data can be compared by means of a chi-square distance:
| (18) |
where X⋅k = ∑ i=1mXik, Xi⋅ = ∑ k=1KXik and X⋅⋅ = ∑ i=1m ∑ k=1KXik. Applying this formula to the data of Equation 17 produces the following dissimilarity matrix:
| (19) |
Note that, although these values are different than those in Equation 11, the ratios between them are
(approximately) the same. Specifically, d1,2∕d1,3 = 1.5∕1.5 = 1 for Equation 19 and d1,2∕d1,3 = 6.4∕6.4 = 1
for Equation 11; or d1,2∕d2,3 = 1.5∕0.33 = 4.5 for Equation 19 and d1,2∕d2,3 = 6.4∕1.4 = 4.5 for
Equation 11. Therefore, when we subject our point-counting data to an MDS analysis using
the chi-square distance, the resulting configuration appears nearly identical to the example of
Section 7.
The following script applies CA to the heavy mineral composition of Namib desert sand. It loads a table called HM.csv that contains point counts for 16 samples and 15 minerals. To reduce the dominance of the least abundant components, the code extracts the most abundant minerals (epidote, garnet, amphibole and clinopyroxene) from the datasets and amalgamates the ultra-stable minerals (zircon, tourmaline and ru- tile), which have similar petrological significance.
library(provenance)
# tell R that HM.csv contains point-counting data:
dat <- read.counts('HM.csv')
# select and amalgamate the components of interest:
HM <- amalgamate(dat,ztr=c('zr','tm','rt'),ep='ep', gt='gt',amp='amp',cpx='cpx')
# perform the correspondence analysis:
config <- CA(HM)
# plot the results as a biplot:
plot(config)
Summary: This brief tutorial applies MDS to the detrital zircon U-Pb dataset from Namibia, using the
Kolmogorov-Smirnov statistic as a dissimilarity measure.
Section 7.5 introduced non-metric MDS as a less-restrictive superset of classical MDS and, hence, PCA. This opens this methodology up to non-Euclidean dissimilarity measures, such as the KS-distance introduced in Section 5.4 (Vermeesch, 2013, 2018a).
library(provenance)
# read the detrital zircon dataset:
DZ <- read.distributional("DZ.csv")
# calculate and plot the (non-metric)
# MDS configuration using the KS distance:
DZ.X <- MDS(DZ)
plot(DZ.X)
In this case, the overloaded plot function produces not one but two graphics windows. The first of these shows the MDS configuration, whereas the second shows the Shepard plot (Shepard, 1962; Kruskal and Wish, 1978). This is a scatterplot that sets out the Euclidean distances between the samples measured on the MDS configuration against the disparities, which are defined as:
| (20) |
where KS[xi,xj] is the KS-distance between the ith and jth sample and f is a monotonic transformation, which is shown as a step-function. The Shepard plot allows the user to visually assess the goodness-of-fit of the MDS configuration. This can be further quantified using the ‘stress’ parameter:
| (21) |
The lower the stress, the better the fit. For moderately sized datasets, stress values should be less than 10% (Kruskal and Wish, 1978). For larger datasets, a higher dimensional solution may be necessary, using the optional parameter k of provenance’s MDS function (Stephan et al., 2018).
Summary: The tutorial jointly analyses 16 Namibian samples using five different provenance proxies,
including all three data classes introduced in Sections 3-5. It introduces Procrustes Analysis and 3-way MDS
as two alternative ways to extract geologically meaningful information from these multivariate ‘big’
dataset.
It is increasingly common for provenance studies to combine compositional, point-counting or distributional datasets together (Vermeesch and Garzanti, 2015; Rittner et al., 2016). Linking together bulk sediment data, heavy mineral data and single mineral data requires not only a sensible statistical approach, but also a full appraisal of the impact of mineral fertility and heavy mineral concentration in eroded bedrock and derived clast sediment (Garzanti and Andò, 2007; Malusà et al., 2016; Malusà and Garzanti, 2019). Assuming that this appraisal has been made, this Section introduces some exploratory data analysis tools that can reveal meaningful structure in complex datasets.
All these datasets can be visualised together in a single summary plot:
library(provenance)
# major elements:
Major <- read.compositional('Major.csv')
# trace elements:
Trace <- read.compositional('Trace.csv')
# petrography:
QFL <- read.counts('PT.csv',colmap=cm.colors)
# heavy minerals:
HM <- read.counts('HM.csv',colmap=cm.colors)
# zircon U-Pb dates:
DZ <- read.distributional('DZ.csv')
# generate the plot:
summaryplot(Major,Trace,QFL,HM,KDEs(DZ),ncol=2)
where Major, Trace, QFL and HM are shown as pie charts (the latter two with a different colour map than the former), and DZ as KDEs. Adding DZ instead of KDEs(DZ) would plot the U-Pb age distributions as histograms.
proc <- procrustes(Major,Trace,QFL,HM,DZ)
plot(proc)
scal <- indscal(Major,Trace,QFL,HM,DZ)
plot(scal)
This code produces two pieces of graphical output (Figure 6). The ‘group configuration’ represents the
consensus view of all provenance proxies considered together. This looks very similar to
the Procrustes configuration created by the previous code snippet. The second piece of
graphical information displays not the samples but the provenance proxies. It shows the
weights that each of the proxies attach to the horizontal and vertical axis of the group
configuration.
For example, the heavy mineral compositions of the Namib desert sands can be (approximately)
described by stretching the group configuration vertically by a factor of 1.9, whilst shrinking it
horizontally by a factor of 0.4. In contrast, the configurations of the major and trace element
compositions for the same samples are obtained by shrinking the group configuration vertically by a
factor 0.8, and stretching it horizontally by a factor of 1.3. Thus, by combining these weights with
the group configuration yields five ‘private spaces’ that aim to fit each of the individual
datasets.
INDSCAL group configurations are not rotation-invariant, in contrast with the 2-way MDS
configurations of Section 7. This gives geological meaning to the horizontal and vertical axes of the
plot. For example, samples N1 and N10 plot along a vertical line on the group configuration, indicating
that they have different heavy mineral compositions, but similar major and trace element
compositions. On the other hand, samples N4 and N8 plot along a horizontal line, indicating that
they have similar major and trace element compositions but contrasting heavy mineral
compositions.
Closer inspection of the weights reveals that the datasets obtained from fractions of specific densities (HM, PT and DZ) attach stronger weights to the vertical axis, whereas those that are determined on bulk sediment (Major and Trace) dominate the horizontal direction. Provenance proxies that use bulk sediment are more sensitive to winnowing effects than those that are based on density separates. This leads to the interpretation that the horizontal axis separates samples that have been affected by different degrees of hydraulic sorting, whereas the vertical direction separates samples that have different provenance.
The statistical toolbox implemented by the provenance package is neither comprehensive nor at the cutting
edge of exploratory data analysis. PCA, MDS, CA, and KDEs are tried and tested methods that have been
around for many decades. Nothing new is presented here and that is intentional. This paper makes the point
that even the most basic statistical parameters such as the arithmetic mean and standard deviation cannot
be blindly applied to geological data (Chayes, 1949, 1960; Weltje, 2002). Great care must be taken when
applying established techniques to sedimentary provenance data such as chemical compositions, point-counts
or U-Pb age distributions. Given the difficulty of using even the simplest of methods correctly, geologists
may want to think twice before exploring more complicated methods, or inventing entirely new
ones.
The set of tutorials presented in this paper did not cover all aspects of statistical provenance
analysis. Doing so would fill a book rather than a paper. Some additional topics for such a book
could include (1) supervised and unsupervised learning algorithms such as cluster analysis and
discriminant analysis, which can group samples into formal groups (Bhatia, 1983; Bhatia and
Crook, 1986; Armstrong-Altrin and Verma, 2005; Tolosana-Delgado et al., 2018); (2) the physical and
chemical processes that affect the composition of sediment from ‘source to sink’ (Allen, 2008; Weltje and von
Eynatten, 2004; Weltje, 2012; Garzanti et al., 2018); and (3) quality checks and corrections that must be
made to ensure that the data reveal meaningful provenance trends rather than sampling effects (Garzanti and
Andò, 2007; Garzanti et al., 2009; Resentini et al., 2013; Malusà et al., 2013; Malusà and
Garzanti, 2019).
The paper introduced three distinct classes of provenance data. Compositional, point-counting and
distributional data each require different statistical treatment. Multi-sample collections of these data can be
visualised by Multidimensional Scaling, using different dissimilarity measures (Table 1). Distributional data
can be compared using the Kolmogorov-Smirnov statistic or related dissimilarity measures, and plugged
straight into an MDS algorithm for further inspection. Compositional data such as chemical concentrations
can be visualised by conventional ‘normal’ statistics after logratio transformation. The Euclidean distance in
logratio space is called the Aitchison distance in compositional data space. Classical MDS using this distance
is equivalent to Principal Component Analysis. Finally, point-counting data combine aspects of compositional
data analysis with multinomial sampling statistics. The Chi-square distance is the natural way to
quantify the dissimilarity between multiple point-counting samples. MDS analysis using the
Chi-square distance is equivalent to Correspondence Analysis, which is akin to PCA for categorical
data.
However, there are some provenance proxies that do not easily fit into these three categories. Varietal studies using the chemical composition of single grains of heavy minerals combine aspects of compositional and distributional data (Morton, 1991; Tolosana-Delgado et al., 2018). Similarly, paired U-Pb ages and Hf-isotope compositions in zircon (Gerdes and Zeh, 2006) do not easily fit inside the distributional data class described above. Using the tools provided by the provenance package, such data can be processed by procustes analysis or 3-way MDS (Section 11). Thus, U-Pb and ϵ(Hf)-distributions, say, could be entered into the indscal function as separate entities. However by doing so the single-grain link between the two datasets would be lost. Alternative approaches may be pursued to address this issue, and new dissimilarity measures could be developed for this hybrid data type. Matrix decomposition may be a way forward in this direction (Paatero and Tapper, 1994; Bloemsma et al., 2012; Martinez et al., 2017).
data type | dissimilarity measure | ordination technique |
compositional | Aitchison | Principal Component Analysis |
point-counting | Chi-square | Correspondence Analysis |
distributional | Kolmogorov-Smirnov | Multidimensional Scaling |
R is an increasingly popular programming language for scientific data processing. It is similar in scope and purpose to Matlab but is available free of charge on any operating system at http://r-project.org. A number of different graphical user interfaces (GUIs) are available for R, the most popular of which are RGui, RStudio, RCommander and Tinn-R. For this tutorial, however, the simple command line console suffices.
> 1 + 1
> sqrt(2)
> exp(log(10))
> 13%%5
where the ‘>’ symbol marks the command prompt.
> foo <- 2
> 4 -> bar
> foo <- foo*bar
> myvec <- c(2,4,6,8)
> myvec*2
Query the third value of the vector:
> myvec[3]
Change the third value of the vector:
> myvec[3] <- 100
Change the second and the third value of the vector:
> myvec[c(2,3)] <- c(100,101)
Create a vector of 1, 2, 3, ..., 10:
> seq(from=1,to=10,by=1)
Equivalently:
> seq(1,10,1)
> seq(1,10)
> seq(to=10,by=1,from=1)
> seq(to=10)
> 1:10
Create a 10-element vector of twos:
> rep(2,10)
> mymat <- matrix(1,nrow=2,ncol=4)
Change the third value in the first column of mymat to 3:
> mymat[1,3] <- 3
Change the entire second column of mymat to 2:
> mymat[,2] <- 2
The transpose of mymat:
> t(mymat)
Element-wise multiplication (*) vs. matrix multiplication (%*%):
> mymat * mymat
> mymat %*% t(mymat)
> mylist <- list(v=myvec, m=mymat, nine=9)
> mylist$v
> plot(mymat[1,],mymat[2,],type='p')
Draw lines between the points shown on the existing plot:
> lines(mymat[1,],mymat[2,])
Create a new plot with red lines but no points:
> plot(mymat[1,],mymat[2,],type='l',col='red')
Use a 1:1 aspect ratio for the X- and Y-axis:
> plot(mymat[1,],mymat[2,],type='l',col='red',asp=1)
> dev.copy2pdf(file="trigonometry.pdf")
> help(c)
> ?plot
> cube <- function(n){
> return(n^3)
> }
Using the newly created function:
> cube(2)
> result <- cube(3)
> rand.num <- runif(100)
> hist(rand.num)
> ls()
Remove all the variables in the current workspace:
> rm(list=ls())
To get and set the working directory:
> getwd()
> setwd("/path/to/a/valid/directory")
# the 'print' function is needed to show intermediate
# results when running commands from an .R file
print(pi)
This code can be run by going back to the command prompt (hence the ‘>’ in the next box) and typing:
> source("myscript.R")
This should result in the number π being printed to the console. Note that everything that follows the ‘#’-symbol was ignored by R.
toss <- function(){
if (runif(1)>0.5){
print("head")
} else {
print("tail")
}
}
Save and run at the command prompt:
> source('myscript.R')
> toss()
fibonnaci <- function(n){
if (n < 3) { stop('n must be at least 3') }
# seed the output vector with 0 and 1:
s <- c(0,1)
# loop through all numbers from 3 to n:
for (i in 3:n){
s[i] <- s[i-1] + s[i-2]
}
return(s)
}
Save and run at the command prompt to calculate the first 20 numbers in the Fibonnaci series:
> source('myscript.R')
> fibonnaci(20)
> install.packages('compositions')
Use the newly installed package to plot the built-in SkyeAFM dataset, which contains the Al2O3 – FeO – MgO compositions of 23 aphyric lavas from the isle of Skye.
library(compositions) # load the package into memory
dat <- data(SkyeAFM) # load the Skye lava dataset
AFMcomp <- acomp(dat) # enforce the constant sum constraint
plot(AFMcomp) # plot as a ternary diagram
Note that the plot() function has been overloaded for compositional data.
This paper evolved from a set of lecture notes for an iCRAG workshop in sedimentary provenance analysis at NUI Galway. The author would like to thank Sergio Andò for inviting him to contribute to this Special Issue. The manuscript greatly benefited from three critical but constructive reviews.
Aitchison, J. Principal component analysis of compositional data. Biometrika, 70(1):57–65, 1983. doi: 10.1093/biomet/70.1.57.
Aitchison, J. The statistical analysis of compositional data. London, Chapman and Hall, 1986.
Aitchison, J. and Brown, J. A. The lognormal distribution. Cambridge Univ. Press, 1957. ISBN 0521040116.
Allen, P. A. From landscapes into geological history. Nature, 451(7176):274, 2008.
Armstrong-Altrin, J. and Verma, S. P. Critical evaluation of six tectonic setting discrimination diagrams using geochemical data of Neogene sediments from known tectonic settings. Sedimentary Geology, 177(1-2):115–129, 2005.
Bhatia, M. R. Plate tectonics and geochemical composition of sandstones. The Journal of Geology, 91(6):611–627, 1983.
Bhatia, M. R. and Crook, K. A. Trace element characteristics of graywackes and tectonic setting discrimination of sedimentary basins. Contributions to mineralogy and petrology, 92(2):181–193, 1986.
Bloemsma, M. R. and Weltje, G. J. Reduced-rank approximations to spectroscopic and compositional data: A universal framework based on log-ratios and counting statistics. Chemometrics and Intelligent Laboratory Systems, 142:206–218, 2015.
Bloemsma, M., Zabel, M., Stuut, J., Tjallingii, R., Collins, J., and Weltje, G. J. Modelling the joint variability of grain size and chemical composition in sediments. Sedimentary Geology, 280: 135–148, 2012.
Botev, Z. I., Grotowski, J. F., and Kroese, D. P. Kernel density estimation via diffusion. Annals of Statistics, 38:2916–2957, 2010.
Carroll, J. D. and Chang, J.-J. Analysis of individual differences in multidimensional scaling via an N-way generalization of ‘Eckart-Young’ decomposition. Psychometrika, 35(3):283–319, 1970.
Chayes, F. On ratio correlation in petrography. The Journal of Geology, 57(3):239–254, 1949.
Chayes, F. On correlation between variables of constant sum. Journal of Geophysical research, 65 (12):4185–4193, 1960.
Cox, T. F. and Cox, M. A. Multidimensional scaling. CRC Press, 2000.
de Leeuw, J. and Mair, P. Multidimensional scaling using majorization: The R package smacof. Journal of Statistical Software, 31(3):1–30, 2009. URL http://www.jstatsoft.org/v31/i03/.
DeGraaff-Surpless, K., Mahoney, J., Wooden, J., and McWilliams, M. Lithofacies control in detrital zircon provenance studies: Insights from the Cretaceous Methow basin, southern Canadian Cordillera. Geological Society of America Bulletin, 115:899–915, 2003. doi: 10.1130/B25267.1.
DuToit, S. H., Steyn, A. G. W., and Stumpf, R. H. Graphical exploratory data analysis. Springer Science & Business Media, 1986.
Fedo, C., Sircombe, K., and Rainbird, R. Detrital zircon analysis of the sedimentary record. Reviews in mineralogy and geochemistry, 53:277–303, 2003.
Feller, W. On the Kolmogorov-Smirnov limit theorems for empirical distributions. The Annals of Mathematical Statistics, 19:177–189, 1948.
Galbraith, R. F. The radial plot: graphical assessment of spread in ages. Nuclear Tracks and Radiation Measurements, 17:207–214, 1990.
Galbraith, R. Graphical display of estimates having differing standard errors. Technometrics, 30 (3):271–281, 1988.
Garzanti, E. and Andò, S. Heavy-mineral concentration in modern sands: implications for provenance interpretation. In Mange, M. and Wright, D., editors, Heavy Minerals in Use, Developments in Sedimentology Series 58, pages 517–545. Elsevier, Amsterdam, 2007.
Garzanti, E., Andò, S., and Vezzoli, G. Grain-size dependence of sediment composition and environmental bias in provenance studies. Earth and Planetary Science Letters, 277:422–432, 2009. doi: 10.1016/j.epsl.2008.11.007.
Garzanti, E. Petrographic classification of sand and sandstone. Earth-Science Reviews, 2019. doi: 10.1016/j.earscirev.2018.12.014.
Garzanti, E., Dinis, P., Vermeesch, P., Andò, S., Hahn, A., Huvi, J., Limonta, M., Padoan, M., Resentini, A., Rittner, M., and Vezzoli, G. Sedimentary processes controlling ultralong cells of littoral transport: Placer formation and termination of the Orange sand highway in southern Angola. Sedimentology, 65(2):431–460, 2018.
Gehrels, G. Detrital zircon U-Pb geochronology: Current methods and new opportunities. In Busby, C. and Azor, A., editors, Tectonics of sedimentary basins: Recent advances, chapter 2, pages 45–62. Wiley Online Library, 2011.
Gerdes, A. and Zeh, A. Combined U–Pb and Hf isotope LA-(MC-) ICP-MS analyses of detrital zircons: comparison with SHRIMP and new constraints for the provenance and age of an Armorican metasediment in Central Germany. Earth and Planetary Science Letters, 249(1-2):47–61, 2006.
Gower, J. C. Generalized procrustes analysis. Psychometrika, 40(1):33–51, 1975.
Greenacre, M. J. Theory and applications of correspondence analysis. Academic Press, 1984.
Kenkel, N. On selecting an appropriate multivariate analysis. Canadian Journal of Plant Science, 86(3):663–676, 2006.
Kenkel, N. C. and Orlóci, L. Applying metric and nonmetric multidimensional scaling to ecological studies: some new results. Ecology, pages 919–928, 1986.
Kruskal, J. B. and Wish, M. Multidimensional scaling, volume 07-011 of Sage University Paper series on Quantitative Application in the Social Sciences. Sage Publications, Beverly Hills and London, 1978.
Le Maitre, R. W., Streckeisen, A., Zanettin, B., Le Bas, M., Bonin, B., and Bateman, P. Igneous rocks: a classification and glossary of terms: recommendations of the International Union of Geological Sciences Subcommission on the Systematics of Igneous Rocks. Cambridge University Press, 2002. ISBN 9780511535581.
Malusà, M. G. and Garzanti, E. The sedimentology of detrital thermochronology. In Fission-Track Thermochronology and its Application to Geology, pages 123–143. Springer, 2019.
Malusà, M. G., Carter, A., Limoncelli, M., Villa, I. M., and Garzanti, E. Bias in detrital zircon geochronology and thermochronometry. Chemical Geology, 359:90–107, 2013.
Malusà, M. G., Resentini, A., and Garzanti, E. Hydraulic sorting and mineral fertility bias in detrital geochronology. Gondwana Research, 31:1–19, 2016.
Martín-Fernández, J. A., Barceló-Vidal, C., and Pawlowsky-Glahn, V. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology, 35(3):253–278, 2003.
Martinez, W. L., Martinez, A. R., and Solka, J. Exploratory data analysis with MATLAB. Chapman and Hall/CRC, 2017. ISBN 9781498776066.
Mazumder, R. Sediment provenance. In Mazumder, R., editor, Sediment Provenance: Influence on Compositional Change From Source to Sink, pages 1–4. Elsevier, 2017.
Morton, A. C. Geochemical studies of detrital heavy minerals and their application to provenance research. In Morton, A., Todd, S., and Haughton, P. D. W., editors, Developments in Sedimentary Provenance Studies, volume 57, pages 31–45. Geological Society of London, 1991.
Morton, A. C. and Hallsworth, C. R. Processes controlling the composition of heavy mineral assemblages in sandstones. Sedimentary Geology, 124(1-4):3–29, 1999.
Nesbitt, H. and Young, G. M. Formation and diagenesis of weathering profiles. The Journal of Geology, 97(2):129–147, 1989.
Paatero, P. and Tapper, U. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.
Pawlowsky-Glahn, V., Egozcue, J. J., and Tolosana-Delgado, R. Modeling and analysis of compositional data. John Wiley & Sons, 2015.
Pearson, K. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
Resentini, A., Malusà, M. G., and Garzanti, E. MinSORTING: An Excel® worksheet for modelling mineral grain-size distribution in sediments, with application to detrital geochronology and provenance studies. Computers & Geosciences, 59:90–97, 2013.
Ripley, B. Modern applied statistics with S. Statistics and Computing, fourth ed. Springer, New York, 2002.
Rittner, M., Vermeesch, P., Carter, A., Bird, A., Stevens, T., Garzanti, E., Andò, S., Vezzoli, G., Dutt, R., Xu, Z., and Lu, H. The provenance of Taklamakan desert sand. Earth and Planetary Science Letters, 437:127 – 137, 2016. ISSN 0012-821X.
Shepard, R. N. The analysis of proximities: multidimensional scaling with an unknown distance function. i. Psychometrika, 27(2):125–140, 1962.
Silverman, B. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 1986.
Stephan, T., Kroner, U., and Romer, R. L. The pre-orogenic detrital zircon record of the peri-gondwanan crust. Geological Magazine, pages 1–27, 2018.
Tolosana-Delgado, R., von Eynatten, H., Krippner, A., and Meinhold, G. A multivariate discrimination scheme of detrital garnet chemistry for use in sedimentary provenance analysis. Sedimentary Geology, 375:14–26, 2018.
Torgerson, W. S. Multidimensional scaling: I. Theory and method. Psychometrika, 17(4):401–419, 1952.
Tukey, J. W. Exploratory data analysis, volume 2. Addison-Wesley, 1977. ISBN 978-0-201-07616-5.
Van den Boogaart, K. G. and Tolosana-Delgado, R. “Compositions”: a unified R package to analyze compositional data. Computers & Geosciences, 34(4):320–338, 2008.
Van den Boogaart, K. G. and Tolosana-Delgado, R. Analyzing compositional data with R, volume 122 of UseR! Springer, 2013.
Van der Plas, L. and Tobi, A. A chart for judging the reliability of point counting results. American Journal of Science, 263(1):87–90, 1965.
Vermeesch, P. Quantitative geomorphology of the White Mountains (California) using detrital apatite fission track thermochronology. Journal of Geophysical Research (Earth Surface), 112(F11): 3004, 2007. doi: 10.1029/2006JF000671.
Vermeesch, P. On the visualisation of detrital age distributions. Chemical Geology, 312-313: 190–194, 2012. doi: 10.1016/j.chemgeo.2012.04.021.
Vermeesch, P. Multi-sample comparison of detrital age distributions. Chemical Geology, 341: 140–146, 2013.
Vermeesch, P. Dissimilarity measures in detrital geochronology. Earth-Science Reviews, 178: 310–321, 2018a. doi: 10.1016/j.earscirev.2017.11.027.
Vermeesch, P. Statistical models for point-counting data. Earth and Planetary Science Letters, 501:1–7, 2018b.
Vermeesch, P. and Garzanti, E. Making geological sense of ‘Big Data’ in sedimentary provenance analysis. Chemical Geology, 409:20–27, 2015.
Vermeesch, P., Resentini, A., and Garzanti, E. An R package for statistical provenance analysis. Sedimentary Geology, 2016.
Weltje, G. J. Quantitative models of sediment generation and provenance: state of the art and future developments. Sedimentary Geology, 280:4–20, 2012.
Weltje, G. J. and von Eynatten, H. Quantitative provenance analysis of sediments: review and outlook. Sedimentary Geology, 171(1-4):1–11, 2004.
Weltje, G. Quantitative analysis of detrital modes: statistically rigorous confidence regions in ternary diagrams and their use in sedimentary petrology. Earth-Science Reviews, 57(3-4):211 – 253, 2002. ISSN 0012-8252. doi: DOI: 10.1016/S0012-8252(01)00076-9.
Young, G. and Householder, A. S. Discussion of a set of points in terms of their mutual distances. Psychometrika, 3(1):19–22, 1938.