Introduction

Table 1. Usage (%) of multivariate methods in different fields. Exploratory analysis: Cluster, PCA, MDS, PCoA. Hypothesis-driven analysis: CCA, RDA, manova, Mantel, anosim, CVA.

| Keyword† | Cluster | PCA | MDS | PCoA | CCA | RDA | manova | Mantel | anosim | CVA | ‡ |
| * | 48.5 | 38 | 4.5 | 0.4 | 3.2 | 1.8 | 1.3 | 0.4 | 0.9 | 1.1 | 1141 |
| * | 45.8 | 40.2 | 3.9 | 1.1 | 2.2 | 2.2 | 1.1 | 1.7 | 0.6 | 1.1 | 179 |
| * | 40.3 | 28.5 | 4.6 | 1.7 | 15.5 | 3.7 | 1.9 | 2.3 | 0.6 | 0.9 | 3335 |
| * | 54 | 27.2 | 2.8 | 1.1 | 8.5 | 2.8 | 0.9 | 1.1 | 0.2 | 1.4 | 563 |
| * | 30.1 | 33.7 | 9.8 | 0.3 | 13.5 | 2.7 | 3.6 | 2.9 | 2.3 | 1.2 | 1464 |
| * | 41 | 20.5 | 5.4 | 0.7 | 21.2 | 3.5 | 2.1 | 4.2 | 0.5 | 0.9 | 429 |
| * | 54.3 | 13.7 | 6.1 | 0.8 | 11.5 | 4.4 | 3.5 | 3 | 1.1 | 1.7 | 637 |

A literature search was performed on December 13, 2006 with the Thomson ISI research tool, in the titles and abstracts of the articles only, using the following parameters: Doc type, all document types; language, all languages; databases, SCI-EXPANDED, SSCI, A&HCI; Timespan, 1900–2006.
† Asterisks were placed at the end of each keyword to accommodate variations. Each keyword was additionally combined with the following technical designations: cluster, cluster analysis; PCA, principal component analysis; MDS, multidimensional scaling; PCoA, principal coordinate analysis; CCA, canonical correspondence analysis; RDA, redundancy analysis; Mantel, Mantel test; or CVA, canonical variate analysis.

In the first part, data type and preparation are reviewed as a necessary basis for subsequent multivariate analyses. Second, common multivariate methods (i.e.
cluster analysis, principal component analysis, correspondence analysis, multidimensional scaling) and a few statistical methods to test for significant differences between groups or clusters are described, focusing on the methods' main objectives, applications, and limitations. Beyond the mere identification of diversity patterns, microbial ecologists may wish to correlate or explain those patterns using measured environmental parameters, and this approach is addressed in the third part. Special emphasis is placed on a few methods that have proven useful in ecological studies, namely redundancy analysis, canonical correspondence analysis (CCA), linear discriminant analysis, and variation partitioning. The final part provides practical considerations to help researchers avoid pitfalls and choose the most appropriate methods.

Data types and data preparation

Data sets

Data transformations

In multivariate data tables, measured variables can be binary, quantitative, qualitative, rank-ordered, classes, frequencies, or even a mixture of those types. If variables do not have a uniform scale (e.g. environmental parameters measured in different units or scales) or an adequate format, they have to be transformed before further analyses are performed. Each qualitative variable has to be recoded as a set of numerical variables that replace it in the numerical calculations. One way to do so is to create a series of ‘dummy’ variables that correspond to all the states of the qualitative variable. For instance, to recode the variable ‘season’, four binary variables are constructed, one per season: each object receives the value 1 for the season that applies to it and 0 for the three other seasons. Many statistical packages perform this recoding automatically.

[Eqs (1) and (2); Legendre & Gallagher, 2001]

Exploratory analyses

Visualization and exploration of complex data sets

The basic aim of ordination and cluster analysis is to represent the (dis)similarity between objects (e.g. samples, sites) based on values of multiple variables (columns) associated with them, so that similar objects are depicted near each other and dissimilar objects are found further apart. Exploratory multivariate analyses are thus useful to reveal patterns in large data sets, but they do not directly explain why those patterns exist. This latter point is addressed in the third part of the review.

Cluster analysis and association coefficients

Principal component analysis (PCA)

Table 2. Interpretation of ordination diagrams

Linear methods (PCA, RDA). Samples and species: PCA, RDA. ENV and NENV: RDA. Scaling 1: focus on sample (rows) distance. Scaling 2: focus on species (columns) correlation.
Samples: scaling 1, Euclidean distances among samples; scaling 2, -
Species: scaling 1, -; scaling 2, linear correlations among species
ENV: scaling 1, marginal effects of ENV on ordination scores; scaling 2, correlations among ENV
NENV: scaling 1, Euclidean distance between sample classes; scaling 2, -
Samples and species: abundance values in species data
Samples and ENV: scaling 1, -; scaling 2, values of ENV in the samples
Samples and NENV: membership of samples in the classes
Species and ENV: linear correlations between species and ENV
Species and NENV: mean species abundance within classes of nominal ENV
ENV and NENV: scaling 1, -; scaling 2, average of ENV within classes

Unimodal methods (CA, CCA). Samples and species: CA, CCA. ENV and NENV: CCA. Scaling 1: focus on sample (rows) distance and Hill's scaling. Scaling 2: focus on species (columns) distances.
Samples: scaling 1, turnover distances among samples; scaling 2, χ2 distances
Species: scaling 1, -; scaling 2, χ2 distances
ENV: scaling 1, marginal effects of ENV; scaling 2, correlations among ENV
NENV: scaling 1, turnover distances between sample classes; scaling 2, χ2 distances
Samples and species: scaling 1, relative abundances of the species table; scaling 2, relative abundances of the species table
Samples and ENV: scaling 1, -; scaling 2, values of ENV in the samples
Samples and NENV: membership of samples in the classes
Species and ENV: weighted averages (the species optima with respect to particular ENV)
Species and NENV: relative total abundances in the sample classes
ENV and NENV: scaling 1, -; scaling 2, ENV averages within sample classes

Sources: ter Braak (1994); Leps & Smilauer (1999); ter Braak & Smilauer (2002).

Principal coordinate analysis (PCoA)

Objects are represented as points in the ordination space. Eigenvalues are again used to measure how much variance is accounted for by each PCoA synthetic axis. Although there is no direct, linear relationship between the components and the original variables, it is still possible to correlate object scores on the main axis (or axes) with the original variables to assess their contribution to the ordination.

Correspondence analysis (CA)
Nonmetric multidimensional scaling (NMDS)

In NMDS ordination, the proximity between objects corresponds to their similarity, but the ordination distances do not correspond to the original distances among objects. Because NMDS preserves only the rank order of the dissimilarities among objects, NMDS ordination axes can be freely rescaled, rotated, or inverted, as needed for better visualization or interpretation. Because of its iterative procedure, NMDS is more computationally intensive than eigenanalyses such as PCoA, PCA, or CA. However, constant improvement in computing power makes this limitation less of a problem for small- to medium-sized matrices.

Testing for significant differences between groups

Environmental interpretation

Exploratory analyses may reveal the existence of clusters or groups of objects in a data set. When a supplementary table or matrix of environmental variables is available for those objects, it is then possible to examine whether the observed patterns are related to environmental gradients. Typical objectives may be, for instance, to reveal a relationship between community structure and habitat heterogeneity or between community structure and spatial distance, or to identify the main variables affecting bacterial communities when a large set of environmental variables has been measured on the same samples.
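As an illustration of relating an exploratory pattern to an environmental gradient, the following is a minimal sketch rather than a prescribed procedure. It assumes a PCA computed by singular value decomposition on a simulated sample-by-species table and a single hypothetical environmental variable (pH). The function name pca_scores, the simulated data, and all parameter values are assumptions made for the example.

```python
import numpy as np

def pca_scores(table, n_axes=2):
    """Return sample scores and explained-variance ratios of a PCA computed by SVD."""
    centered = table - table.mean(axis=0)            # center each species column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_axes] * s[:n_axes]              # sample coordinates on the axes
    explained = (s ** 2) / np.sum(s ** 2)            # share of variance per axis
    return scores, explained[:n_axes]

# Simulated data (assumption): 30 samples, 8 species responding linearly to pH.
rng = np.random.default_rng(0)
n_samples, n_species = 30, 8
ph = np.linspace(4.0, 8.0, n_samples)                # hypothetical gradient
slopes = rng.uniform(-2.0, 2.0, n_species)           # one linear response per species
abundances = np.outer(ph, slopes) + rng.normal(0.0, 0.3, (n_samples, n_species))

scores, explained = pca_scores(abundances)
# Correlate sample scores on the first axis with the measured variable.
r = np.corrcoef(scores[:, 0], ph)[0, 1]
print(f"axis 1 explains {explained[0]:.0%} of the variance; r with pH = {r:.2f}")
```

The sign of r is arbitrary because ordination axes can be inverted; only its magnitude is informative here. A strong correlation in such a sketch describes an association and does not by itself test a hypothesis.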
Indirect gradient analyses

Direct gradient analyses (constrained analyses)

Redundancy analysis (RDA)

Canonical correspondence analysis (CCA)

Partial ordination, variation partitioning

Fig. 3 Partitioning biological variation into the effects of two factors. The large rectangle represents the total variation in the biological data table, which is partitioned between two sets of explanatory variables (a, b). Fraction 4 shows the unexplained part of the biological variation. Fractions 1 and 3 are obtained by partial constrained ordination or partial regression, and can be tested for significance. For instance, fraction 1 corresponds to the amount of biological variation that can be exclusively explained by (a) effects when (b) effects are taken into consideration (i.e., when b is considered as a covariable). Fraction 2 [i.e., variation attributed indifferently to (a) and (b), or a covariation of (a) and (b)] is obtained by subtracting fractions 1 and 3 from the total explained variance, and cannot be tested for statistical significance.

Linear discriminant analysis (LDA)

Selection of variables in regression models

In forward selection, the construction of the regression model starts with the variable that explains the most variation in the dependent variables (generally the species table).
The biological variation that remains unexplained after fitting the first environmental variable (i.e. the residual variation) is then used to choose the second environmental variable. Selection continues until no remaining variable significantly explains the residual variation. In backward elimination, the construction of the regression model starts with all environmental variables, and the least significant ones are excluded from the model one at a time until only ‘significant’ variables remain. Stepwise regression combines the two approaches: it performs a forward selection but excludes variables that are no longer significant after new variables have been introduced into the regression model.

Mantel test

Practical considerations

Fig. 4 Relationships between numerical methods. Exploratory tools such as PCA, CA, PCoA, NMDS, or cluster analysis can be applied to a sample-by-species table to extract the main patterns of variation and to identify groups or clusters of samples or specific species interactions. Sample scores on the main axes of variation can be related to variation in environmental variables using indirect gradient analyses. When a constrained analysis is desired (i.e. direct gradient analysis), RDA, db-RDA, CCA, or linear discriminant analysis can be used as extensions of the unconstrained methods. Mantel tests are appropriate to test the significance of the correlation between two distance matrices (e.g. one based on species data and the other on environmental variables).
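The Mantel test mentioned above can be sketched as a permutation procedure. This is a minimal illustration under stated assumptions, not a reference implementation: it uses the Pearson correlation between the upper triangles of two Euclidean distance matrices, a one-tailed test, and simulated species and environmental tables. The function names mantel and euclidean and all parameter values are hypothetical.

```python
import numpy as np

def euclidean(x):
    """Pairwise Euclidean distances between the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mantel(d1, d2, n_perm=999, seed=0):
    """One-tailed permutation Mantel test between two symmetric distance matrices."""
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    iu = np.triu_indices_from(d1, k=1)               # upper triangle, diagonal excluded
    r_obs = np.corrcoef(d1[iu], d2[iu])[0, 1]
    rng = np.random.default_rng(seed)
    n = d1.shape[0]
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        d1_perm = d1[np.ix_(p, p)]                   # permute rows and columns together
        if np.corrcoef(d1_perm[iu], d2[iu])[0, 1] >= r_obs:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)              # observed value counts as one permutation
    return r_obs, p_value

# Simulated tables (assumption): species abundances driven by three environmental variables.
rng = np.random.default_rng(1)
env = rng.normal(size=(20, 3))
species = env @ rng.normal(size=(3, 6)) + rng.normal(scale=0.2, size=(20, 6))

r, p = mantel(euclidean(species), euclidean(env))
print(f"Mantel r = {r:.2f}, p = {p:.3f}")
```

Objects, not individual distances, are permuted, because the distances within a matrix are not independent of one another. With 999 permutations the smallest attainable p value is 0.001.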
Raw data may be transformed, normalized, or standardized as appropriate before analysis.

Data type is another important criterion. To represent absolute abundance values, linear-based methods (PCA, RDA), which produce weighted summations, are appropriate, whereas unimodal techniques (CCA, CA) are used instead to model relative abundances (because species scores are weighted averages of the sample scores, and vice versa), i.e. they model the dissimilarities between samples (β diversity). Unimodal techniques also accommodate the presence of many zeros in the species table well, in contrast to linear-based methods, for which double zeros lead to inadequate estimates of sample distances.

If one assumes that species do not have a linear response to environmental gradients, NMDS is more appropriate than PCA. CA may also be an alternative to PCA when many zeros populate the data set and one strong gradient is present. With long ecological gradients, however, CA may produce the arch effect, which can be corrected for using DCA. In terms of the underlying species model, the main difference between DCA and NMDS is that the former is based on a specific model of species distributions (unimodal model), whereas NMDS is not. Thus, DCA may be favored by ecologists who assume that niche theory fits their data set better, while NMDS may be a method of choice if species composition is determined by factors other than position along a gradient (for instance, if the habitat is known to be fragmented).

Ordination and diversity indices

Misconceptions about multivariate analyses

Another common misconception is that multivariate analyses alone can provide all the answers in complex multivariate studies.
Although exploratory analyses may help reveal interesting patterns in data sets, the interpretation and explanation of the observations ultimately rely on the researcher's hypotheses and previous knowledge of the ecological situation. Microbial ecologists themselves need to formulate ecologically sound hypotheses and test them.

Conclusions

Exciting questions in Ecology typically consist of determining whether community patterns are structured across space or time, of explaining how those patterns can be related to environmental heterogeneity, and of quantifying how much remains unexplained when all significant, measured variables have been considered. Such questions can now start to be addressed in microbial ecology because numerical tools can help explore and test such ecological hypotheses. These are indeed exciting times because ever larger and more complex databases are being created and, in parallel, computing power is gradually becoming less of an issue. If microbial ecologists want to test numerical methods, develop new ecological theories, or validate existing ones for the microbial case, access to diversity data and, above all, to the relevant associated environmental parameters becomes a central issue. It would thus be of great interest to make such complex data sets publicly available, for instance in microbial ecological databases, so that microbial diversity can be studied in its environmental context. This would indeed be a step toward making microbial ecology a central discipline in Ecology.