Introduction

The aim of this work is to investigate whether biofilms grown in different aquatic systems can be discriminated on the basis of their chemical composition. If such discrimination is possible, the next question is which of the measured parameters are responsible for it.

Experimental

Description of the sampling procedure

Table 1 Description of the biofilm and water samples collected

| Group of biofilms | Subgroup of biofilms | Character of the water phase | Number of samples |
| --- | --- | --- | --- |
| Systematically sampled biofilms | Biofilms grown on polycarbonate plates | f: the Saale river | 11 |
| | | s: the Teich pond | 22 |
| | Biofilms grown on natural substrates | f: the Leutra river | 13 |
| Uniquely sampled biofilms | Biofilms grown on natural substrates | f: Celle (a, b), Lauscha (a, b, c), Oberpöllnitz, Falken, London, Munich, New York, Geithain, Steinach, Juquitiba | 12 |
| | | s: Chemnitz, New Hampshire, Bossow, Metebach, Erfurt, Rippachtal | 6 |
| | | m: Travemünde (a, b), Punta Skala, Nin, Majorca, Damp, Steinbeck | 7 |

f flowing water, s standing water, m seawater

A small pond of 0.75-m depth, located in the city of Jena, was selected as a typical example of a standing water body. The sampling campaign there lasted as long as the sampling campaign in the Saale river. At these two locations, the biofilms were grown artificially on polycarbonate plates (10 cm × 10 cm) exposed vertically to the water (in the Saale river, in the direction of flow). The plates were fixed in polypropylene boxes, approximately 10 cm under the water surface and 1 m away from the riverbank. After a defined exposure time, the biofilm samples were immediately transferred into plastic boxes filled with the river or pond water and transported to the laboratory. The plates were then washed with bidistilled water, and the biofilm was scraped off the whole surface of each polycarbonate plate using a Teflon spatula. The river and pond water samples were collected every 2 weeks.
The Leutra river, located in the city of Jena, was chosen as the second example of a flowing water body. Its stony bed, small depth and good accessibility facilitated the sampling campaign. The biofilm samples from the Leutra river were scraped off the riverbed stones using a plastic spatula, placed into polyethylene bottles and transported to the laboratory. The sampling campaign at the Leutra river was held in autumn 2005 and in spring 2006. Water samples were collected as well.

For the uniquely sampled biofilms and the corresponding water samples, the sampling procedure was the same as that used at the Leutra river. The locations of the sampling sites were selected according to the availability of a suitable sampling device. The samples collected were also placed into polyethylene bottles and transported to the laboratory.

Analytical procedure

The biofilm samples collected were air-dried at 105 °C. Portions of 10–50 mg of biofilm powder were then dissolved in 3 ml of 70–72% perchloric acid and heated for 3 h at 50 °C. The remaining dry matter of each sample was further dissolved in water, and the resulting solution was made up to 5 ml. A Fisons Instruments (Beverly, MA, USA) Maxim 112 inductively coupled plasma optical emission spectrometer (ICP-OES) was used to analyse Al, Ca, Fe, K, Mg, Mn, Na and Sr, while Cd, Co, Cr, Cu, Ni, P, Se and Zn were determined with a PerkinElmer (Wellesley, MA, USA) Elan 6000 inductively coupled plasma mass spectrometer (ICP-MS). An external aqueous calibration was adopted for the analysis by ICP-OES, while a standard addition procedure was used for the element analysis by ICP-MS. All contents refer to the sample dry weight. The trueness of the measurements was tested by analysing a certified reference algae material. The element contents were certified for an aqua regia digestion. Additionally, the element contents were determined after microwave digestion with nitric acid.
No differences between these two digestion methods were observed. All measurements were done in triplicate, and the relative standard deviation of the technique was 10–15% for all the biofilms, indicating good repeatability of the measurements.

Theory

Classification and regression trees

Owing to the binary data splitting, the results of CART can easily be visualised as a binary tree, which consists of a number of nodes symbolising subgroups of data objects.

Discriminant partial least squares

Uninformative variable elimination–discriminant partial least squares

The goodness of a discrimination model is characterised by the percentage of correct classification, the so-called correct classification rate (CCR). It is commonly agreed that the higher the correct classification rate, the better the model. Additionally, one should consider the sensitivity and selectivity of the model. For a two-class problem, for instance, sensitivity is defined as the percentage of correctly classified samples of class A, while selectivity is the percentage of correctly classified samples of class B.

Results and discussion

Fig. 1 (panels a–e)

In order to see whether the biofilms developed in standing water could be distinguished from the biofilms grown in flowing water, supervised approaches such as CART, DPLS and UVE-DPLS were applied. Furthermore, it was important to determine whether the models constructed could predict the origin of new biofilm samples, and how well. Another question to be answered was which variables are responsible for a possible discrimination of the groups. Only seven biofilms were grown in seawater; therefore, they were excluded from the subsequent analysis.

Results of CART, DPLS and UVE-DPLS for model and test sets designed with the Kennard and Stone algorithm
Fig. 2 (f flowing water, s standing water)

Table 2 Correct classification rate (CCR), sensitivity and selectivity for flowing water vs. standing water samples, by selection of model and test sets (Kennard and Stone or duplex) and by technique

| Parameter | Kennard and Stone: CART | Kennard and Stone: DPLS | Kennard and Stone: UVE-DPLS | Duplex: CART | Duplex: DPLS | Duplex: UVE-DPLS |
| --- | --- | --- | --- | --- | --- | --- |
| CCR (%) | 100.0 | 81.8 | 90.9 | 86.4 | 86.4 | 86.4 |
| Sensitivity (%) | 100.0 | 73.3 | 86.7 | 100.0 | 100.0 | 100.0 |
| Selectivity (%) | 100.0 | 100.0 | 100.0 | 57.1 | 57.1 | 57.1 |

The best discrimination results are obtained from CART, even though this model shows a misclassification error of 9.5% for the complete tree. Since the splits are done in a univariate way, the correlation between variables is not taken into account. Therefore, CART provides unsatisfactory results when a linear combination of variables is responsible for discriminating the samples. This, however, cannot be verified unless multivariate approaches such as DPLS and UVE-DPLS are used. Although CART and UVE-DPLS have different objective functions, common variables are selected as essential for the discrimination. The primary variable, W-Mg, and two competitive variables, W-Ca and W-Sr, in CART are also selected by UVE-DPLS.

Results of CART, DPLS and UVE-DPLS for model and test sets designed with the duplex algorithm

Results of CART, DPLS and UVE-DPLS were obtained using data designed with the duplex algorithm, which ensures the representativeness of the model and test sets.

Results of CART, DPLS and UVE-DPLS for biofilm samples grown on natural substrates

Fig. 3 (panels a–e)

Fig. 4 (f flowing water, s standing water)

Conclusions

Discrimination between sea biofilms and the remaining standing water and flowing water biofilms is straightforward by investigating the score plots obtained from PCA. The loading plots emphasise the expected higher salt content of the water phases extracted from the sea biofilms as well as their higher levels of Fe and Mg in comparison with the other biofilms. A further discrimination between flowing water and standing water biofilms is possible by means of supervised methods like CART, DPLS and UVE-DPLS.
The best discriminant model is obtained from CART. A single variable, the Mg content in the water phase (W-Mg), is enough to build a model with a misclassification error of 9.5%. All test samples selected by the Kennard and Stone algorithm are correctly classified by the constructed CART model. The DPLS and UVE-DPLS methods do not outperform CART for the data set studied; therefore, it can be concluded that a linear combination of explanatory variables does not lead to a better prediction for new samples. Moreover, CART appears to be a very simple and efficient discriminant technique that leads to a straightforward interpretation of the data in terms of explanatory variables. Hence, CART can be considered as a pilot discriminant approach. When the CART model is not satisfactory, one can apply discriminant methods such as DPLS and UVE-DPLS or, if necessary, a nonlinear multivariate classifier such as support vector machines.

All discriminant models (CART, DPLS and UVE-DPLS) lead to 86.4% correct classification for the test set designed by the duplex algorithm. However, CART uses only one variable (W-Mg), UVE-DPLS selects nine variables and DPLS uses all explanatory variables to build the model.

Discrimination of the uniquely sampled flowing water and standing water biofilms using the CART, DPLS and UVE-DPLS models is done only for a better understanding of the data collected. For a definite conclusion on whether these two groups of samples can be discriminated, more samples are required to properly validate the discriminant models.
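
The correct classification rate, sensitivity and selectivity used throughout to compare the models are simple ratios, and a short sketch may help in reading the reported percentages. The Python function below follows the two-class definitions given in the Theory section (sensitivity for class A, selectivity for class B). The function name and the sample counts are assumptions made for illustration only: the text does not state the composition of the test sets, and the counts were merely chosen so that they reproduce the percentages reported in Table 2.

```python
def classification_metrics(correct_a, total_a, correct_b, total_b):
    """Return (CCR, sensitivity, selectivity) in percent for a two-class test set.

    sensitivity: share of class A samples classified correctly
    selectivity: share of class B samples classified correctly
    CCR:         share of all samples classified correctly
    """
    ccr = 100.0 * (correct_a + correct_b) / (total_a + total_b)
    sensitivity = 100.0 * correct_a / total_a
    selectivity = 100.0 * correct_b / total_b
    return ccr, sensitivity, selectivity


# Hypothetical counts (correct_a, total_a, correct_b, total_b).
# They are NOT taken from the paper; they are only consistent with
# the percentages listed in Table 2.
cases = {
    "Kennard and Stone, DPLS": (11, 15, 7, 7),
    "Kennard and Stone, UVE-DPLS": (13, 15, 7, 7),
    "Duplex, any of the three models": (15, 15, 4, 7),
}

for label, counts in cases.items():
    ccr, sens, sel = classification_metrics(*counts)
    print(f"{label}: CCR={ccr:.1f}%  sensitivity={sens:.1f}%  selectivity={sel:.1f}%")
```

Under these assumed counts, the duplex case illustrates why the CCR should not be read alone: a CCR of 86.4% coexists with a selectivity of only 57.1%, because the larger class dominates the overall rate.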