
This page contains supplemental material for the following paper, submitted to the Data Mining and Knowledge Discovery journal:

VizRank: Data Visualization Guided by Machine Learning
Gregor Leban, Blaž Zupan, Gaj Vidmar and Ivan Bratko


Abstract

Data visualization plays a crucial role in identifying interesting patterns in exploratory data analysis. Its use is, however, made difficult by the large number of possible data projections showing different attribute subsets that must be evaluated by the data analyst. In this paper, we introduce a method called VizRank, which is applied to classified data to automatically select the most useful data projections. VizRank can be used with any visualization method that maps attribute values to points in a two-dimensional visualization space. It assesses possible data projections and ranks them by their ability to visually discriminate between classes. The quality of class separation is estimated by computing the predictive accuracy of a k-nearest neighbor classifier on a data set consisting of the x and y positions of the projected data points and their class information. The paper introduces the method and presents experimental results which show that VizRank's ranking of projections agrees closely with subjective rankings by data analysts. The practical use of VizRank is also demonstrated by an application in the field of functional genomics.
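
For readers who want to experiment, the scoring step can be reproduced along the following lines. This is a minimal sketch that assumes scikit-learn and uses plain cross-validated k-NN accuracy as the projection score; the scoring function and evaluation protocol used in the paper may differ in detail.

    # Minimal sketch of VizRank-style projection scoring (assumes scikit-learn).
    # A projection is evaluated by how well a k-NN classifier predicts the class
    # from the 2-D screen coordinates (x, y) of the projected points alone.
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def score_projection(xy, labels, k=10, folds=10):
        """Estimate class separation in a 2-D projection.

        xy     : (n_samples, 2) array of projected point positions
        labels : (n_samples,)   array of class labels
        """
        knn = KNeighborsClassifier(n_neighbors=k)
        # Cross-validated accuracy stands in for the paper's projection score.
        return cross_val_score(knn, xy, labels, cv=folds).mean()

    def score_scatterplot(X, labels, i, j, **kwargs):
        """Score the scatterplot projection defined by attributes i and j
        of a numeric data matrix X (numpy array)."""
        return score_projection(X[:, [i, j]], labels, **kwargs)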


Data sets

The data sets used in our experiments are:

wine - results of a chemical analysis of three types of wine
voting - United States congressional voting records from 1984
imports-85 - car imports in 1985; the class value was discretized
housing - housing values in the suburbs of Boston
credit - Japanese credit screening
circlet - upper limb motion measured with a haptic interface
yeast - data on the budding yeast Saccharomyces cerevisiae


Experiments

Psychological experiment

Here are the projections used in our psychological experiment, in which we evaluated the agreement between human assessments of projections and assessments made by VizRank.

Here are the selected data projections that were evaluated by the raters, grouped by data set and visualization method:

wine
voting
imports-85
housing
credit
circlet
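
Relating the ratings collected in this experiment to VizRank's scores amounts to comparing two rankings of the same projections. The sketch below uses Spearman rank correlation from SciPy as a simple stand-in for the concordance statistics reported in the paper; the score values shown are made up purely for illustration.

    # Illustrative sketch: agreement between rater scores and VizRank scores
    # for the same set of projections (hypothetical numbers; the paper reports
    # its own concordance statistics).
    from scipy.stats import spearmanr

    vizrank_scores = [0.92, 0.85, 0.77, 0.64, 0.51]   # VizRank projection scores
    rater_scores   = [4.8, 4.1, 4.3, 2.9, 2.2]        # mean ratings by human raters

    rho, p_value = spearmanr(vizrank_scores, rater_scores)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")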

Interesting projections from the case study on the yeast data set

We found several interesting and biologically relevant projections of the yeast data set.
Here we present the 10 best projections obtained with the scatterplot and radviz visualization methods.
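
For reference, radviz places one anchor per attribute on the unit circle and positions each example at the weighted average of the anchors, weighted by its attribute values normalized to [0, 1]. Below is a minimal sketch of this mapping, following the usual radviz conventions rather than code taken from the paper.

    # Minimal sketch of the radviz mapping: each attribute gets an anchor on the
    # unit circle; an example is placed at the weighted average of the anchors,
    # weighted by its attribute values normalized to [0, 1].
    import numpy as np

    def radviz(X):
        """Map an (n_samples, n_attrs) matrix to 2-D radviz coordinates."""
        X = np.asarray(X, dtype=float)
        # Normalize each attribute to [0, 1] so the weights are comparable.
        mins, maxs = X.min(axis=0), X.max(axis=0)
        W = (X - mins) / np.where(maxs > mins, maxs - mins, 1.0)

        n_attrs = X.shape[1]
        angles = 2 * np.pi * np.arange(n_attrs) / n_attrs
        anchors = np.column_stack([np.cos(angles), np.sin(angles)])  # (n_attrs, 2)

        totals = W.sum(axis=1, keepdims=True)
        totals[totals == 0] = 1.0          # avoid division by zero
        return (W @ anchors) / totals      # (n_samples, 2) point positions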

Performance of the search heuristic

We implemented an efficient search heuristic that enables VizRank to evaluate only a small subset of the possible projections in order to find the best ones.
Here are the results of an empirical evaluation of the heuristic on a number of data sets.
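
As an illustration of the general idea behind such a heuristic (the exact procedure is described in the paper), candidate projections can be visited in order of the individual quality of their attributes, so that promising projections are evaluated first and the search can stop after a fixed budget. In the sketch below the attribute measure (mutual information), the ordering rule, and the budget are all assumptions made for illustration.

    # Illustrative sketch of a guided search over attribute pairs: rank the
    # attributes by an individual quality estimate, then evaluate scatterplot
    # projections in decreasing order of the summed attribute scores, within a
    # fixed budget of evaluations.
    from itertools import combinations
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def guided_search(X, labels, budget=100, k=10):
        """Evaluate at most `budget` attribute-pair projections, best-first.
        X is a numeric data matrix (numpy array), labels the class column."""
        # Score attributes individually (mutual information is an assumption).
        attr_quality = mutual_info_classif(X, labels)
        # Order candidate pairs by the combined quality of their attributes.
        pairs = sorted(combinations(range(X.shape[1]), 2),
                       key=lambda p: attr_quality[p[0]] + attr_quality[p[1]],
                       reverse=True)
        knn = KNeighborsClassifier(n_neighbors=k)
        results = []
        for i, j in pairs[:budget]:
            score = cross_val_score(knn, X[:, [i, j]], labels, cv=10).mean()
            results.append(((i, j), score))
        return sorted(results, key=lambda r: r[1], reverse=True)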



