
This page contains supplemental material for the following paper, submitted to the Data Mining and Knowledge Discovery journal:

VizRank: Data Visualization Guided by Machine Learning
Gregor Leban, Blaž Zupan, Gaj Vidmar and Ivan Bratko


Abstract

Data visualization plays a crucial role in identifying interesting patterns in exploratory data analysis. Its use is, however, made difficult by the large number of possible data projections showing different attribute subsets that must be evaluated by the data analyst. In this paper, we introduce a method called VizRank, which is applied to classified data to automatically select the most useful data projections. VizRank can be used with any visualization method that maps attribute values to points in a two-dimensional visualization space. It assesses possible data projections and ranks them by their ability to visually discriminate between classes. The quality of class separation is estimated by computing the predictive accuracy of a k-nearest neighbor classifier on a data set consisting of the x and y positions of the projected data points and their class information. The paper introduces the method and presents experimental results which show that VizRank's ranking of projections agrees closely with subjective rankings by data analysts. The practical use of VizRank is also demonstrated by an application in the field of functional genomics.
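
For readers who want to experiment, the scoring step can be reproduced along the following lines. This is a minimal sketch that assumes scikit-learn and uses plain cross-validated k-NN accuracy as the projection score; the scoring function and evaluation protocol used in the paper may differ in detail.

    # Minimal sketch of VizRank-style projection scoring (assumes scikit-learn).
    # A projection is evaluated by how well a k-NN classifier predicts the class
    # from the 2-D screen coordinates (x, y) of the projected points alone.
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def score_projection(xy, labels, k=10, folds=10):
        """Estimate class separation in a 2-D projection.

        xy     : (n_samples, 2) array of projected point positions
        labels : (n_samples,)   array of class labels
        """
        knn = KNeighborsClassifier(n_neighbors=k)
        # Cross-validated accuracy stands in for the paper's projection score.
        return cross_val_score(knn, xy, labels, cv=folds).mean()

    def score_scatterplot(X, labels, i, j, **kwargs):
        """Score the scatterplot projection defined by attributes i and j
        of a numeric data matrix X (numpy array)."""
        return score_projection(X[:, [i, j]], labels, **kwargs)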


Data sets

The data sets used in our experiments are:

wine - results of a chemical analysis of three types of wine
voting - United States congressional voting records from 1984
imports-85 - car imports in 1985; the class value was discretized
housing - housing values in the suburbs of Boston
credit - Japanese credit screening
circlet - upper limb motion measured with a haptic interface
yeast - data on the budding yeast Saccharomyces cerevisiae


Experiments

Psychological experiment

Here are the projections used in our psychological experiment, in which we evaluated the agreement between human assessments of projections and assessments made by VizRank.

Here are the selected data projections that were evaluated by the raters, grouped by data set and visualization method:

wine
voting
imports-85
housing
credit
circlet
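
Relating the ratings collected in this experiment to VizRank's scores amounts to comparing two rankings of the same projections. The sketch below uses Spearman rank correlation from SciPy as a simple stand-in for the concordance statistics reported in the paper; the score values shown are made up purely for illustration.

    # Illustrative sketch: agreement between rater scores and VizRank scores
    # for the same set of projections (hypothetical numbers; the paper reports
    # its own concordance statistics).
    from scipy.stats import spearmanr

    vizrank_scores = [0.92, 0.85, 0.77, 0.64, 0.51]   # VizRank projection scores
    rater_scores   = [4.8, 4.1, 4.3, 2.9, 2.2]        # mean ratings by human raters

    rho, p_value = spearmanr(vizrank_scores, rater_scores)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")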

Interesting projections from the case study on the yeast data set

We found several interesting and biologically relevant projections of the yeast data set.
Here we present the 10 best projections obtained with the scatterplot and radviz visualization methods.
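
For reference, radviz places one anchor per attribute on the unit circle and positions each example at the weighted average of the anchors, weighted by its attribute values normalized to [0, 1]. Below is a minimal sketch of this mapping, following the usual radviz conventions rather than code taken from the paper.

    # Minimal sketch of the radviz mapping: each attribute gets an anchor on the
    # unit circle; an example is placed at the weighted average of the anchors,
    # weighted by its attribute values normalized to [0, 1].
    import numpy as np

    def radviz(X):
        """Map an (n_samples, n_attrs) matrix to 2-D radviz coordinates."""
        X = np.asarray(X, dtype=float)
        # Normalize each attribute to [0, 1] so the weights are comparable.
        mins, maxs = X.min(axis=0), X.max(axis=0)
        W = (X - mins) / np.where(maxs > mins, maxs - mins, 1.0)

        n_attrs = X.shape[1]
        angles = 2 * np.pi * np.arange(n_attrs) / n_attrs
        anchors = np.column_stack([np.cos(angles), np.sin(angles)])  # (n_attrs, 2)

        totals = W.sum(axis=1, keepdims=True)
        totals[totals == 0] = 1.0          # avoid division by zero
        return (W @ anchors) / totals      # (n_samples, 2) point positions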

Performance of the search heuristic

We implemented an efficient search heuristic that enables VizRank to evaluate only a small subset of the possible projections in order to find the best ones.
Here are the results of an empirical evaluation of the heuristic on a number of data sets.
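
As an illustration of the general idea behind such a heuristic (the exact procedure is described in the paper), candidate projections can be visited in order of the individual quality of their attributes, so that promising projections are evaluated first and the search can stop after a fixed budget. In the sketch below the attribute measure (mutual information), the ordering rule, and the budget are all assumptions made for illustration.

    # Illustrative sketch of a guided search over attribute pairs: rank the
    # attributes by an individual quality estimate, then evaluate scatterplot
    # projections in decreasing order of the summed attribute scores, within a
    # fixed budget of evaluations.
    from itertools import combinations
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def guided_search(X, labels, budget=100, k=10):
        """Evaluate at most `budget` attribute-pair projections, best-first.
        X is a numeric data matrix (numpy array), labels the class column."""
        # Score attributes individually (mutual information is an assumption).
        attr_quality = mutual_info_classif(X, labels)
        # Order candidate pairs by the combined quality of their attributes.
        pairs = sorted(combinations(range(X.shape[1]), 2),
                       key=lambda p: attr_quality[p[0]] + attr_quality[p[1]],
                       reverse=True)
        knn = KNeighborsClassifier(n_neighbors=k)
        results = []
        for i, j in pairs[:budget]:
            score = cross_val_score(knn, X[:, [i, j]], labels, cv=10).mean()
            results.append(((i, j), score))
        return sorted(results, key=lambda r: r[1], reverse=True)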



