Description

This page contains supplemental material for the following paper submitted to Bioinformatics:

A combinatorial approach to graphlet counting
Tomaž Hočevar and Janez Demšar

Abstract

Motivation: Graphlets are emerging as a possible tool for exploration of global and local structure of networks, and for analysis of roles of individual nodes. One of the obstacles to their wider use is the computational complexity of algorithms for their discovery and counting.
Results: We propose a new, combinatorial method for counting graphlets and orbit signatures of network nodes. The algorithm builds a system of equations that connect counts of orbits from graphlets with up to 5 nodes, which allows to compute all orbit counts by enumerating just a single one. This reduces the time complexity in sparse graphs by an order of magnitude as compared to the existing, pure enumeration based algorithms.
Availability: Source code is available freely at http://www.biolab.si/supp/orca

Download

Orca is free software provided under GNU General Public License. You can download the C++ source code and the test data that was used in our evaluation experiments.

  • Source code: orca.zip

    Also includes a windows binary which should be run from Command Prompt.

  • Test data: data.zip

    This archive contains random networks that were used as a benchmark.

Installation instructions

Orca is implemented as a simple command-line utility that consists of a single C++ source file. It has no external dependencies and requires only a relatively modern compiler that supports unordered_map from the C++11 standard, which most do.

Current GCC compiler requires a -std=c++11 flag. For example, you can compile the source using MinGW compiler on Windows with:

g++ -O2 -std=c++11 -o orca.exe orca.cpp

Usage

The utility takes three command-line arguments.

orca.exe 5 example.in example.out
  1. graphlet size

    This indicates the size of graphlets that you wish to count and should be either 4 or 5.

  2. input file

    Input file describes the network in a simple text format. The first line contains two integers n and e - the number of nodes and edges. The following e lines describe undirected edges with space-separated ids of their endpoints. Node ids should be between 0 and n-1 (see example.in).

  3. output file

    Output file will consist of n lines, one for each node in a graph from 0 to n-1. Every line will contain 15 or 73 space-separated orbit counts depending on the specified graphlet size.

The computation consists of three stages. Orca will display the time required for previous stages and progress of the current stage.

nodes: 18170
edges: 137775
max degree: 9716
stage 1 - precomputing common nodes
0.37
stage 2 - counting full graphlets
0.18
stage 3 - building systems of equations
37%