Graph Data Structure

Many machine learning algorithms (e.g. Isomap, UMAP, etc.) employ graph data structures internally. Graphs are mathematical structures used to model pairwise relations between objects from a certain collection. A graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. The interconnected objects are represented by mathematical abstractions called vertices (also called nodes or points), and the links that connect some pairs of vertices are called edges (also called lines). The edges may be directed (asymmetric) or undirected (symmetric).

There are different ways to store graphs in a computer system. The data structure used depends on both the graph structure and the algorithm used for manipulating the graph. Theoretically one can distinguish between list and matrix structures but in concrete applications the best structure is often a combination of both. List structures are often preferred for sparse graphs as they have smaller memory requirements. Matrix structures on the other hand provide faster access for some applications but can consume huge amounts of memory. In Smile, we support both adjacency list (class AdjacencyList) and adjacency matrix (class AdjacencyMatrix).

In adjacency list, each vertex has a list of which vertices it is adjacent to. This causes redundancy in an undirected graph. Adjacency queries are faster, at the cost of extra storage space.

An adjacency matrix is an n by n matrix A, where n is the number of vertices in the graph. If there is an edge from a vertex x to a vertex y, then the element A(x,y) is 1 (or in general the number of edges or a weight), otherwise it is 0. In computing, this matrix makes it easy to find subgraphs, and to reverse a directed graph.

Both AdjacencyList and AdjacencyMatrix implement the abstract interface Graph. The method dot() returns the graphic representation in Graphviz dot format. You may try http://viz-js.com/ to visualize the returned string.

Java


    import smile.graph.*;
            
    var graph = new AdjacencyList(8);
    graph.addEdge(0, 2);
    graph.addEdge(1, 7);
    graph.addEdge(2, 6);
    graph.addEdge(7, 4);
    graph.addEdge(3, 4);
    graph.addEdge(3, 5);
    graph.addEdge(5, 4);

    String[] label = {"a", "b", "c", "d"}
    graph.dot("SmileGraph", label);

In this example, we create a DOT graph with name 'SmileGraph' with 4 node labels. Since we have 8 nodes, the rest of nodes will use their integer ID as label.

Graph Traversal

The basic operation on graph is traversal, i.e. to visit the vertices in some systematic order. Typically, we have breadth first search (BFS) and depth first search (DFS). Both of these construct spanning trees with certain properties useful in other graph algorithms. The generic methods bfs(VertexVisitor) and dfs(VertexVisitor) take a user-define VertexVisitor functor to perform specific operations when visiting a vertex.

Java


    graph.dfs(i -> println("Visiting vertex " + i));

Connected Components

An application of graph traversal is to compute connected components. A connected component of an undirected graph is a connected subgraph that is not part of any larger connected subgraph. The components of any graph partition its vertices into disjoint sets, and are the induced subgraphs of those sets. In Graph class, the methods bfcc() and dfcc() compute the connected components with BFS and DFS algorithms, respectively. These methods return a two-dimensional array of which each row is the vertex IDs in the same connected component.

Java


    // compute connected components with BFS
    graph.bfcc();
    // compute connected components with DFS
    graph.dfcc();

A strongly connected component of a directed graph is a maximal subgraph where every pair of vertices is mutually reachable. Note that a conventional DFS method cannot be used to find the strongly connected components. Instead, one may leverage Kosaraju's algorithm or Tarjan's algorithm to compute strongly connected components.

Topological Sort

With DFS or BFS, we can also obtain the topological ordering of a directed graph, which is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering. For instance, the vertices of the graph may represent tasks to be performed, and the edges may represent constraints that one task must be performed before another; in this application, a topological ordering is just a valid sequence for the tasks. A topological ordering is possible if and only if the graph has no directed cycles, that is, if it is a directed acyclic graph (DAG).

Java


    var graph = new AdjacencyList(13, true);
    graph.addEdge(8, 7);
    graph.addEdge(7, 6);
    graph.addEdge(0, 1);
    graph.addEdge(0, 2);
    graph.addEdge(0, 3);
    graph.addEdge(0, 5);
    graph.addEdge(0, 6);
    graph.addEdge(2, 3);
    graph.addEdge(3, 4);
    graph.addEdge(3, 5);
    graph.addEdge(6, 4);
    graph.addEdge(6, 9);
    graph.addEdge(4, 9);
    graph.addEdge(9, 10);
    graph.addEdge(9, 11);
    graph.addEdge(9, 12);
    graph.addEdge(11, 12);

    graph.dfsort(); // topological sort with DFS
    graph.bfsort(); // topological sort with BFS

Minimum Spanning Tree

A minimum spanning tree (MST) is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. That is, it is a spanning tree whose sum of edge weights is as small as possible. There are many use cases for minimum spanning trees. With the method prim, we can calculate an MST by Prim's algorithm.

Java


    var graph = new AdjacencyMatrix(6);
    graph.addEdge(0, 1, 0.41);
    graph.addEdge(1, 2, 0.51);
    graph.addEdge(2, 3, 0.50);
    graph.addEdge(4, 3, 0.36);
    graph.addEdge(3, 5, 0.38);
    graph.addEdge(3, 0, 0.45);
    graph.addEdge(0, 5, 0.29);
    graph.addEdge(5, 4, 0.21);
    graph.addEdge(1, 4, 0.32);
    graph.addEdge(4, 2, 0.32);
    graph.addEdge(5, 1, 0.29);
    
    List mst = new ArrayList<>();
    double cost = graph.prim(mst);
    mst.forEach(edge -> println(edge))

Shortest Path

With the method dijkstra(), we can calculate the shortest path from a source to all other vertices in the graph by Dijkstra algorithm. If no parameter is provided, Smile will calculate all pairwise shortest path between all vertecies. Many manifold algorithms employ the shortest path among the samples as a similarity measure instead of pairwise distance.

Java


    var graph = new AdjacencyMatrix(6, true);
    graph.addEdge(0, 1, 0.41);
    graph.addEdge(1, 2, 0.51);
    graph.addEdge(2, 3, 0.50);
    graph.addEdge(4, 3, 0.36);
    graph.addEdge(3, 5, 0.38);
    graph.addEdge(3, 0, 0.45);
    graph.addEdge(0, 5, 0.29);
    graph.addEdge(5, 4, 0.21);
    graph.addEdge(1, 4, 0.32);
    graph.addEdge(4, 2, 0.32);
    graph.addEdge(5, 1, 0.29);

    double[][] distances = graph.dijkstra();

Travelling Salesman Problem

The travelling salesman problem (TSP) is another important task in graph theory, combinatorial optimization, and operations research. It asks the following question: "Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?" It is an NP-hard problem. Smile provides the Held-Karp algorithm to compute the optimal solution of TSP. It is an efficient dynamic programming algorithm, significantly better than the superexponential performance {\displaystyle \Theta (n!)} of a brute-force algorithm.

Java


    var graph = new AdjacencyMatrix(6);
    graph.addEdge(0, 1, 0.41);
    graph.addEdge(1, 2, 0.51);
    graph.addEdge(2, 3, 0.50);
    graph.addEdge(4, 3, 0.36);
    graph.addEdge(3, 5, 0.38);
    graph.addEdge(3, 0, 0.45);
    graph.addEdge(0, 5, 0.29);
    graph.addEdge(5, 4, 0.21);
    graph.addEdge(1, 4, 0.32);
    graph.addEdge(4, 2, 0.32);
    graph.addEdge(5, 1, 0.29);
    
    int[] tour = graph.heldKarp();

Although the Held-Karp algorithm is significantly better than a brute-force algorithm, it is still too slow to apply to large graphs. To process TSPs containing thousands of cities, Smile provides a branch-and-bound algorithm.

Java


    // Branch-and-bound algorithm
    int[] tour = graph.tsp();

For even large TSP problems, Smile provides several heuristic and approximate algorithms, which quickly yield good solutions. For example, nearest, furthest, and arbitrary insertion algorithms are implemented. These methods have a quadratic time complexity. Nearest insertion algorithm yields at most 2 times longer than the optimal solution in the worest case. Furthest insertion and arbitrary insertion have higher bound in the worest case. We can improve the result further with the 2-Opt algorithm, which is a simple local search algorithm. The main idea behind it is to take a route that crosses over itself and reorder it so that it does not.

Java


    // Heuristic algorithm
    var tour = graph.nearestInsertion();
    // 2-Opt algorithm
    double cost = graph.opt2(tour, 3);

For tigher bound, Smile also implements the Christofides-Serdyukov algorithm yields a solution that is at most 1.5 times longer than the optimal solution in the worst case. However, it has a cubic time complexity.

Java


    var tour = graph.christofides();

Maximum Flow Problem

Another important graph task is the maximum flow problem that involves finding a feasible flow through a flow network that obtains the maximum possible flow rate. AdjacencyMatrix implements the push-relabel algorithm that is considered one of the most efficient maximum flow algorithms. The name "push-relabel" comes from the two basic operations used in the algorithm. Throughout its execution, the algorithm maintains a "preflow" and gradually converts it into a maximum flow by moving flow locally between neighboring nodes using push operations under the guidance of an admissible network maintained by relabel operations.

Java


    AdjacencyMatrix graph = new AdjacencyMatrix(6, true);
    graph.addEdge(0, 1, 2);
    graph.addEdge(0, 2, 9);
    graph.addEdge(1, 2, 1);
    graph.addEdge(1, 3, 0);
    graph.addEdge(1, 4, 0);
    graph.addEdge(2, 4, 7);
    graph.addEdge(3, 5, 7);
    graph.addEdge(4, 5, 4);

    double[][] flow = new double[6][6];
    double maxFlow = graph.pushRelabel(flow, 0, 5);

← Interpolation NLP →