Graph Data Structure
Many machine learning algorithms (e.g. Isomap, UMAP, etc.) employ graph data structures internally. Graphs are mathematical structures used to model pairwise relations between objects from a certain collection. A graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. The interconnected objects are represented by mathematical abstractions called vertices (also called nodes or points), and the links that connect some pairs of vertices are called edges (also called lines). The edges may be directed (asymmetric) or undirected (symmetric).
There are different ways to store graphs in a computer system. The data
structure used depends on both the graph structure and the algorithm
used for manipulating the graph. Theoretically one can distinguish between
list and matrix structures but in concrete applications the best structure
is often a combination of both. List structures are often preferred for
sparse graphs as they have smaller memory requirements. Matrix structures
on the other hand provide faster access for some applications but can
consume huge amounts of memory. In Smile, we support both adjacency list
(class AdjacencyList
) and adjacency matrix
(class AdjacencyMatrix
).
In adjacency list, each vertex has a list of which vertices it is adjacent to. This causes redundancy in an undirected graph. Adjacency queries are faster, at the cost of extra storage space.
An adjacency matrix is an n by n matrix A, where n is the number of vertices in the graph. If there is an edge from a vertex x to a vertex y, then the element A(x,y) is 1 (or in general the number of edges or a weight), otherwise it is 0. In computing, this matrix makes it easy to find subgraphs, and to reverse a directed graph.
Both AdjacencyList
and AdjacencyMatrix
implement
the abstract interface Graph
. The method dot()
returns the graphic representation in Graphviz dot format. You may try
http://viz-js.com/ to visualize the returned string.
import smile.graph.*;
var graph = new AdjacencyList(8);
graph.addEdge(0, 2);
graph.addEdge(1, 7);
graph.addEdge(2, 6);
graph.addEdge(7, 4);
graph.addEdge(3, 4);
graph.addEdge(3, 5);
graph.addEdge(5, 4);
String[] label = {"a", "b", "c", "d"}
graph.dot("SmileGraph", label);
In this example, we create a DOT graph with name 'SmileGraph' with 4 node labels. Since we have 8 nodes, the rest of nodes will use their integer ID as label.
Graph Traversal
The basic operation on graph is traversal, i.e. to visit the vertices in some
systematic order. Typically, we have breadth first search (BFS) and depth first
search (DFS). Both of these construct spanning trees with certain properties
useful in other graph algorithms. The generic methods bfs(VertexVisitor)
and dfs(VertexVisitor)
take a user-define VertexVisitor
functor to perform specific operations when visiting a vertex.
graph.dfs(i -> println("Visiting vertex " + i));
Connected Components
An application of graph traversal is to compute connected components.
A connected component of an undirected graph is a connected subgraph
that is not part of any larger connected subgraph. The components of
any graph partition its vertices into disjoint sets, and are the
induced subgraphs of those sets. In Graph
class, the methods
bfcc()
and dfcc()
compute the connected
components with BFS and DFS algorithms, respectively. These methods
return a two-dimensional array of which each row is the vertex IDs
in the same connected component.
// compute connected components with BFS
graph.bfcc();
// compute connected components with DFS
graph.dfcc();
A strongly connected component of a directed graph is a maximal subgraph where every pair of vertices is mutually reachable. Note that a conventional DFS method cannot be used to find the strongly connected components. Instead, one may leverage Kosaraju's algorithm or Tarjan's algorithm to compute strongly connected components.
Topological Sort
With DFS or BFS, we can also obtain the topological ordering of a directed graph, which is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering. For instance, the vertices of the graph may represent tasks to be performed, and the edges may represent constraints that one task must be performed before another; in this application, a topological ordering is just a valid sequence for the tasks. A topological ordering is possible if and only if the graph has no directed cycles, that is, if it is a directed acyclic graph (DAG).
var graph = new AdjacencyList(13, true);
graph.addEdge(8, 7);
graph.addEdge(7, 6);
graph.addEdge(0, 1);
graph.addEdge(0, 2);
graph.addEdge(0, 3);
graph.addEdge(0, 5);
graph.addEdge(0, 6);
graph.addEdge(2, 3);
graph.addEdge(3, 4);
graph.addEdge(3, 5);
graph.addEdge(6, 4);
graph.addEdge(6, 9);
graph.addEdge(4, 9);
graph.addEdge(9, 10);
graph.addEdge(9, 11);
graph.addEdge(9, 12);
graph.addEdge(11, 12);
graph.dfsort(); // topological sort with DFS
graph.bfsort(); // topological sort with BFS
Minimum Spanning Tree
A minimum spanning tree (MST) is a subset of the edges of a connected,
edge-weighted undirected graph that connects all the vertices together,
without any cycles and with the minimum possible total edge weight.
That is, it is a spanning tree whose sum of edge weights is as small as
possible. There are many use cases for minimum spanning trees.
With the method prim
, we can calculate an MST by Prim's algorithm.
var graph = new AdjacencyMatrix(6);
graph.addEdge(0, 1, 0.41);
graph.addEdge(1, 2, 0.51);
graph.addEdge(2, 3, 0.50);
graph.addEdge(4, 3, 0.36);
graph.addEdge(3, 5, 0.38);
graph.addEdge(3, 0, 0.45);
graph.addEdge(0, 5, 0.29);
graph.addEdge(5, 4, 0.21);
graph.addEdge(1, 4, 0.32);
graph.addEdge(4, 2, 0.32);
graph.addEdge(5, 1, 0.29);
List mst = new ArrayList<>();
double cost = graph.prim(mst);
mst.forEach(edge -> println(edge))
Shortest Path
With the method dijkstra()
, we can calculate the shortest path
from a source to all other vertices in the graph by Dijkstra algorithm.
If no parameter is provided, Smile will calculate all pairwise shortest
path between all vertecies.
Many manifold algorithms employ the shortest path among the samples as
a similarity measure instead of pairwise distance.
var graph = new AdjacencyMatrix(6, true);
graph.addEdge(0, 1, 0.41);
graph.addEdge(1, 2, 0.51);
graph.addEdge(2, 3, 0.50);
graph.addEdge(4, 3, 0.36);
graph.addEdge(3, 5, 0.38);
graph.addEdge(3, 0, 0.45);
graph.addEdge(0, 5, 0.29);
graph.addEdge(5, 4, 0.21);
graph.addEdge(1, 4, 0.32);
graph.addEdge(4, 2, 0.32);
graph.addEdge(5, 1, 0.29);
double[][] distances = graph.dijkstra();
Travelling Salesman Problem
The travelling salesman problem (TSP) is another important task in graph theory, combinatorial optimization, and operations research. It asks the following question: "Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?" It is an NP-hard problem. Smile provides the Held-Karp algorithm to compute the optimal solution of TSP. It is an efficient dynamic programming algorithm, significantly better than the superexponential performance {\displaystyle \Theta (n!)} of a brute-force algorithm.
var graph = new AdjacencyMatrix(6);
graph.addEdge(0, 1, 0.41);
graph.addEdge(1, 2, 0.51);
graph.addEdge(2, 3, 0.50);
graph.addEdge(4, 3, 0.36);
graph.addEdge(3, 5, 0.38);
graph.addEdge(3, 0, 0.45);
graph.addEdge(0, 5, 0.29);
graph.addEdge(5, 4, 0.21);
graph.addEdge(1, 4, 0.32);
graph.addEdge(4, 2, 0.32);
graph.addEdge(5, 1, 0.29);
int[] tour = graph.heldKarp();
Although the Held-Karp algorithm is significantly better than a brute-force algorithm, it is still too slow to apply to large graphs. To process TSPs containing thousands of cities, Smile provides a branch-and-bound algorithm.
// Branch-and-bound algorithm
int[] tour = graph.tsp();
For even large TSP problems, Smile provides several heuristic and approximate algorithms, which quickly yield good solutions. For example, nearest, furthest, and arbitrary insertion algorithms are implemented. These methods have a quadratic time complexity. Nearest insertion algorithm yields at most 2 times longer than the optimal solution in the worest case. Furthest insertion and arbitrary insertion have higher bound in the worest case. We can improve the result further with the 2-Opt algorithm, which is a simple local search algorithm. The main idea behind it is to take a route that crosses over itself and reorder it so that it does not.
// Heuristic algorithm
var tour = graph.nearestInsertion();
// 2-Opt algorithm
double cost = graph.opt2(tour, 3);
For tigher bound, Smile also implements the Christofides-Serdyukov algorithm yields a solution that is at most 1.5 times longer than the optimal solution in the worst case. However, it has a cubic time complexity.
var tour = graph.christofides();
Maximum Flow Problem
Another important graph task is the maximum flow problem that involves finding a feasible flow through a flow network that obtains the maximum possible flow rate. AdjacencyMatrix implements the push-relabel algorithm that is considered one of the most efficient maximum flow algorithms. The name "push-relabel" comes from the two basic operations used in the algorithm. Throughout its execution, the algorithm maintains a "preflow" and gradually converts it into a maximum flow by moving flow locally between neighboring nodes using push operations under the guidance of an admissible network maintained by relabel operations.
AdjacencyMatrix graph = new AdjacencyMatrix(6, true);
graph.addEdge(0, 1, 2);
graph.addEdge(0, 2, 9);
graph.addEdge(1, 2, 1);
graph.addEdge(1, 3, 0);
graph.addEdge(1, 4, 0);
graph.addEdge(2, 4, 7);
graph.addEdge(3, 5, 7);
graph.addEdge(4, 5, 4);
double[][] flow = new double[6][6];
double maxFlow = graph.pushRelabel(flow, 0, 5);