
Graph Neural Network

In this blog post, I explore and summarize graph neural networks and related concepts based on several online blogs, papers, and the book “Deep Learning on Graphs”.

To begin with, the motivations for graph neural networks can be described as follows:

  • Feature Engineering: select a small subset of the graph features that has minimal redundancy but maximal relevance to the target.
    • Here is an example: given a graph $\mathcal{G}=\{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ denotes the set of vertices and $\mathcal{E}$ the set of edges, assume each vertex has a set of features $\mathcal{F}=\{f_1,f_2,\dots, f_d\}$; the goal is to select $K$ features from $\mathcal{F}$ to represent the node, where $K\ll d$.
  • Representation Learning: automatically learn the features based on the downstream tasks.
    • Basically, the goal is to train a mapping function $f:X\rightarrow Y$ that maps the node features $\mathcal{F}$ in $R^d$ to embeddings in $R^K$.

Graph-related tasks cover many learning settings:

  • Supervised Learning: most of these tasks operate on whole graphs.
    • Graph Classification: classify the labels of a set of graphs.
  • Semi-supervised Learning:
    • Inductive Learning: the whole graph is not available during training.
    • Transductive Learning: the whole graph is available during training.
      • Transductive Node Classification: classify the labels of a set of nodes.
  • Unsupervised Learning:

Consequently, graph neural networks have the following advantages:

  • They can handle graph-structured data better than CNNs or RNNs.
  • They propagate information along edges, guided by the graph structure, instead of treating edges as node features.
  • They can perform reasoning over graphs generated from non-structural data.

0. Foundations

0.0 Graphs

In this section, the concepts of graphs are introduced and defined formally. A graph is defined by its nodes and edges: $\mathcal{G}=\{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ denotes the set of vertices and $\mathcal{E}$ the set of edges.

In most cases, the graph connectivity is recorded by the adjacency matrix $\mathcal{A} \in \{0,1\}^{N\times N}$, where $\mathcal{A}_{i,j} = 1$ if vertices $v_i$ and $v_j$ are connected and $\mathcal{A}_{i,j} = 0$ otherwise. For each node, the degree is defined as the number of nodes it is connected to: \(d(v_i)=\sum_{v_j \in \mathcal{V}}\mathbb{1}[(v_i, v_j)\in\mathcal{E}]\)
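As a minimal NumPy sketch of these definitions, consider a small hypothetical 4-node graph (the graph itself is only an illustration):

```python
import numpy as np

# Adjacency matrix of a small example graph with 4 vertices:
# edges (v0,v1), (v0,v2), (v1,v2), (v2,v3)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

# The degree of each node is the number of nodes it is connected to,
# i.e. the row sum of the adjacency matrix.
degrees = A.sum(axis=1)
print(degrees)  # [2 2 3 1]
```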

0.0.1 Centrality

Centrality is an important measure of the importance of a node in a graph. There are four common types:

  • Degree Centrality: the degree of each node. (Treats all neighbours equally.)
  • Eigenvector Centrality: the entries of the eigenvector $c_e$ associated with the largest eigenvalue $\lambda$ of the adjacency matrix $A$ give the nodes’ centrality scores, because $\lambda \cdot c_e = A \cdot c_e$ follows directly from the definition of eigenvector centrality $c_e(v_i)=\frac{1}{\lambda}\sum_{j=1}^N A_{i,j}\cdot c_e(v_j)$. (Treats neighbours with different weights; see the sketch after this list.)
  • Katz Centrality: adds a constant term to the definition of eigenvector centrality: $c_k=\alpha A c_k+\beta$.
  • Betweenness Centrality: instead of looking at neighbours, betweenness centrality counts the shortest paths that pass through each node. It is computationally costly because all node pairs must be considered to obtain the final score.
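Here is a minimal NumPy sketch of degree and eigenvector centrality on the same hypothetical 4-node graph as above:

```python
import numpy as np

# Same hypothetical 4-node adjacency matrix as before.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Degree centrality: simply the node degrees.
degree_centrality = A.sum(axis=1)

# Eigenvector centrality: the eigenvector c_e associated with the largest
# eigenvalue lambda of A, since lambda * c_e = A * c_e.
eigvals, eigvecs = np.linalg.eigh(A)   # A is symmetric, so eigh applies
c_e = np.abs(eigvecs[:, -1])           # eigenvector of the largest eigenvalue
eigenvector_centrality = c_e / c_e.sum()  # normalize for readability

print(degree_centrality)        # node 2 has the highest degree
print(eigenvector_centrality)   # node 2 also has the highest eigenvector score
```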

0.0.2 Spectral Graph Theory

  • Laplacian Matrix: $L=D-A$, where $D$ is the diagonal degree matrix and $A$ is the adjacency matrix. Regarding the eigenvalues and eigenvectors of the Laplacian matrix, there are two key results (illustrated in the sketch after this list):
    • $L$ is positive semidefinite, i.e., its eigenvalues are nonnegative.
    • The number of 0 eigenvalues equals the number of connected components in the graph.
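A minimal NumPy sketch illustrating both results, using a hypothetical graph made of two disconnected triangles:

```python
import numpy as np

# Two disconnected triangles: a graph with 6 nodes and 2 connected components.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1

D = np.diag(A.sum(axis=1))     # diagonal degree matrix
L = D - A                      # (unnormalized) Laplacian matrix

eigvals = np.linalg.eigvalsh(L)            # L is symmetric positive semidefinite
print(np.all(eigvals >= -1e-9))            # True: no negative eigenvalues
print(np.sum(np.isclose(eigvals, 0)))      # 2: one zero eigenvalue per component
```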

0.0.3 Complex Graphs

  • Heterogeneous graphs: different types of vertices and edges.
  • Bipartite graphs: the vertex set can be divided into two disjoint subsets such that every edge connects a vertex in one subset to a vertex in the other.
  • Multidimensional graphs: different types of edges.
  • Signed graphs: positive and negative edges.
  • Hypergraphs: edges are generalized to hyperedges, each of which can connect more than two vertices.
  • Dynamic graphs: the graph changes over time, with timestamps attached to nodes and edges.

0.0.4 Tasks

  • Node Classification
  • Link Prediction
  • Graph Classification

0.1 Deep Learning

0.1.0 Deep Feedforward Networks

Feedforward networks are the basic deep learning model, built from multiple layers of neurons. Each layer can be represented by $f^{(i)}$ and the whole model can be represented as $F(x)=f^{(n)}(f^{(n-1)}(\cdots f^{(1)}(x)))$. The model starts from the input layer, feeds the data forward through the hidden layers, and finally returns the results from the output layer. Each neuron in the hidden layers computes \(h = \alpha(b+\sum_{i=1}^n w_i\cdot x_i)\) where $\alpha(\cdot)$ is the activation function, $w_i$ are the weights and $b$ is the bias.

Activation Functions
  • Rectifier: $\mathrm{ReLU}(z)=\max\{0, z\}$.
  • Logistic sigmoid: $\sigma(z)=\frac{1}{1+\exp(-z)}$
  • tanh: $\tanh(z)=\frac{2}{1+\exp(-2z)}-1=2\cdot\sigma(2z)-1$
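A minimal NumPy sketch of one hidden-layer forward pass using these activation functions; the layer sizes and random weights are placeholder choices, not a prescribed architecture:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # tanh(z) = 2*sigmoid(2z) - 1, matching the identity above
    return 2.0 * sigmoid(2.0 * z) - 1.0

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input features (placeholder dimension)
W = rng.normal(size=(4, 3))       # weights of one hidden layer with 4 neurons
b = rng.normal(size=4)            # bias terms

# Each neuron computes h = alpha(b + sum_i w_i * x_i); in matrix form:
h = relu(W @ x + b)
print(h)

# Sanity check of the identity tanh(z) = 2*sigma(2z) - 1
print(np.allclose(tanh(x), np.tanh(x)))  # True
```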

0.1.1 Convolutional Neural Network

A common situation for the convolution operation is that we have a noisy signal $f(t)$ and want to average the value at time $t$ with its nearby values using a weight function $g(\cdot)$. The basic (windowed) convolution operation can be defined as \((f\ast g)(t)=\int_{\tau=t-n}^{t+n}f(\tau)g(t-\tau)d\tau\)

In practical machine learning scenarios, the convolution operation is extended to data with higher dimensions. For a two-dimensional image $I$, the convolution operation can be performed with a two-dimensional kernel $K$: \(S(i,j) = (I\ast K)(i,j)=\sum_{\tau=i-n}^{i+n}\sum_{\gamma=j-n}^{j+n}I(\tau, \gamma)K(i-\tau,j-\gamma)\) So the convolution operation in deep learning originates from the one in signal processing, with a similar physical meaning.
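A minimal NumPy sketch of this discrete two-dimensional convolution with a “valid” output window; the 3×3 averaging kernel is just an illustrative choice that mirrors the signal-smoothing motivation above:

```python
import numpy as np

def conv2d(I, K):
    """Direct 2D convolution (kernel flipped), 'valid' output size."""
    kh, kw = K.shape
    H, W = I.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    Kf = K[::-1, ::-1]               # flipping the kernel makes this a true
    for i in range(out.shape[0]):    # convolution rather than a correlation
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i+kh, j:j+kw] * Kf)
    return out

# A 3x3 averaging kernel smooths the "image", just as the weight
# function g averages a noisy 1-D signal f.
I = np.arange(25, dtype=float).reshape(5, 5)
K = np.ones((3, 3)) / 9.0
print(conv2d(I, K))
```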

0.1.2 Recurrent Neural Network

The input of an RNN is a sequence, and each step produces two outputs: $y^{(i)}$ and $h^{(i)}$. \(h^{(i)}=\alpha(W_{hh}\cdot h^{(i-1)}+W_{hx}x^{(i)}+b_{h})\) \(y^{(i)}=\alpha(W_{yh}h^{(i)}+b_y)\) where the $W$ matrices perform linear transformations, the $b$ terms are biases, and $\alpha$ is the activation function.
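A minimal NumPy sketch of these recurrence equations as a vanilla RNN cell; the dimensions, tanh activation, and random weights are placeholder choices:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_hx, W_yh, b_h, b_y):
    """One step of a vanilla RNN: returns the new hidden state and the output."""
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t + b_h)
    y_t = np.tanh(W_yh @ h_t + b_y)
    return h_t, y_t

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 5, 2   # placeholder sizes
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_hx = rng.normal(size=(hidden_dim, input_dim))
W_yh = rng.normal(size=(output_dim, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

# Process a sequence of 4 input vectors, carrying the hidden state forward.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(4, input_dim)):
    h, y = rnn_step(x_t, h, W_hh, W_hx, W_yh, b_h, b_y)
    print(y)
```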

  • Long Short-Term Memory: each unit maintains two states, the cell state and the hidden state.
  • Gated Recurrent Unit: the cell state and the hidden state are merged, and the forget gate and the input gate are combined into a single update gate.

0.1.3 Tips

Preventing overfitting:

  • Weight Regularization
  • Dropout
  • Batch Normalization

Chapter 1: Basic Concepts

The goal of graph neural networks is to learn a state embedding that encodes the information of each node’s neighborhood. The state embedding $h_v$ is used to produce an output $o_v$, such as a distribution over predicted node labels. A parametric function $f$, shared among all nodes, is called the local transition function; another parametric function $g$, which produces the output of the node, is called the local output function. They are defined as follows, where $co[v]$ and $ne[v]$ denote the edges and neighbors of node $v$.

\[\begin{align} &h_v=f(x_v,x_{co[v]},h_{ne[v]},x_{ne[v]})\\ &o_v=g(h_v,x_v) \end{align}\]

In matrix form, the above formulas can be written as follows.

\[\begin{align} &H^{t+1}=F(H^t,X)\\ &O=G(H,X_N) \end{align}\]

With the target information $t_v$ for supervision, the loss function can be written as: \(loss=\sum_{i=1}^p(t_i-o_i)\)

where $p$ is the number of supervised nodes.
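A minimal NumPy sketch of this iterative state update, assuming a toy transition function $F$ built from a random adjacency matrix and small random weight matrices (all hypothetical) so that the iteration behaves like a contraction and converges to a fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, h = 5, 3, 4                       # nodes, feature dim, state dim (placeholders)

A = (rng.random((N, N)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T          # random symmetric adjacency matrix
X = rng.normal(size=(N, d))             # node features

# Small random weights so that F is (practically) a contraction map.
W = 0.1 * rng.normal(size=(h, h))
U = 0.1 * rng.normal(size=(d, h))

def F(H, X):
    # Each node's new state depends on its neighbors' states (A @ H) and its own features.
    return np.tanh(A @ H @ W + X @ U)

H = np.zeros((N, h))
for t in range(100):
    H_next = F(H, X)
    if np.linalg.norm(H_next - H) < 1e-6:   # stop once the fixed point is reached
        break
    H = H_next
print(t, H_next)
```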

Basically, there are two kinds of approaches: spectral methods and spatial methods. The former operates on the Laplacian eigenbasis, which depends on the graph structure, while the latter defines convolutions directly on the graph over spatially close neighbors.

Chapter 2: Graph Convolutional Networks

Limiting the spectral filter to first order ($K=1$), the graph convolution becomes a linear function of the Laplacian:

\[g_\theta\cdot x\approx\theta'_0x+\theta'_1(L-I_N)x=\theta'_0x-\theta'_1D^{-\frac{1}{2}}AD^{-\frac{1}{2}}x\]

with two free parameters $\theta'_0$ and $\theta'_1$. Constraining them to a single parameter ($\theta=\theta'_0=-\theta'_1$) and applying the renormalization trick $\tilde A=A+I_N$, $\tilde D_{ii}=\sum_j \tilde A_{ij}$ yields the layer-wise propagation rule

\[Z=\tilde D^{-\frac{1}{2}}\tilde A\tilde D^{-\frac{1}{2}}X\Theta\]
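A minimal NumPy sketch of this propagation rule on a toy 3-node path graph; the graph, features, and $\Theta$ are placeholders:

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """One GCN propagation step: Z = D~^{-1/2} A~ D~^{-1/2} X Theta."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                       # add self-loops: A~ = A + I_N
    D_tilde = A_tilde.sum(axis=1)                 # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(D_tilde ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ Theta

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)            # a 3-node path graph
X = rng.normal(size=(3, 4))                       # node features
Theta = rng.normal(size=(4, 2))                   # learnable weight matrix
print(gcn_layer(A, X, Theta))                     # per-node output embeddings
```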

Before going further, the definition of the graph Laplacian and its related concepts should be made clear.

$D$ is the degree matrix and $A$ is the adjacency matrix. On a graph, the first-order derivative can be defined as \(f'_{*g}(x)=f(x)-f(y)\) where $y$ is a neighbor of $x$. The Laplacian operator, a second-order derivative, is then defined as \(\Delta_{*g}f(x)=\sum_{y\sim x}(f(x)-f(y))\) which corresponds to the unnormalized Laplacian matrix \(L=D-A\) and, after symmetric normalization, to \(L_{sym}=I_N-D^{-\frac{1}{2}}AD^{-\frac{1}{2}}\)

Chapter 4: Graph Embedding

Graph embeddings map the nodes of a graph to a lower-dimensional feature space. Designing the mapping function comes down to answering two questions: 1. What information should be kept? 2. How should that information be preserved?

Several papers answered the first question:

  • neighborhood information:
    • Perozzi, Bryan, Al-Rfou, Rami, and Skiena, Steven. 2014. Deepwalk: Online learning of social representations. Pages 701–710 of: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
    • Tang, Jian, Qu, Meng, Wang, Mingzhe, Zhang, Ming, Yan, Jun, et al. 2015. LINE: Large-scale information network embedding. Pages 1067–1077 of: Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee.
    • Grover, Aditya, and Leskovec, Jure. 2016. node2vec: Scalable feature learning for networks. Pages 855–864 of: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
  • nodes’ structure role:
    • Ribeiro, Leonardo F.R., Saverese, Pedro H.P., and Figueiredo, Daniel R. 2017. struc2vec: Learning node representations from structural identity. Pages 385–394 of: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM
  • nodes’ status
  • community information

As to the second question, the basic idea is to ensure that the information to be preserved can be reconstructed from the node representations.

4.1 Simple Graphs

Node co-occurrence/Neighborhood

  • Random Walk (e.g., DeepWalk; see the sketch after this list)
  • node2vec
  • LINE
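To illustrate the random-walk idea behind DeepWalk, here is a minimal sketch that generates fixed-length unbiased walks from every node of a small hypothetical graph; in DeepWalk the resulting node sequences would then be fed to a skip-gram model (not shown), and all parameters here are placeholders:

```python
import random

# Adjacency list of a small hypothetical graph.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def random_walk(graph, start, walk_length):
    """Generate one unbiased random walk of the given length from `start`."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

random.seed(0)
# DeepWalk-style corpus: several walks per node, treated like "sentences"
# whose co-occurring nodes define the neighborhood information to preserve.
walks = [random_walk(graph, v, walk_length=5) for v in graph for _ in range(2)]
print(walks)
```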

Structural Role

Node Status

Community Structure

4.2 Complex Graphs

Heterogeneous

Bipartite Graph

Multidimensional Graph

Signed Graph

Hypergraph

Dynamic Graph

Chapter 7: Scalable Graph Neural Networks

Reference:

  • Ma, Yao, and Jiliang Tang. Deep learning on graphs. Cambridge University Press, 2021.
