Traditional convolutional neural networks (CNNs) have been hugely successful on a number of problems. In the case of visual processing, they work by sliding small filters (or kernels) across each pixel and its neighbours, producing a single output value at each position. The weights of these filters are trainable, which allows the network to automatically learn how to extract important features from the input. Because each filter is small, there are only a handful of weights to learn, making CNNs fast to train compared to most other networks. For a more detailed explanation of how CNNs work, check out: http://cs231n.github.io/convolutional-networks/.
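To make this concrete, here is a minimal Keras example of a convolutional layer (the filter count, kernel size, and input shape are just illustrative choices):

```python
from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential([
    # 32 filters, each a 3x3 kernel slid over a 28x28 grayscale image.
    # Only (3 * 3 * 1 + 1) * 32 = 320 weights to learn, regardless of image size.
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1))
])
```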
Applying CNNs to graphs is more problematic, since the number of neighbours of a node is not fixed. The kernel size would have to vary from node to node, effectively requiring a different kernel for each node, which would sacrifice the weight sharing that makes CNNs efficient and be computationally expensive. Kipf and Welling (2017) introduced a method for implementing graph convolutional networks (GCNs) based on the spectral properties of graphs. Their approach adds self-connections to the graph's adjacency matrix and applies a symmetric normalization to it. As a result, the updated feature vector for each node becomes a weighted combination of its own feature vector and those of its adjacent nodes. Thomas Kipf has a much better explanation of the approach on his blog: https://tkipf.github.io/graph-convolutional-networks/. The original paper can be found here: https://arxiv.org/abs/1609.02907.
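To see what the normalization does, here is a toy NumPy sketch of a single GCN layer (my own illustration, not code from the paper). The layer computes D^(-1/2) (A + I) D^(-1/2) H W, where A is the adjacency matrix, H holds the node features and W is a trainable weight matrix:

```python
import numpy as np

# Toy path graph on 3 nodes: 0 - 1 - 2.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

A_hat = A + np.eye(3)                          # add self-connections
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # D^(-1/2) as a vector
A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

H = np.random.randn(3, 4)   # one 4-dimensional feature vector per node
W = np.random.randn(4, 2)   # trainable weights (random here)
H_next = A_norm @ H @ W     # each node's new features mix in its neighbours'
```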
The original paper did not take edge labels or direction into account, but more recent work on relational GCNs has improved upon this (https://arxiv.org/abs/1703.06103). In essence, the approach is to construct a separate adjacency matrix for each unique label-direction pair, each with its own set of trainable weights, as sketched below.
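A rough sketch of what that construction looks like (the names and layout are my own, not from the paper's code):

```python
import numpy as np

n = 3  # number of nodes
# Directed, labelled edges as (head, dependent, label) triples.
edges = [(1, 0, 'nsubj'), (1, 2, 'dobj')]

adjacencies = {}
for head, dep, label in edges:
    # One matrix for the forward direction, one for the reverse.
    fwd = adjacencies.setdefault(label, np.zeros((n, n)))
    rev = adjacencies.setdefault(label + '_inv', np.zeros((n, n)))
    fwd[head, dep] = 1.0
    rev[dep, head] = 1.0
```

Each of these matrices then gets its own weight matrix inside the relational GCN layer.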
There are numerous problems where input data can take the form of a graph, but I wanted to focus on the applications of GCNs to natural language processing, specifically sequence labelling. To turn a sentence into a graph, I passed it through a dependency parser, which produces a graph of syntactic relations between the words. I used spaCy's dependency parser: https://spacy.io/.
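Extracting the edges with spaCy only takes a few lines; something like the following (the model name is just an example):

```python
import spacy

nlp = spacy.load('en_core_web_sm')  # assumes the small English model is installed
doc = nlp('The cat sat on the mat')

# Each token has exactly one syntactic head, so the parse gives us labelled,
# directed (head -> dependent) edges between word indices. The root token is
# its own head, which conveniently yields a self-loop.
edges = [(token.head.i, token.i, token.dep_) for token in doc]
```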
I modified Thomas Kipf's relational GCN implementation (https://github.com/tkipf/relational-gcn) to make it compatible with Keras 2 on the TensorFlow backend. This makes it possible to train the model on a GPU, significantly decreasing the training time. One consequence was that a data generator had to be used to convert the sparse matrices to dense batches on the fly. I evaluated the system on the CoNLL-2003 named entity recognition dataset. The current state-of-the-art for this dataset is a 92.22 F1 score, achieved by a CNN-BiLSTM-CRF.
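The generator itself can be as simple as a keras.utils.Sequence subclass. Here is a rough sketch with invented names, not the exact code from the repository:

```python
import numpy as np
from keras.utils import Sequence

class GraphBatchGenerator(Sequence):
    """Yields one sentence-graph per batch, densifying its sparse matrices lazily."""

    def __init__(self, features, adjacencies, labels):
        # features[i] and adjacencies[i] are scipy.sparse matrices for sentence i.
        self.features = features
        self.adjacencies = adjacencies
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # Densify only the current graph, so the full dataset never has to
        # be held in memory as dense arrays.
        x = self.features[idx].toarray()[np.newaxis]
        adjs = [a.toarray()[np.newaxis] for a in self.adjacencies[idx]]
        y = self.labels[idx][np.newaxis]
        return [x] + adjs, y
```

An instance of this class can be passed straight to model.fit_generator(...) in Keras 2.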
The best score the system has achieved is a meagre 82.69 F1. Whilst this is hardly state-of-the-art, I think it is a good start, and that with further development, GCNs could be highly successful in NLP. I have made the project available on GitHub: https://github.com/JHart96/keras_gcn_sequence_labelling.