Background

Traditional convolutional neural networks (CNNs) have been hugely successful on a number of problems. For visual processing, they work by sliding small filters (or kernels) across each pixel and its neighbours, producing a single value at each position. The weights of these filters are trainable, which allows the network to learn automatically how to extract important features from the input. Because each filter is small, only a handful of weights need to be learned, making CNNs fast to train compared to most other networks. For a more detailed explanation of how CNNs work, check out: http://cs231n.github.io/convolutional-networks/.
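To make the sliding-filter idea concrete, here is a minimal NumPy sketch of a single convolution over a single-channel image, with zero-padding so the output stays the same size as the input. In a real CNN the filter weights are learned rather than hand-written, and the helper name here is just for illustration.

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over every pixel; zero-padding keeps the output
    # the same size as the original image.
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0., -1., 0.],
                   [-1., 4., -1.],
                   [0., -1., 0.]])
print(convolve2d(image, kernel).shape)  # (5, 5) -- same size as the input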

A traditional convolution applied to an image. Padding is applied to the edge of the original image to ensure that the convolved image is the same size as the original. The filter moves left-to-right, top-to-bottom.

Applying CNNs to graphs is more problematic, since the number of neighbours of a node is not fixed. The kernel size would have to vary from node to node, effectively requiring a different kernel for each node; this would lose the weight-sharing benefits of CNNs and be computationally expensive. Kipf and Welling (2017) introduced a method for implementing graph convolutional networks (GCNs) based on the spectral properties of graphs. Their approach adds self-loops to the adjacency matrix of the graph and applies a symmetric normalization to it, so that the feature vector for each node becomes a function of the feature vectors of its adjacent nodes. Thomas Kipf has a much better explanation of the approach on his blog: https://tkipf.github.io/graph-convolutional-networks/. The original paper can be found here: https://arxiv.org/abs/1609.02907.
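In symbols, one layer computes H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), where A is the adjacency matrix, I adds the self-loops, and D is the degree matrix of A + I. A minimal NumPy sketch of such a layer follows (my paraphrase of the paper's propagation rule, not the Keras implementation used later):

import numpy as np

def gcn_layer(A, X, W):
    # One graph convolution: add self-loops, symmetrically normalize by degree,
    # aggregate neighbouring features, then apply the learned weights.
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_norm @ X @ W)  # ReLU activation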

In GCNs, nodes must be convolved with varying numbers of other nodes, depending on their degree. This makes it impossible to define a fixed kernel in the traditional sense. 

The original paper did not take edge labels or direction into account, but more recent work on relational GCNs has improved upon this (https://arxiv.org/abs/1703.06103). In essence, the approach is to construct a separate adjacency matrix for each unique label-direction pair, each with its own weight matrix.
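A rough sketch of that relational variant, assuming the per-relation adjacency matrices have already been built (a simple row normalization is used here, and the paper's regularization of the per-relation weight matrices is omitted):

import numpy as np

def relational_gcn_layer(adjacencies, X, rel_weights, W_self):
    # adjacencies and rel_weights are dicts keyed by (label, direction) pairs.
    # Each relation passes messages through its own weight matrix, and every
    # node also keeps a transformed copy of its own features (self-connection).
    out = X @ W_self
    for rel, A in adjacencies.items():
        deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)  # avoid division by zero
        out = out + (A / deg) @ X @ rel_weights[rel]
    return np.maximum(0.0, out)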

Keras implementation

There are numerous problems where input data naturally takes the form of a graph, but I wanted to focus on applying GCNs to natural language processing, specifically sequence labelling. To turn a sentence into a graph, I passed it through a dependency parser, which produces a tree of grammatical relations between the words. I used spaCy’s dependency parser: https://spacy.io/.

A sentence passed through spaCy’s dependency parser can easily be converted into a graph representation where words become nodes, and dependencies become edges.
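As a rough illustration of that conversion (a hypothetical helper, not the code in the repository, and assuming the en_core_web_sm model is installed): each token becomes a node, and each dependency arc contributes one adjacency matrix entry per label-direction pair.

import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def sentence_to_graph(sentence):
    # Parse the sentence and build one adjacency matrix per (dependency label,
    # direction) pair, so edge labels and direction are both preserved.
    doc = nlp(sentence)
    n = len(doc)
    adjacencies = {}
    for token in doc:
        if token.head.i == token.i:  # the root points to itself; skip it
            continue
        for key, (src, dst) in (((token.dep_, "out"), (token.head.i, token.i)),
                                ((token.dep_, "in"), (token.i, token.head.i))):
            A = adjacencies.setdefault(key, np.zeros((n, n)))
            A[src, dst] = 1.0
    return [t.text for t in doc], adjacencies

words, adjacencies = sentence_to_graph("The cat sat on the mat")
print(words, sorted(adjacencies))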

I modified Thomas Kipf’s relational GCN implementation (https://github.com/tkipf/relational-gcn) to make it compatible with Keras 2 on the TensorFlow backend. This makes it possible to train the model on a GPU, significantly decreasing the training time. One consequence was that a data generator had to be used to convert the sparse matrices on the fly, batch by batch. I evaluated the system on the CoNLL-2003 named entity recognition dataset. The current state-of-the-art for this dataset is an F1 score of 92.22, achieved by a CNN-BiLSTM-CRF.
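A minimal sketch of that kind of generator, assuming SciPy sparse inputs already padded to a common sentence length, a fixed set of relations per sentence, and Keras's fit_generator-style interface (the generator in the repository may differ in its details):

import numpy as np

def graph_batch_generator(features, adjacencies, labels, batch_size=32):
    # features: list of sparse (n_words, n_feats) matrices,
    # adjacencies: list of lists of sparse (n_words, n_words) matrices
    #   (one per label-direction pair), labels: list of dense label arrays.
    # Sparse matrices are densified one batch at a time, so the whole dataset
    # never has to be held in memory as dense arrays.
    n = len(features)
    while True:  # Keras generators are expected to loop forever
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            X = np.stack([features[i].toarray() for i in idx])
            A = np.stack([np.stack([a.toarray() for a in adjacencies[i]]) for i in idx])
            y = np.stack([labels[i] for i in idx])
            yield [X, A], y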

The best score the system has achieved is a meagre F1 of 82.69. Whilst this is hardly state-of-the-art, I think it is a good start, and with further development GCNs could be highly successful in NLP. I have made the project available on GitHub: https://github.com/JHart96/keras_gcn_sequence_labelling.
