ELMo embeddings

Pre-trained word embeddings have been highly successful in numerous natural language processing tasks. Traditional word embeddings are essentially lookup tables that map a word to its corresponding feature vector, so the context in which a word appears is not taken into account when generating its embedding. Allen AI recently released a paper introducing a method that overcomes this limitation: Embeddings from Language Models (ELMo). The paper can be found here:

ELMo embeddings work by first training a bidirectional LSTM on a large corpus for a general language-modelling task. The original paper used the 1 Billion Word Benchmark dataset, but Allen AI has since released a version trained on 5.5 billion words. Once the model is trained, word embeddings are generated by feeding sentences into the bidirectional LSTM and taking the internal states of its layers. Embeddings can be taken from the internal states of any one of the 3 layers, or they can be taken as a weighted sum of all 3 layers. These weights can be learned during task-specific training.

In “elmo” mode, the embeddings are a weighted sum of all 3 layers.
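As a rough sketch of what that weighted sum computes (using made-up 4-dimensional states rather than the real 1024-dimensional ones), the combination from the paper is a softmax-normalised weighted sum of the three layer states, scaled by a learned factor gamma:

```python
import numpy as np

def softmax(w):
    e = np.exp(w - np.max(w))
    return e / e.sum()

# Internal states for one word from each of the 3 layers
# (toy 4-dimensional vectors; the real module uses 1024 dimensions).
layer_states = np.array([
    [0.1, 0.2, 0.3, 0.4],  # layer 0: context-independent token layer
    [0.5, 0.1, 0.0, 0.2],  # layer 1: first biLSTM layer
    [0.3, 0.3, 0.3, 0.3],  # layer 2: second biLSTM layer
])

# Task-specific scalars; in the paper these are learned alongside
# the downstream model. The values here are arbitrary.
s = softmax(np.array([1.0, 0.5, 0.2]))  # normalised layer weights
gamma = 1.0                             # overall scale

elmo_embedding = gamma * (s[:, None] * layer_states).sum(axis=0)
print(elmo_embedding.shape)  # (4,)
```

The softmax keeps the layer weights on a comparable scale regardless of their raw values, which is why the weights can be learned freely during task-specific training.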

By taking the internal states of the network, we get a context-dependent representation of the word. ELMo embeddings work extremely well in practice. In their evaluation on 6 common NLP tasks, Allen AI found that adding ELMo embeddings to existing models led to significant improvements over the state of the art in every task.

Keras implementation

I’ve written a Keras layer that makes it easy to include ELMo embeddings in any existing Keras model. The layer wraps a TensorFlow Hub module but provides a standard Keras interface, making it completely interchangeable with a regular Keras embedding layer. The layer can output either a mean-pooled embedding of the whole sentence or an embedding of each word.

Full documentation and code for the layer are available in the repository on Github. Below, I’ll give a few examples of how it works. These examples are modified from the Keras examples directory.

Example: Sentiment Analysis I

To show how the layer works in practice, below is an example of a model for sentiment analysis on the IMDB dataset built into Keras. This first example uses sentence-level embeddings, which are a mean pooling of the word-level embeddings; this mode is called “default”.
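The full model code lives in the repository, but the “default” mode itself is easy to sketch: it averages the word-level vectors across the sentence to produce one fixed-size vector. Here is a minimal illustration with toy 4-dimensional word vectors standing in for real ELMo output:

```python
import numpy as np

# Toy word-level embeddings for a 3-word sentence (4 dims each;
# real ELMo vectors are 1024-dimensional).
word_embeddings = np.array([
    [0.2, 0.4, 0.1, 0.3],
    [0.6, 0.0, 0.5, 0.1],
    [0.1, 0.2, 0.9, 0.2],
])

# "default" mode: mean-pool over the word axis to get a single
# sentence-level embedding, independent of sentence length.
sentence_embedding = word_embeddings.mean(axis=0)
print(sentence_embedding.shape)  # (4,)
```

Because the pooled vector has a fixed size regardless of sentence length, it can feed directly into ordinary dense layers for classification.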

Example: Sentiment Analysis II

If you’re looking for word-level embeddings, here is an example using ELMo embeddings as the input to a convolutional neural network for the same task as above.
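To give a feel for what such a convolutional model consumes (the actual Keras code is in the repository), here is a minimal NumPy sketch of a 1-D convolution sliding over a sequence of word-level embedding vectors. The filter width, filter count, and dimensions are made up for illustration:

```python
import numpy as np

seq_len, emb_dim = 6, 4          # toy sizes; real ELMo vectors are 1024-d
filter_width, n_filters = 3, 2   # illustrative hyperparameters

rng = np.random.default_rng(0)
words = rng.normal(size=(seq_len, emb_dim))           # word-level embeddings
filters = rng.normal(size=(n_filters, filter_width, emb_dim))

# Valid 1-D convolution over the word axis: each filter spans
# `filter_width` consecutive word vectors at a time.
out_len = seq_len - filter_width + 1
features = np.zeros((out_len, n_filters))
for i in range(out_len):
    window = words[i:i + filter_width]                # (3, 4) slice
    features[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))

print(features.shape)  # (4, 2)
```

Each row of `features` corresponds to one position in the sentence, which is why word-level (rather than mean-pooled) embeddings are needed as input here.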


ELMo embeddings are significantly more computationally expensive than traditional embeddings, but their ability to easily boost performance on NLP tasks makes them useful to have in the toolkit. I hope you found this post useful; be sure to check out the repository over on Github:

Categories: Projects


YunxiaDing · November 4, 2018 at 3:06 am

I use the Keras framework. When I use “pip install allennlp” to install ELMo,
it gives an error: ” Could not find a version that satisfies the requirement torch=0.4.1 (from allennlp) (from versions: 0.1.2, 0.1.2.post1)
No matching distribution found for torch=0.4.1 (from allennlp)”.
How do I install ELMo in Keras?
Thank you!

    admin · November 6, 2018 at 2:29 pm


    You don’t need to install allennlp for this Keras layer.
    As long as you have the installations in the “requirements.txt” file, you should be able to use the layer.


giuslan · June 26, 2019 at 4:31 pm

thanks for your contribution and for your neat code, I used your class in a Keyphrase Extraction project.
Only one point: the method get_config() must return a dictionary, not a list, or you will not be able to load a saved model. I used keras + tf backend, and the Model class API (not the Sequential).
I hope this would be helpful for you.
