Generating plausible paper titles with Recurrent Neural Networks

This is a fun project that occurred to me while reading, month after month, the email with the table of contents of the IEEE Transactions on Neural Networks and Learning Systems journal. It seemed to me that the titles followed a pattern: a few conjunctions, prepositions and particles, intermingled with a lot of field-specific keywords. So I thought it should be easy to learn to generate titles with a recurrent neural network and a small corpus. Let's see what I got.

Data

I exported the emails (see below) into a text file and, after some text processing, the final txt file contains all the titles (one title per line) from March 2016 until November 2019.

[Screenshot of one of the exported emails, taken 2019-11-07]

Pipeline

The whole pipeline can be found in my deep-learning-pipelines repository as an ipython notebook.

I start with the imports and then download the NLTK model data (you need to do this once):

import csv
import itertools
import operator
import numpy as np
import nltk
import sys
from datetime import datetime

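import matplotlib.pyplot as plt
%matplotlib inline

# The lines below belong in their own notebook cell:
# %%capture only works as the first line of a cell
%%capture
# Download NLTK model data (you need to do this once)
nltk.download("book")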

I read the file with the titles and now I am ready to check the data:

with open('ieee-tnnls-titles.txt', 'r') as f:
    text = f.read()

Data exploration

Let's explore the dataset a bit.

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len(set(text.split()))))
titles = text.splitlines()
print('Number of titles: {}'.format(len(titles)))

word_count_sentence = [len(title.split()) for title in titles]
print('Average number of words in each title: {}'.format(np.average(word_count_sentence)))

We have:

1207 titles,
around 2705 unique words, while
the average number of words in each title is 10.2

So the ratio of titles to unique words is about 0.45 (1207/2705), which probably means we should prune the vocabulary, and with it the word interactions to be learnt, to far fewer words.
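One way to judge how far the vocabulary can be pruned is to check what fraction of all token occurrences the k most frequent words cover. The snippet below is only a minimal sketch of that idea: the toy titles and the `coverage_at` helper are made up for illustration and are not part of the original notebook.

```python
from collections import Counter

def coverage_at(tokens, k):
    """Fraction of all token occurrences covered by the k most frequent tokens."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)

# Hypothetical toy corpus, not the real IEEE TNNLS titles
toy_titles = [
    "neural networks for control",
    "adaptive control for nonlinear systems",
    "deep neural networks for classification",
]
toy_tokens = " ".join(toy_titles).split()

# Coverage grows quickly with k when a few words dominate the corpus
for k in (1, 3, 5):
    print(k, round(coverage_at(toy_tokens, k), 2))
```

On the real corpus, `toy_tokens` would be replaced by `text.split()`; a small k with high coverage suggests that little is lost by mapping the rare words to a single unknown token.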

Pre-processing

My next step was to preprocess the titles: add a START token at the front and an END token at the back of each title, tokenize the titles into words, and remove non-alphabetical tokens.

First I declared three tokens to be used for a) unknown words, b) the start and c) the end of a title:

unknown_token = "UNKNOWN_TOKEN"
title_start_token = "TITLE_START"
title_end_token = "TITLE_END"

Then I proceed to sentence and word tokenization, adding the start/end tokens along the way:

from nltk.tokenize import sent_tokenize, word_tokenize

# Lowercase each title and split it into sentences
sentences = itertools.chain(*[sent_tokenize(x.lower()) for x in titles])
# Wrap every sentence with the start/end tokens, then split it into word tokens
tokenized_titles = ["%s %s %s" % (title_start_token, x, title_end_token) for x in sentences]
tokenized_titles = [word_tokenize(title) for title in tokenized_titles]
# Keep only alphabetic tokens, plus the two special tokens
special_tokens = (title_start_token, title_end_token)
tokenized_titles = [[token for token in title if token.isalpha() or token in special_tokens]
                    for title in tokenized_titles]

An example of a tokenized title:

['TITLE_START', 'object', 'detection', 'with', 'deep', 'learning', 'a', 'review', 'TITLE_END']
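The alphabetic filter in the last step has a side effect worth seeing in isolation: any token containing a hyphen, digit or punctuation mark is dropped entirely. The sketch below uses plain whitespace splitting as a stand-in for `nltk.word_tokenize` (an assumption made to keep the example self-contained), and `filter_tokens` is a hypothetical helper name.

```python
# Whitespace splitting stands in for nltk.word_tokenize (assumption for this sketch)
START, END = "TITLE_START", "TITLE_END"

def filter_tokens(tokens):
    """Keep alphabetic tokens plus the start/end markers."""
    # The markers contain an underscore, so str.isalpha() alone would drop them
    return [t for t in tokens if t.isalpha() or t in (START, END)]

raw = [START] + "plume tracing via model-free reinforcement learning method".split() + [END]
print(filter_tokens(raw))
```

Here the hyphenated compound "model-free" disappears from the title instead of being split into two words.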

During this pre-processing step 2073 unique word tokens were found. Since the corpus is not very large, I will try to learn the connections between the most popular words only, so that there are enough samples of each to learn meaningful interconnections. Thus I chose a vocabulary size of 250. The next steps are to find these frequent words, replace the rest with the UNKNOWN token, and build the index_to_word (a mapping from an integer to a word) and word_to_index (vice versa) mappings:

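vocabulary_size = 250
# Count how often each token appears across all tokenized titles
word_freq = nltk.FreqDist(itertools.chain(*tokenized_titles))
# Keep the most frequent tokens and reserve the last index for unknown words
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = {w: i for i, w in enumerate(index_to_word)}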

print("Using vocabulary size %d." % vocabulary_size)
print("The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1]))

The least frequent word in our dictionary of 250 words turned out to be "stable" (7 appearances) and the most frequent "for" (553 appearances).

As a next step I replaced all words not in our vocabulary with the UNKNOWN token:

for i, sent in enumerate(tokenized_titles):
    tokenized_titles[i] = [w if w in word_to_index else unknown_token for w in sent]

So the title "Plume Tracing via Model-Free Reinforcement Learning Method" would look like this after pre-processing:

['TITLE_START', 'UNKNOWN_TOKEN', 'UNKNOWN_TOKEN', 'via', 'reinforcement', 'learning', 'method', 'TITLE_END']
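To make the role of the two mappings concrete, here is a minimal sketch of encoding a tokenized title into integer indices and decoding it back. The five-word vocabulary and the `encode`/`decode` helpers are hypothetical; the real vocabulary holds the 249 most frequent tokens plus the unknown token.

```python
unknown_token = "UNKNOWN_TOKEN"
# Hypothetical tiny vocabulary, with the unknown token at the last index
index_to_word = ["TITLE_START", "TITLE_END", "learning", "via", unknown_token]
word_to_index = {w: i for i, w in enumerate(index_to_word)}

def encode(tokens):
    """Map tokens to indices; out-of-vocabulary tokens map to the unknown index."""
    return [word_to_index.get(t, word_to_index[unknown_token]) for t in tokens]

def decode(indices):
    """Map indices back to tokens."""
    return [index_to_word[i] for i in indices]

tokens = ["TITLE_START", "plume", "tracing", "via", "learning", "TITLE_END"]
ids = encode(tokens)
print(ids)
print(decode(ids))
```

Note that decoding is lossy: every pruned word comes back as the same unknown token, which is why the generation step later refuses to sample it.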

Training

Let's create the training data. I used the KerasBatchGenerator from this blog post to generate the batches to be fed into the LSTMs:

class KerasBatchGenerator(object):

  def __init__(self, data, num_steps, batch_size, vocabulary, skip_step=5):
    self.data = data
    self.num_steps = num_steps
    self.batch_size = batch_size
    self.vocabulary = vocabulary
    # this will track the progress of the batches sequentially through the
    # data set - once the data reaches the end of the data set it will reset
    # back to zero
    self.current_idx = 0
    # skip_step is the number of words which will be skipped before the next
    # batch is skimmed from the data set
    self.skip_step = skip_step

  def generate(self):
    x = np.zeros((self.batch_size, self.num_steps))
    y = np.zeros((self.batch_size, self.num_steps, self.vocabulary))
    while True:
      i = 0
      while i < self.batch_size:
        # Never use a title end token in x as the input for predicting y
        if self.current_idx < len(self.data) and self.data[self.current_idx] == word_to_index[title_end_token]:
          self.current_idx += self.skip_step
        if self.current_idx + self.num_steps >= len(self.data):
          # reset the index back to the start of the data set
          self.current_idx = 0
        x[i, :] = self.data[self.current_idx:self.current_idx + self.num_steps]
        temp_y = self.data[self.current_idx + 1:self.current_idx + self.num_steps + 1]
        # convert all of temp_y into a one hot representation
        y[i, :, :] = to_categorical(temp_y, num_classes=self.vocabulary)
        self.current_idx += self.skip_step
        i += 1
      yield x, y

With the settings used below (num_steps = 1 and batch_size = 10), the generator yields batches of 10 samples, where each sample is a single token used to predict the token that follows it. Each batch consists of two arrays: x, of shape (10, 1), holds 10 token indices, while y, of shape (10, 1, 250), holds the one-hot encodings of the tokens that follow them.

For example, the two arrays are of the form:

[[0.],[122.],[249.],[29.],[3.],[187.],[11.],[0.],[40.],[3.]]

and

[[[0., 0., 0., ..., 0., 0., 0.]],
 [[0., 0., 0., ..., 0., 0., 1.]],
...

Reading the two arrays together:

the START token in the first array, which is 0, is used to predict the one-hot encoded version of 122 (the token that follows it),
the token 122 is used to predict the one-hot encoded version of 249,
the token 249 is used to predict the one-hot encoded version of 29,
and so on.
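The pairing described above can be sketched with NumPy alone. This is a simplified illustration, not the KerasBatchGenerator itself: it builds all pairs at once instead of yielding batches, `np.eye` stands in for Keras' `to_categorical`, and the toy stream and helper names are made up.

```python
import numpy as np

def one_hot(indices, num_classes):
    """One-hot encode a 1-D integer sequence (stand-in for keras to_categorical)."""
    return np.eye(num_classes)[np.asarray(indices)]

def make_pairs(data, end_index, num_classes):
    """Build (x, y) next-token pairs, skipping pairs whose input is the end token."""
    xs, ys = [], []
    for current, nxt in zip(data[:-1], data[1:]):
        if current == end_index:
            continue  # the end of one title should not predict the start of the next
        xs.append(current)
        ys.append(nxt)
    x = np.array(xs, dtype=float).reshape(-1, 1)
    y = one_hot(ys, num_classes).reshape(-1, 1, num_classes)
    return x, y

# Hypothetical toy stream: 0 = start token, 1 = end token, vocabulary of 5 tokens
stream = [0, 3, 4, 1, 0, 2, 1]
x, y = make_pairs(stream, end_index=1, num_classes=5)
print(x.ravel())                   # input token indices
print(y.argmax(axis=-1).ravel())   # target token indices recovered from the one-hot rows
```

The end token still appears as a target, so the model can learn when a title should stop, but it never appears as an input.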

The first 10K tokens are used to generate the training batches, while the remaining 3846 are used for validation. As a note, we never have a sample that uses the END token to predict the next token. Let's create the batch generators:

num_steps = 1
skip_step = 1
batch_size = 10

# set seeds for reproducibility
from numpy.random import seed
seed(123)
# TensorFlow 1.x API; in TensorFlow 2.x the equivalent call is tf.random.set_seed(234)
from tensorflow import set_random_seed
set_random_seed(234)

# Create the training data
# A concatenation of all tokens as integers (indices)
X = list(itertools.chain.from_iterable([word_to_index[w] for w in sent] for sent in tokenized_titles))
# Create 2 batch generators out of the concatenation
train_data_generator = KerasBatchGenerator(X[:10000], num_steps, batch_size, vocabulary_size, skip_step)
# Start at index 10000 so that no token is dropped between the two splits
valid_data_generator = KerasBatchGenerator(X[10000:], num_steps, batch_size, vocabulary_size, skip_step)

Next I create the model:

from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, Dropout, TimeDistributed
from keras.layers import LSTM
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint

hidden_size = 250

model = Sequential()
model.add(Embedding(vocabulary_size, hidden_size, input_length=num_steps))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(Dropout(rate=0.5))
model.add(TimeDistributed(Dense(vocabulary_size)))
model.add(Activation('softmax'))

compile the model:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

and train the model for 10 epochs:

num_epochs = 10

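# Note: in recent Keras versions fit_generator is deprecated and model.fit accepts generators directly
model.fit_generator(train_data_generator.generate(),
                    steps_per_epoch=len(X[:10000])//(batch_size*num_steps),
                    epochs=num_epochs,
                    validation_data=valid_data_generator.generate(),
                    validation_steps=len(X[10000:])//(batch_size*num_steps))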

After training, the validation categorical accuracy was 0.3625, which is of course much better than guessing uniformly at random among the 250 tokens (an accuracy of 1/250 = 0.004).
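To put an accuracy figure like this in context, it helps to compare it against simple baselines. The sketch below computes two of them: uniform random guessing, and always predicting the most frequent training target. The helper names and toy targets are hypothetical, and no baseline numbers for the real corpus are claimed here.

```python
from collections import Counter

def uniform_baseline(vocabulary_size):
    """Expected accuracy of guessing uniformly at random."""
    return 1.0 / vocabulary_size

def unigram_baseline(train_targets, valid_targets):
    """Accuracy of always predicting the most frequent training target."""
    most_common_token, _ = Counter(train_targets).most_common(1)[0]
    hits = sum(1 for t in valid_targets if t == most_common_token)
    return hits / len(valid_targets)

print(uniform_baseline(250))

# Hypothetical toy targets, not the real token stream
train = [5, 5, 7, 5, 9, 7]
valid = [5, 7, 5, 5]
print(unigram_baseline(train, valid))
```

The second baseline is usually the tougher one for language data, since a handful of tokens (the end token, "for", the unknown token) account for a large share of all targets.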

Generating

Now it is time to check the model. We start by feeding the model a START token and keep sampling until there is an END token. We resample if the sampling generates the UNKNOWN token:

def generate_title(model):
    # We start the sentence with the start token
    new_title = [word_to_index[title_start_token]]
    # Repeat until we get an end token
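    while new_title[-1] != word_to_index[title_end_token]:
        x = np.zeros((1,1))
        x[0, :] = new_title[-1]
        # Cast to float64 and renormalize: float32 softmax outputs can sum to
        # slightly more than 1, which np.random.multinomial rejects
        next_word_probs = model.predict(x)[0][0].astype(np.float64)
        next_word_probs /= next_word_probs.sum()
        sampled_word = word_to_index[unknown_token]
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            samples = np.random.multinomial(1, next_word_probs)
            sampled_word = np.argmax(samples)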
        new_title.append(sampled_word)
    title_str = [index_to_word[x] for x in new_title[1:-1]]
    return title_str

num_sentences = 30
senten_min_length = 7
senten_max_length = 15

for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length or len(sent) > senten_max_length:
        sent = generate_title(model)
    print(" ".join(sent))
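As a side note, the resampling loop inside generate_title is a form of rejection sampling. An equivalent alternative, sketched below under my own naming (`sample_without` is not part of the original code), is to zero out the probability of the unwanted token, renormalize, and draw once. Both approaches sample from the same conditional distribution; the masked version simply avoids looping when the unknown token carries a lot of probability mass.

```python
import numpy as np

def sample_without(probs, banned_index, rng):
    """Sample an index from probs after removing banned_index and renormalizing."""
    p = np.asarray(probs, dtype=np.float64).copy()
    p[banned_index] = 0.0
    total = p.sum()
    if total <= 0.0:
        raise ValueError("all probability mass is on the banned token")
    return int(rng.choice(len(p), p=p / total))

# Hypothetical next-token distribution in which index 2 plays the unknown token
rng = np.random.default_rng(0)
probs = [0.1, 0.2, 0.6, 0.1]
draws = [sample_without(probs, banned_index=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

This is only an alternative sketch; the titles listed in this post were produced with the resampling loop shown in generate_title.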

Running the loop in generate_title 30 times, keeping only titles of between 7 and 15 tokens, gave:

a new active systems under control of boolean network for deep noise framework and processes
multiview metric clustering and neural networks approach for heterogeneous systems and noise
learning structure of nonlinear multiagent systems and unknown systems
deep neural networks with adaptive delays of regression
adaptive stochastic models using active learning processes
stability analysis for mimo neural networks with delays
on state estimation of a new iterative learning
a class of online model for a novel recurrent neural network
a controller for feature analysis of neural networks and noise
collaborative quality of and the neural networks a unified
sparse representation of delayed neural network representation with
the feature selection based on a application to stochastic delays via regularization
unified analysis for a deep transfer learning
a network of coupled uncertain delay and application to semisupervised classification
a deep convolutional neural networks with communication constraints and its switched linear multiagent systems
multimodal data for nonlinear systems with adaptive complex networks
optimal delays control of multiple learning for clustering
linear data design of delayed jump neural network for linear systems
memristive generalized efficient estimation for feature selection for modeling for nonlinear systems
a deep convolutional neural dynamic systems with hierarchical
a constrained iterative learning with multiple least the classification
sequential metric learning with a supervised systems with learning
exponential synchronization of communication processes and its switched systems using neural networks
application to mixture of gaussian heterogeneous and time delays
a new control for generalized domain adaptation
robust concept and local method for heterogeneous dynamic programming by
a novel adaptive control of graph analysis for nonlinear kernel convolutional neural networks with delays
optimal control of time regression and its application to features
semisupervised feature optimization and probabilistic matrix learning
markov for and an multiobjective framework with dynamical delay

Even though the network didn't learn any grammar rules, some plausible titles were generated. For example (even though I wouldn't know what it would be about):

adaptive stochastic models using active learning processes

and my favorite:

a novel adaptive control of graph analysis for nonlinear kernel convolutional neural networks with delays

What a mouthful!

References

At this point I should mention that I re-used some code from:

https://adventuresinmachinelearning.com/keras-lstm-tutorial/ (mainly the KerasBatchGenerator)
https://github.com/dennybritz/rnn-tutorial-rnnlm/blob/master/RNNLM.ipynb (Pre-processing and generating text snippets)
