Python – Tensorflow seq2seq chatbots always give the same output


Tensorflow seq2seq chatbots always give the same output

I’m trying to build a seq2seq chatbot with TensorFlow, but it converges to the same output no matter what the input is. The model gives varied outputs right after initialization, then collapses to a single response after a few epochs, and the problem persists even after many epochs with a low cost. The same model does fine when trained on a tiny dataset (around 20 samples), but fails on larger ones.

I’m training on the Cornell Movie Dialogs Corpus with pre-trained GloVe embeddings (100 dimensions, 50,000-word vocabulary).
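The embedding loading code isn’t shown in the question; a minimal sketch of how such a 50,000 × 100 embedding matrix could be built from a GloVe text file (the file name and the special-token handling are assumptions, not taken from the repository):

import numpy as np

EMBED_DIM = 100
VOCAB_LIMIT = 50000
SPECIAL_TOKENS = ["<pad>", "<go>", "<eos>", "<unk>"]  # assumed special tokens

word_to_index = {}
vectors = []

# GloVe has no vectors for the special tokens, so give them small random ones.
for token in SPECIAL_TOKENS:
    word_to_index[token] = len(vectors)
    vectors.append(np.random.uniform(-0.1, 0.1, EMBED_DIM))

with open("glove.6B.100d.txt", encoding="utf-8") as f:  # assumed file name
    for line in f:
        if len(vectors) >= VOCAB_LIMIT:
            break
        parts = line.rstrip().split(" ")
        word_to_index[parts[0]] = len(vectors)
        vectors.append(np.asarray(parts[1:], dtype=np.float32))

# This matrix would back self.word_embedding in the model below.
embedding_matrix = np.stack(vectors).astype(np.float32)  # shape: (50000, 100)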

Given completely different inputs, the encoder’s final states end up very close to each other (differences on the order of 0.01). I’ve tried plain LSTM/GRU, bidirectional LSTM/GRU, multi-layer/stacked LSTM/GRU, and multi-layer bidirectional LSTM/GRU, with anywhere from 16 to 2048 hidden units per cell. The only difference is that with fewer hidden units, the model tends to output nothing but the start and end tags (GO and EOS).
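One way to check this (not the question’s actual code) is to run the encoder on two unrelated batches and compare the final states directly; a sketch, assuming a tf.Session and the attribute names used in the code below (encoder_state, x, x_length):

import numpy as np

def encoder_state_gap(sess, model, batch_a, lengths_a, batch_b, lengths_b):
    """Return, per layer, the max absolute difference between the encoder's
    final states for two batches of (supposedly unrelated) inputs."""
    state_a = sess.run(model.encoder_state,
                       {model.x: batch_a, model.x_length: lengths_a})
    state_b = sess.run(model.encoder_state,
                       {model.x: batch_b, model.x_length: lengths_b})
    # For a MultiRNNCell of GRUs, encoder_state is a tuple with one array per layer.
    return [float(np.max(np.abs(a - b))) for a, b in zip(state_a, state_b)]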

Here is my code for the multi-layer GRU:

cell_encode_0 = tf.contrib.rnn.GRUCell(self.n_hidden)
cell_encode_1 = tf.contrib.rnn.GRUCell(self.n_hidden)
cell_encode_2 = tf.contrib.rnn.GRUCell(self.n_hidden)
self.cell_encode = tf.contrib.rnn.MultiRNNCell([cell_encode_0, cell_encode_1, cell_encode_2])
# the decoder cell (self.cell_decode) is built the same way

...

embedded_x = tf.nn.embedding_lookup(self.word_embedding, self.x)
embedded_y = tf.nn.embedding_lookup(self.word_embedding, self.y)

_, self.encoder_state = tf.nn.dynamic_rnn(
    self.cell_encode,
    inputs=embedded_x,
    dtype=tf.float32,
    sequence_length=self.x_length
    )

# decoder for training
helper = tf.contrib.seq2seq.TrainingHelper(
    inputs=embedded_y,
    sequence_length=self.y_length
    )

decoder = tf.contrib.seq2seq.BasicDecoder(
    self.cell_decode,
    helper,
    self.encoder_state,
    output_layer=self.projection_layer
    )

outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=self.max_sequence, swap_memory=True)

return outputs.rnn_output

...

# Optimization
dynamic_max_sequence = tf.reduce_max(self.y_length)
mask = tf.sequence_mask(self.y_length, maxlen=dynamic_max_sequence, dtype=tf.float32)
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=self.y[:, :dynamic_max_sequence], logits=self.network())
self.cost = (tf.reduce_sum(crossent * mask) / batch_size)
self.train_op = tf.train.AdamOptimizer(self.learning_rate).minimize(self.cost)

The full code is on GitHub (if you want to test it, run train.py).
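Only the training decoder is shown above; the inference path that actually produces the repeated replies is not included here. For context, a typical greedy decoder for this kind of setup (TF 1.x contrib API) might look like the sketch below, where GO_ID, EOS_ID and batch_size are assumed names rather than code from the repository:

# Hypothetical inference decoder (not the question's code): greedily feed back
# the previous prediction, starting from GO_ID and stopping at EOS_ID.
infer_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    embedding=self.word_embedding,
    start_tokens=tf.fill([batch_size], GO_ID),
    end_token=EOS_ID)

infer_decoder = tf.contrib.seq2seq.BasicDecoder(
    self.cell_decode,
    infer_helper,
    self.encoder_state,
    output_layer=self.projection_layer)

infer_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    infer_decoder, maximum_iterations=self.max_sequence)

predicted_ids = infer_outputs.sample_id  # token ids of the generated reply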

As for hyperparameters, I’ve tried learning rates from 0.1 all the way down to 0.0001 and batch sizes from 1 to 32. Apart from the usual, expected effects, they don’t help with the problem.

Solution

After months of in-depth research, I finally found the problem: the RNN needs the GO token in the decoder input, but not in the target you compute the cost against. Basically, a seq2seq model expects its data to be laid out as follows:

Encoder input: GO foo foo foo EOS

Decoder input (ground truth, fed via teacher forcing): GO bar bar bar EOS

Decoder target (used for the cost): bar bar bar EOS EOS/PAD

In my code, I included the GO tag in both the decoder input and the target, which teaches the RNN to simply repeat its input (GO -> GO, bar -> bar). This is easily fixed by creating an extra variable that drops the first column (the GO tag) from the ground truth and appends EOS/PAD at the end. In NumPy, this looks like:

# y is the ground truth with shape (batch, time); EOS is the integer id of the EOS token
np.concatenate([y[:, 1:], np.full([y.shape[0], 1], EOS)], axis=1)
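
Applied to the optimization code above, the fix just means computing the cross-entropy against this shifted sequence instead of self.y, while the decoder input (which still starts with GO) keeps feeding the TrainingHelper. A sketch, where self.y_target is an assumed placeholder holding the shifted targets:

# self.y_target is assumed to hold the shifted targets (bar bar bar EOS EOS/PAD);
# self.y, which still starts with GO, is unchanged and feeds the decoder.
dynamic_max_sequence = tf.reduce_max(self.y_length)
mask = tf.sequence_mask(self.y_length, maxlen=dynamic_max_sequence, dtype=tf.float32)
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=self.y_target[:, :dynamic_max_sequence],  # shifted: no GO column
    logits=self.network())
self.cost = tf.reduce_sum(crossent * mask) / batch_size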
