Lecture 19: Case Study: Text Generation
Introduction to text generation as a case study for deep learning and generative modeling.
Generating Natural Language
Language generation encompasses a wide range of tasks, including dialog systems, machine translation, and captioning. The most common approach is to compute the probability of a sentence as a product of conditional probabilities over its sequence of tokens and to optimize the parameters with maximum likelihood estimation (MLE):
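In assumed notation (these symbols are introduced here for illustration), with $\mathbf{x}$ the input (if any), $\mathbf{y}^* = (y^*_1, \dots, y^*_T)$ the ground-truth sequence, and $\theta$ the model parameters, the factorization and the MLE objective can be written as:

```latex
\[
p_\theta(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} p_\theta(y_t \mid \mathbf{y}_{1:t-1}, \mathbf{x}),
\qquad
\mathcal{L}_{\text{MLE}}(\theta) = \log p_\theta(\mathbf{y}^* \mid \mathbf{x})
= \sum_{t=1}^{T} \log p_\theta(y^*_t \mid \mathbf{y}^*_{1:t-1}, \mathbf{x}).
\]
```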
This widely used training paradigm, while practically effective, inevitably suffers from two fundamental problems:
- Exposure bias: The model is trained to generate the next word conditioned on the ground-truth previous words. At test time, however, the model predicts an entire sequence one word at a time, conditioned on the words it has generated itself. Because the model is never exposed to its own predictions during training, errors made along the way quickly accumulate.
- Discrepancy between training objectives and evaluation metrics: The model is trained to maximize data log-likelihood but is evaluated using metrics like BLEU or ROUGE.
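The difference between the training-time and test-time conditioning can be illustrated with a minimal sketch. The bigram table below is a hypothetical toy model, not something from the lecture: training scores each token given the ground-truth prefix (teacher forcing), while generation feeds the model's own output back in.

```python
import math

# Hypothetical toy bigram "model": P(next token | previous token).
BIGRAM = {
    "<s>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.5, "</s>": 0.5},
    "sat": {"</s>": 1.0},
}

def teacher_forced_log_prob(tokens):
    """Sum of log P(y_t | previous ground-truth token): the MLE training signal."""
    prev, total = "<s>", 0.0
    for tok in tokens:
        total += math.log(BIGRAM[prev][tok])
        prev = tok  # always the ground-truth token, never the model's own guess
    return total

def greedy_decode(max_len=10):
    """Free-running generation: each step conditions on the model's own previous output."""
    prev, out = "<s>", []
    for _ in range(max_len):
        nxt = max(BIGRAM[prev], key=BIGRAM[prev].get)
        if nxt == "</s>":
            break
        out.append(nxt)
        prev = nxt  # feed the prediction back in
    return out

reference = ["the", "dog", "sat", "</s>"]
train_score = teacher_forced_log_prob(reference)
generated = greedy_decode()
```

Here the reference contains "dog", but greedy decoding follows the model's own preference for "cat", so every later step is conditioned on a prefix the model never saw paired with this reference during training.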
To bridge the gap between the training objective and test-time behavior, a reinforcement learning approach is proposed to maximize expected reward under the model distribution
At training time, the sequences are directly sampled from the model distribution
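In assumed notation (the reward $R$ and the reference $\mathbf{y}^*$ are symbols introduced here for illustration), the expected-reward objective and its standard score-function (REINFORCE) gradient can be sketched as:

```latex
\[
\mathcal{L}_{\text{RL}}(\theta)
= \mathbb{E}_{\mathbf{y} \sim p_\theta(\mathbf{y} \mid \mathbf{x})}\big[ R(\mathbf{y} \mid \mathbf{y}^*) \big],
\qquad
\nabla_\theta \mathcal{L}_{\text{RL}}(\theta)
= \mathbb{E}_{\mathbf{y} \sim p_\theta(\mathbf{y} \mid \mathbf{x})}\big[ R(\mathbf{y} \mid \mathbf{y}^*) \, \nabla_\theta \log p_\theta(\mathbf{y} \mid \mathbf{x}) \big].
\]
```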
However, due to the enormous space of possible word sequences, this method has high variance and poor exploration efficiency during training. Several methods have been proposed to make training more efficient:
- Reward augmented maximum likelihood (RAML): Add reward-aware perturbation to the MLE data examples.
- Softmax policy gradient (SPG): Use the reward distribution for effective sampling and for estimating the policy gradient.
- Data noising: Add random noise to the data.
Although they seem different, another paper shows that these methods are all special cases of a single framework, described next.
Generalized Entropy Regularized Policy Optimization
The generalized ERPO objective is as follows:
where
- We impose supervision $R$ on $q$, so that $q$ is optimized to gain more reward.
- The KL divergence forces the model $p$ to stay close to the variational distribution $q$.
- An additional entropy regularizer is applied to $q$.
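Putting the three terms above together, the objective can be sketched as below. The weights $\alpha$ and $\beta$, the reference $\mathbf{y}^*$, and the exact notation are assumptions made here for illustration and should be checked against the ERPO paper.

```latex
\[
\mathcal{L}(q, \theta) =
\mathbb{E}_{q}\big[ R(\mathbf{y} \mid \mathbf{y}^*) \big]
- \alpha \, \mathrm{KL}\big( q(\mathbf{y} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{y} \mid \mathbf{x}) \big)
+ \beta \, \mathrm{H}(q)
\]
```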
This objective can be solved using the EM algorithm. For iteration
- In E-step,
- In M-step,
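As a sketch, assume the objective has the form $\mathbb{E}_q[R] - \alpha\,\mathrm{KL}(q \,\|\, p_\theta) + \beta\,\mathrm{H}(q)$ with non-negative weights $\alpha$ and $\beta$ (these symbols and the iteration index $n$ are assumptions, not taken from the notes). Maximizing over $q$ with $\theta$ fixed, and then over $\theta$ with $q$ fixed, gives:

```latex
\begin{align*}
\text{E-step:} \quad & q^{n+1}(\mathbf{y} \mid \mathbf{x}) \propto
\exp\left\{ \frac{\alpha \log p_{\theta^{n}}(\mathbf{y} \mid \mathbf{x}) + R(\mathbf{y} \mid \mathbf{y}^*)}{\alpha + \beta} \right\} \\
\text{M-step:} \quad & \theta^{n+1} = \arg\max_{\theta} \; \mathbb{E}_{q^{n+1}}\big[ \log p_\theta(\mathbf{y} \mid \mathbf{x}) \big]
\end{align*}
```

The E-step follows because the terms involving $q$ collect into $\mathbb{E}_q[R + \alpha \log p_\theta] + (\alpha + \beta)\,\mathrm{H}(q)$, whose maximizer is a Gibbs distribution; the M-step keeps only the part of the KL term that depends on $\theta$.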
The intuition is that when
MLE, Data Noising, RAML and SPG as special cases of the ERPO framework
The ERPO framework has three key hyperparameters, the reward function
MLE as a Special Case of ERPO
The
By setting
- E-step:
- M-step:
The E-step corresponds to empirical data distribution while the M-step corresponds to maximum likelihood estimation.
Data Noising as a Special Case of ERPO
Data noising can be thought of as computing the MLE over data that is close to the observed data.
For that assume that
In the case that
Reward Augmented Maximum Likelihood (RAML) as a Special Case of ERPO
For the rest of this section we let
- E-step:
- M-step:
is exactly the objective that RAML tries to maximize.
Softmax Policy Gradient as a Special Case of ERPO
By setting
- E-step:
- M-step: which corresponds to SPG.
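To make the special cases concrete, here is a minimal sketch of an E-step over a small finite candidate set. It assumes the E-step has the form $q(\mathbf{y}) \propto \exp\{(\alpha \log p_\theta(\mathbf{y}) + R(\mathbf{y})) / (\alpha + \beta)\}$. The weight names `alpha` and `beta`, the per-method settings in the comments, and all numbers are assumptions for illustration, not values from the lecture.

```python
import math

def e_step(log_p, reward, alpha, beta):
    """Return q with q(y) proportional to exp((alpha*log p(y) + R(y)) / (alpha + beta)).

    Works over a finite list of candidates and requires alpha + beta > 0.
    """
    scores = [(alpha * lp + r) / (alpha + beta) for lp, r in zip(log_p, reward)]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

# Three hypothetical candidate outputs; index 0 is the observed sequence y*.
log_p = [math.log(0.2), math.log(0.5), math.log(0.3)]  # current model p_theta
delta_reward = [1.0, -1e9, -1e9]  # 1 for y*, a huge negative number standing in for -infinity
task_reward = [1.0, 0.6, 0.1]     # e.g. a BLEU-like score against y*

# Assumed settings; alpha = 0.0 stands in for the limit alpha -> 0.
q_mle = e_step(log_p, delta_reward, alpha=0.0, beta=1.0)   # all mass on the observed data
q_raml = e_step(log_p, task_reward, alpha=0.0, beta=0.5)   # exp(R / temperature), ignores the model
q_spg = e_step(log_p, task_reward, alpha=1.0, beta=0.0)    # p_theta(y) * exp(R(y)), uses the model
```

In this sketch `q_mle` and `q_raml` do not depend on `log_p`, so repeating the E-step would not change them, whereas `q_spg` changes whenever the model does.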
Comparing MLE, Data Noising, RAML and SPG
First, observe that MLE, data noising, and RAML each consist of a single execution of the E-M steps.
MLE uses
RAML allows the whole exploration space. Data points that are close to the observed data are more likely to be examined. However, because RAML consists of a single execution of the E-M steps, it “does not have time” to integrate any new information and reconsider the probabilities with which it samples new data.
Finally, SPG also allows the whole exploration space. In addition, the E-M steps are executed multiple times. As a result, it integrates new information obtained through exploration (here we are using the fact that
Interpolation Algorithm
In the above we saw that by choosing the proper reward function we can discourage (e.g. by setting
The interpolation algorithm tries to combine the advantages of a small and a large exploration space. It exploits the natural idea of starting with a more restricted exploration space and gradually expanding it as time progresses. By varying the hyperparameters of the ERPO framework, it starts by approximating the MLE solution. Then, as time progresses, it behaves more and more like
Conditional Generation
The goal is to generate text that contains desired information inferred from the inputs. How we generate text depends on the amount of data available for the task. When the training data is sufficient, we can use end-to-end training (e.g., sequence-to-sequence models, transformers).
Examples (# training examples)
- Machine Translation (10s of millions)
- Data Description (10s of thousands)
However, for certain tasks, we do not have enough data for supervised training.
Examples:
- Attribute control
- Conversation control
In such cases, we consider controlled text generation in an unsupervised setting.
Text attribute transfer
The goal is to modify attribute values (e.g., sentiment, tense) while keeping all other aspects unchanged.
E.g., transfer sentiment from negative to positive:
- Original: “It was super dry and had a weird taste to the entire slice.”
- Output: “It was super fresh and had a delicious taste to the entire slice.”
Set up
- original sentence $\mathbf{x}$
- original attribute $\mathbf{a}_x$
- target sentence $\mathbf{y}$
- target attribute $\mathbf{a}_y$
Task:
- $\mathbf{y}$ has attribute $\mathbf{a}_y$
- $\mathbf{y}$ shares all attribute-independent properties of $\mathbf{x}$
Use an encoder-decoder architecture to model:
Sub-objectives
We jointly optimize two competing sub-objectives, both using a cross-entropy loss: one for generating a sentence close to the original, and another that ensures the sentiment of the generated sentence is correct (using a pretrained sentiment classifier).
- Auto-encoding loss:
- Classification loss:
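As a rough illustration of how two cross-entropy terms can be combined, here is a minimal sketch on plain probability lists. The function names, the `weight` parameter, and the toy numbers are hypothetical; in a real model these probabilities would come from the decoder and from the pretrained classifier, and gradients would flow through the decoder.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the target index."""
    return -math.log(probs[target])

def auto_encoding_loss(step_probs, original_tokens):
    """Sum of per-token cross-entropies for reconstructing the original sentence."""
    return sum(cross_entropy(p, t) for p, t in zip(step_probs, original_tokens))

def classification_loss(classifier_probs, target_attribute):
    """Cross-entropy of the attribute classifier's prediction on the generated sentence."""
    return cross_entropy(classifier_probs, target_attribute)

def joint_loss(step_probs, original_tokens, classifier_probs, target_attribute, weight=1.0):
    """Weighted sum of the two competing sub-objectives (weight is an assumed knob)."""
    return (auto_encoding_loss(step_probs, original_tokens)
            + weight * classification_loss(classifier_probs, target_attribute))

# Toy example: a 3-word vocabulary, a 2-token sentence, and 2 sentiment classes.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # decoder distribution at each step
original_tokens = [0, 1]                         # indices of the original words
classifier_probs = [0.25, 0.75]                  # classifier output on the generated sentence
loss = joint_loss(step_probs, original_tokens, classifier_probs, target_attribute=1)
```

The two terms pull in different directions: copying the input minimizes the first term, while changing sentiment-bearing words is needed to minimize the second.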
Results
- Sentiment classification accuracy: 92%
- BLEU metric (measured against input sentence): 54
- LM perplexity: 239.8
Although the sentiment modification is successful (high classification accuracy) and the output sentence keeps non-attribute aspects similar to the original (high BLEU score), the sentences do not make much sense (high perplexity, i.e., the language model finds them unlikely). See the example below, which shows the quality of the generated sentences.
Solution: add a language model (LM) as a discriminator with the objective
Results
- Sentiment classification accuracy: 91%
- BLEU metric: 57
- LM perplexity: 60.9
The generated sentences have similar accuracy and BLEU scores but a much lower (better) perplexity. See the example below.
- Original: “uncle george is very friendly to each guest”
- Output: “uncle george is very lackluster to each guest”
- with LM: “uncle george is very rude to each guest”
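For reference, perplexity is the exponential of the average negative log-probability per token, so lower is better. The sketch below uses made-up token probabilities, not numbers from the lecture, to show how a single unlikely word (such as “lackluster” in the example above) raises the perplexity of an otherwise fluent sentence.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-probability per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-token probabilities assigned by a language model.
fluent = [math.log(0.25)] * 8                        # every token is fairly likely
awkward = [math.log(0.25)] * 7 + [math.log(0.001)]   # one very unlikely token

ppl_fluent = perplexity(fluent)    # close to 4, since every token has probability 0.25
ppl_awkward = perplexity(awkward)  # larger, because of the single unlikely token
```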
Text Content Manipulation
Another task is text content manipulation. The goal is to generate a new realistic sentence
There is no direct supervision data for this task. As in the previous text attribute transfer task, we still use the key idea proposed in
Model
Example training data and the model are shown below.
Let
Competing Learning Objectives
Here we detail the two competing sub-objectives. First, we make use of the side information
This is the content fidelity objective.
The second goal is to preserve the style of reference
We call it the style preservation objective. The objective essentially treats the reference sentence encoder and the decoder together as an auto-encoding module, and effectively drives the model to absorb the characteristics of the reference sentence and apply them to the generated one.
The above two objectives are coupled together and train the model to achieve the desired goals:
where
Content Coverage Constraint
The above model generally performs well, but it sometimes still fails to express the desired content accurately. An additional learning constraint is therefore devised based on the nature of content description: each data tuple in the content record should usually be mentioned exactly once in the generated sentence. This constraint on
where
The full training objective of the proposed model with the constraint is thus written as:
where
Below are some sample outputs of this method and of other methods.
Text of erroneous content is highlighted in red, where […] indicates desired content is missing. Text portions in the reference sentences and the generated sentences by our model that fulfill the stylistic characteristics are highlighted in blue.
We can see that other methods tend to keep irrelevant content originally in the reference sentence (e.g., “and 5 rebounds” in the second case), or miss necessary information in the record (e.g., one of the player names was missed in the third case). The proposed model performs better in properly adding or deleting text portions for accurate content descriptions.
Below are some quantitative and human evaluation results, compared with other models.
From the automatic evaluation results, we can find the model
Target-guided Open-domain Conversation
Conversation systems can be classified into the three classes below:
- Task-oriented dialog: addresses a specific task, typically in a closed domain, e.g., a service bot for booking a flight in the flight-service domain.
- Open-domain chit-chat: aims to improve user engagement; the conversation is random and hard to control, e.g., Apple Siri and Amazon Alexa.
- Target-guided conversation: sits between the previous two; the conversation is open-domain, and we want to control the conversation strategy so that it reaches a desired topic by the end. It can bridge task-oriented dialog and open-domain chit-chat, with applications such as conversational recommender systems, education, and psychotherapy.
Here we discuss the target-guided conversation task. In this task, we want the agent to start from any topic and reach a desired topic by the end of the conversation, with smooth and natural transitions. A successful example of target-guided conversation is shown below: the agent moves the conversation from the starting topic “tired” to the desired topic “e-books” through natural conversation and smooth transitions.
The challenge of the open-domain target-guided conversation task is that there is no direct supervised data for it, so the conversation generation is entirely unsupervised. To address this, we again use the previous idea, i.e., decompose the task into competing sub-objectives and apply supervised methods to each sub-objective. Specifically, we have to achieve the subgoal that the conversation is natural and smooth and the subgoal that it reaches the desired topic in the end. To make the conversation smooth and natural, we use chit-chat data to learn smooth single-turn transitions; to reach the desired target topic, we use rule-based multi-turn planning to restrict which conversation topics the agent may choose at the next step.
Here is a diagram showing how the target-guided open-domain conversation agent works.
As the diagram shows, given a human utterance, we first extract the keyword(s), then choose the next conversation topic(s) based on the learned kernel-based topic transition and the target-guided rule, and finally retrieve the response for the next step conditioned on the chosen keyword(s). When choosing the keyword(s) for the next step, we can tune the relative weight of the two subgoals to control how important it is to achieve a smooth topic transition versus getting closer to the keyword(s) of the target topic(s).
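The keyword-selection step can be sketched as follows. This is a simplified illustration under stated assumptions, not the system's actual implementation: the transition scores stand in for the learned topic-transition model, closeness to the target is measured by cosine similarity of hypothetical 2-D keyword embeddings, the rule is taken to be "the next keyword must be strictly closer to the target than the current one", and `weight` is the tunable trade-off between the two subgoals.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def choose_next_keyword(current, target, transition_scores, embeddings, weight=0.5):
    """Pick the next keyword, trading off transition smoothness against target closeness.

    transition_scores: hypothetical scores for moving from `current` to each candidate.
    Returns None if no candidate is strictly closer to the target than `current`.
    """
    current_closeness = cosine(embeddings[current], embeddings[target])
    best, best_score = None, -math.inf
    for candidate, transition in transition_scores.items():
        closeness = cosine(embeddings[candidate], embeddings[target])
        if closeness <= current_closeness:
            continue  # assumed target-guided rule: must move toward the target
        score = (1 - weight) * transition + weight * closeness
        if score > best_score:
            best, best_score = candidate, score
    return best

# Made-up embeddings and transition scores for illustration only.
embeddings = {
    "tired": [1.0, 0.0], "sleep": [0.9, 0.3], "work": [1.0, -0.2],
    "book": [0.4, 0.8], "e-books": [0.0, 1.0],
}
from_tired = {"sleep": 0.6, "book": 0.3, "work": 0.1}

balanced = choose_next_keyword("tired", "e-books", from_tired, embeddings, weight=0.5)
smooth_first = choose_next_keyword("tired", "e-books", from_tired, embeddings, weight=0.1)
```

With a balanced weight the sketch jumps to "book", which is close to the target; with a small weight it prefers the smoother transition to "sleep". The candidate "work" is never chosen because it moves away from the target.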
On the left below is a successful example where the target topic is “dance”; the transition is smooth and the conversation is natural. On the right is an example where the agent fails to steer the conversation to the desired topic “listen”.
In summary, for all three tasks, we can decompose the task into competing sub-objectives and jointly train on the sub-objectives with direct supervision.
Summary
We have two central goals for text generation tasks.
- Generating human-like, grammatical, and readable text, i.e., generating natural language.
- Generating text that contains desired information inferred from inputs.
  - Machine translation: Source sentence –> target sentence w/ the same meaning
  - Data description: Table –> data report describing the table
  - Attribute control: Sentiment: positive –> “I like this restaurant”
  - Conversation control: Control conversation strategy and topic
For supervised tasks where there is plenty of supervised training data, we can use sequence modeling and train end-to-end; for unsupervised tasks where there is no supervised data, we can decompose the task into competing sub-objectives and train each sub-objective with supervision.
References
- Sequence Level Training with Recurrent Neural Networks
Ranzato, M., Chopra, S., Auli, M. and Zaremba, W., 2016. 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
- Reward Augmented Maximum Likelihood for Neural Structured Prediction
Norouzi, M., Bengio, S., Chen, Z., Jaitly, N., Schuster, M., Wu, Y. and Schuurmans, D., 2016. Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 1723–1731.
- Cold-Start Reinforcement Learning with Softmax Policy Gradient
Ding, N. and Soricut, R., 2017. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 2814–2823.
- Data Noising as Smoothing in Neural Network Language Models
Xie, Z., Wang, S.I., Li, J., Lévy, D., Nie, A., Jurafsky, D. and Ng, A.Y., 2017. 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
- Connecting the Dots Between MLE and RL for Sequence Generation
Tan, B., Hu, Z., Yang, Z., Salakhutdinov, R. and Xing, E., 2018. CoRR, Vol abs/1811.09740.
- Toward controlled generation of text
Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R. and Xing, E.P., 2017. Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1587–1596.
- Unsupervised text style transfer using language models as discriminators
Yang, Z., Hu, Z., Dyer, C., Xing, E.P. and Berg-Kirkpatrick, T., 2018. Advances in Neural Information Processing Systems, pp. 7298–7309.
- Toward Unsupervised Text Content Manipulation
Wang, W., Hu, Z., Yang, Z., Shi, H., Xu, F. and Xing, E., 2019. arXiv preprint arXiv:1901.09501.
- Content Preserving Text Generation with Attribute Controls
Logeswaran, L., Lee, H. and Bengio, S., 2018. Advances in Neural Information Processing Systems, pp. 5108–5118.
- Sequence to sequence learning with neural networks
Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Advances in Neural Information Processing Systems, pp. 3104–3112.
- Multiple-attribute Text Rewriting
Subramanian, S., Lample, G. and Boureau, Y., 2019. International Conference on Learning Representations.