You Don't Know My Favorite Color: Preventing Dialogue Representations from Revealing Speakers' Private Personas
Apr 26 2022
Social chatbots, also known as chit-chat chatbots, are evolving rapidly with large
pretrained language models. Despite this progress, privacy concerns have
arisen recently: training data of large language models can be extracted via
model inversion attacks. Moreover, the datasets used to train
chatbots contain many private conversations between two individuals. In this
work, we investigate privacy leakage from the hidden states of
chatbots trained with language modeling, a threat that has not yet been well studied. We
show that speakers' personas can be inferred from these hidden states by a simple
neural network with high accuracy. To address this, we propose effective defense
objectives that protect personas from leaking through hidden states. We conduct
extensive experiments to demonstrate that our proposed defense objectives greatly
reduce the attack accuracy, from 37.6% to 0.5%, while preserving the
language models' strong generation ability.
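To make the attack setting concrete, below is a minimal sketch of a probe-style persona classifier operating on a chatbot's hidden states. This is not the paper's actual attack model: the `PersonaProbe` architecture, the pooling strategy, and the dimensions (`hidden_dim=768`, `num_personas=100`) are all illustrative assumptions.

```python
# Minimal sketch of a persona-inference probe on dialogue hidden states.
# All names and dimensions are hypothetical; the paper's attack model and
# training pipeline may differ.
import torch
import torch.nn as nn


class PersonaProbe(nn.Module):
    """A simple MLP that maps pooled hidden states to a persona label."""

    def __init__(self, hidden_dim: int, num_personas: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_personas),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the chatbot.
        # Mean-pool over the sequence, then classify.
        pooled = hidden_states.mean(dim=1)
        return self.net(pooled)


# Toy usage with random tensors standing in for real chatbot states.
probe = PersonaProbe(hidden_dim=768, num_personas=100)
fake_states = torch.randn(4, 32, 768)  # batch of 4 dialogue representations
logits = probe(fake_states)            # (4, 100) persona scores
print(logits.argmax(dim=-1))           # predicted persona ids
```

An attacker with access to intermediate representations would train such a probe on dialogues with known persona labels; the abstract's reported 37.6% attack accuracy suggests exactly this kind of lightweight classifier suffices before the proposed defense is applied.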