Feb 27 2024

Large Language Models (LLMs) have demonstrated remarkable capabilities across
various tasks. However, they sometimes suffer from hallucinations, in
particular generating untruthful responses despite possessing the correct
knowledge. In this paper, we propose TruthX, an
inference-time method to elicit the truthfulness of LLMs by editing their
internal representations in truthful space. TruthX employs an auto-encoder to
map the LLM's representations into semantic and truthful latent spaces,
respectively, and applies contrastive learning to identify a truthful editing
direction within the truthful space. During inference, TruthX edits the LLM's
internal representations along this direction in the truthful space, thereby
enhancing its truthfulness. Experiments show that TruthX improves the
truthfulness of 13 advanced LLMs by an average of 20% on the TruthfulQA benchmark.
Further analyses suggest that the truthful space acquired by TruthX plays a
pivotal role in controlling whether the LLM produces truthful or hallucinatory responses.
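To make the described mechanism concrete, the sketch below illustrates one plausible way such inference-time editing could be wired up in PyTorch. It is not the TruthX implementation: the class and parameter names (`TruthfulEditor`, `semantic_encoder`, `truthful_encoder`, `edit_direction`, `strength`) are hypothetical, and the editing direction here is a learnable placeholder rather than one derived from contrastive training on truthful/untruthful pairs as the abstract describes.

```python
# A minimal, illustrative sketch of editing an LLM hidden state in a learned
# "truthful" latent space. All names are hypothetical, not from TruthX.
import torch
import torch.nn as nn


class TruthfulEditor(nn.Module):
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        # Two encoders map the same LLM hidden state into separate latent
        # spaces: one intended to capture semantics, one truthfulness.
        self.semantic_encoder = nn.Linear(hidden_dim, latent_dim)
        self.truthful_encoder = nn.Linear(hidden_dim, latent_dim)
        # A decoder reconstructs the hidden state from a latent code.
        self.decoder = nn.Linear(latent_dim, hidden_dim)
        # Editing direction in the truthful latent space; in the paper's setup
        # this would come from contrastive learning, here it is a placeholder.
        self.edit_direction = nn.Parameter(torch.randn(latent_dim))

    @torch.no_grad()
    def edit(self, hidden: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
        """Shift a hidden state along the truthful editing direction."""
        z_truth = self.truthful_encoder(hidden)
        direction = self.edit_direction / self.edit_direction.norm()
        z_edited = z_truth + strength * direction
        # Apply the decoded difference as a residual update to the hidden state.
        delta = self.decoder(z_edited) - self.decoder(z_truth)
        return hidden + delta


# Usage: edit one layer's hidden states during inference,
# e.g. from a forward hook registered on that layer.
editor = TruthfulEditor(hidden_dim=4096, latent_dim=1024)
h = torch.randn(1, 8, 4096)              # [batch, seq_len, hidden_dim]
h_truthful = editor.edit(h, strength=2.0)
```

In this sketch the `strength` scalar controls how far the representation is pushed along the editing direction, which mirrors the abstract's claim that the truthful space can steer the model toward truthful or hallucinatory responses; the actual layers edited and the way the direction is obtained are specified in the paper, not here.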