Eliciting Secret Knowledge from Language Models
Best AI papers explained - En podcast av Enoch H. Kang

Kategorier:
This academic paper investigates the critical challenge of eliciting secret knowledge from Large Language Models (LLMs) that have been intentionally trained to possess and conceal specific information. The researchers created a controlled testbed with three "secret-keeping" LLMs—Taboo, Secret Side Constraint (SSC), and User Gender—each hiding a different type of fact. They evaluated various black-box techniques, such as prefill attacks and user persona sampling, and white-box techniques, including Logit Lens and Sparse Autoencoders (SAEs), to see which methods most successfully enabled an auditor LLM to guess the secret. The findings demonstrate that both black-box prefilling methods and white-box mechanistic interpretability tools significantly improve the auditor's success rate in uncovering the models' hidden knowledge. The authors conclude by open-sourcing their code and models to establish a public benchmark for future AI safety research in this area.