alignment tax

From Wiktionary, the free dictionary



First attested in a 2019 speech by computer scientist Paul Christiano (see quote), who attributed the idea to AI researcher and writer Eliezer Yudkowsky.


alignment tax (plural alignment taxes)

  1. (artificial intelligence) A cost to the capabilities of an artificial intelligence resulting from the effects of aligning it with human ethics and morality. [from late 2010s]
    • 2019 August 29, Paul Christiano, Current work in AI alignment[1], EA Global San Francisco 2019:
      I like this notion of an "alignment tax" […] the reason I might compromise is if there's some tension, between having the AI that's robustly trying to do what I want, and having the AI that is competent or intelligent, and the alignment tax is intended to capture that gap—that cost that I incur if I insist on alignment.
    • 2021 December 1, Askell, A. et al., “A General Language Assistant as a Laboratory for Alignment”, in arXiv[2], →DOI:
      The fact that larger models are less subject to forgetting may be related to the fact that larger models do not incur significant alignment taxes.
    • 2022 March 4, Ouyang, L. et al., “Training language models to follow instructions with human feedback”, in arXiv[3], →DOI:
      We want an alignment procedure that avoids an alignment tax, because it incentivizes the use of models that are unaligned but more capable on these tasks.
    • 2023 February 27, Kornai, A. et al., “Safety without alignment”, in arXiv[4], →DOI:
      We note that instead of an alignment tax our proposal entails a safety dividend – the more rational the system the more capable and the safer it will be.