User:Mzajac/Language attributes

This is a brainstorming page for adding standard HTML language metadata (e.g., lang="uk" xml:lang="uk") to the workings of script templates.

This would require that {{term}}, {{t}}, {{infl}}, {{form of}}, and their variants provide a lang=xx parameter to the script template. The result would be HTML language metadata in all of their output.


Lang and xml:lang are standard HTML metadata attributes for identifying the language of an element's content. They can be used to style text using CSS, possibly to supplement or replace the classes used in script templates. According to the HTML 4.01 specification, situations where language information may be helpful include assisting search engines, assisting speech synthesizers, helping a user agent select glyph variants for high quality typography, helping a user agent choose a set of quotation marks, helping a user agent make decisions about hyphenation, ligatures, and spacing, and assisting spell checkers and grammar checkers. HTML 5 says that the web browser may use the element's language, e.g., in the selection of appropriate fonts or pronunciations, or for dictionary selection.

An example application would be the use of standardized CSS selectors to style any language, rather than depending on the few classes defined in Wiktionary. For example, to use spaced small caps instead of italics for Ukrainian:

i:lang(uk) {
    font-size: .75em;
    text-transform: uppercase;
    letter-spacing: .1em;
}

Accessibility guidelines stress the importance of indicating language. “Clearly identify changes in the natural language of a document's text and any text equivalents (e.g., captions). [Priority 1]” (WCAG 1.0, 1999). “The human language of each passage or phrase in the content can be programmatically determined except for proper names, technical terms, words of indeterminate language, and words or phrases that have become part of the vernacular of the immediately surrounding text. (Level AA)” (WCAG 2.0, 2008).

Wiktionary should strive to provide language metadata.


{{Cyrl}} is used as the example. The essential working code for most script templates looks like the snippet below. ({{Cyrl}} actually has some #switch code which can turn the span into an i or b, and a deprecated .RU class, both ignored here for clarity):

 <span class="Cyrl">{{{1}}}</span>

Show language: the template accepts a lang attribute from its parent (which may be one of {{term}}, {{t}}, {{infl}}, {{form of}}). There is no need to send en, since English is the English-language Wiktionary's primary language, set in an entry's top-level HTML element:

 <span class="Cyrl" lang="{{{lang}}}" xml:lang="{{{lang}}}">{{{1}}}</span>

But don't include empty lang attributes when no language is passed:

 <span class="Cyrl" {{ #if: {{{lang|}}} | lang="{{{lang}}}" xml:lang="{{{lang}}}"}}>{{{1}}}</span>

Indicate the script using a script subtag, like lang="ru-Cyrl":

 <span class="Cyrl" {{ #if: {{{lang|}}} | lang="{{{lang}}}-Cyrl" xml:lang="{{{lang}}}-Cyrl"}}>{{{1}}}</span>

Default script: the script really shouldn't be indicated if it is the usual one for the particular language. So we should see lang="ru", but lang="sr-Cyrl" or lang="sr-Latn" (Russian is normally written in Cyrillic, but Serbian is in both Cyrillic and Latin, so it should have the script specified). This can be handled by a switch statement which filters the default languages:

 <span class="Cyrl" {{ #if: {{{lang|}}} | {{#switch: {{{lang}}}
 | ab | be | bg | kk | mk | ru | uk = lang="{{{lang}}}" xml:lang="{{{lang}}}"
 | #default = lang="{{{lang}}}-Cyrl" xml:lang="{{{lang}}}-Cyrl"
 }} }}>{{{1}}}</span>

Alternate languages: for the sake of documentation, explicitly list the languages which commonly need the script code, ahead of the #default fallback:

 <span class="Cyrl" {{ #if: {{{lang|}}} | {{#switch: {{{lang}}}
 | ab | be | bg | kk | mk | ru | uk = lang="{{{lang}}}" xml:lang="{{{lang}}}"
 | az | bs | mn | sr | tg | uz | #default = lang="{{{lang}}}-Cyrl" xml:lang="{{{lang}}}-Cyrl"
 }} }}>{{{1}}}</span>
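The default-script filter can also be modelled outside the wiki, which makes it easier to check expected output for many inputs at once. The following is a minimal Python sketch, not the template itself: the names cyrl_span and CYRL_DEFAULT are hypothetical, and the language list is simply copied from the #switch above.

```python
# Languages for which Cyrillic is the usual script, so no script subtag is
# added (copied from the first branch of the #switch above).
CYRL_DEFAULT = {"ab", "be", "bg", "kk", "mk", "ru", "uk"}


def cyrl_span(text, lang=""):
    """Model the HTML that the draft {{Cyrl}} would emit (illustrative only)."""
    text = text.strip()
    lang = lang.strip()  # MediaWiki trims named parameters; mirrored here
    if not lang:
        attrs = ""  # missing or empty lang: emit no lang attributes at all
    elif lang in CYRL_DEFAULT:
        attrs = f'lang="{lang}" xml:lang="{lang}"'
    else:
        # az, bs, mn, sr, tg, uz and the #default fallback get the script subtag
        tag = f"{lang}-Cyrl"
        attrs = f'lang="{tag}" xml:lang="{tag}"'
    return f'<span class="Cyrl" {attrs}>{text}</span>'
```

Like the draft template, this model does no validation: passing lang=sr-Cyrl yields lang="sr-Cyrl-Cyrl".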


The test template is currently at user:Mzajac/Language attributes/Cyrl. The HTML code is made visible by entering < as &lt;.

No lang
{{Cyrl |слово }}
<span class="Cyrl" >слово</span>
Empty lang
{{Cyrl |слово |lang= }}
<span class="Cyrl" >слово</span>
A default language for Cyrl
{{Cyrl |слово |lang=uk }}
<span class="Cyrl" lang="uk" xml:lang="uk">слово</span>
An ambiguous language for Cyrl
{{Cyrl |слово |lang=sr }}
<span class="Cyrl" lang="sr-Cyrl" xml:lang="sr-Cyrl">слово</span>
An undefined language
{{Cyrl |слово |lang=und }}
<span class="Cyrl" lang="und-Cyrl" xml:lang="und-Cyrl">слово</span>
{{Cyrl |слово |lang=sr-Cyrl }}
<span class="Cyrl" lang="sr-Cyrl-Cyrl" xml:lang="sr-Cyrl-Cyrl">слово</span>
{{Cyrl |слово |lang=NONSENSE! }}
<span class="Cyrl" lang="NONSENSE!-Cyrl" xml:lang="NONSENSE!-Cyrl">слово</span>

What's the best way to deal with incorrect input? Should we add comprehensive error-checking? Should erroneous input be corrected, dropped silently, or reported with an error message?

Should the input case be adjusted (EN > en)? This is supposed to be case-insensitive, so changing it is safe, and it may help buggy implementations deal with our content.

If the code grows complex, should it be generalized and maintained in a single template, to be transcluded into any script template?
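One possible answer to the questions about incorrect input and letter case is sketched below in Python. It is only an illustration of the "correct what is safe, silently drop the rest" option, not a decision: the function name normalize_lang and the pattern TAG_RE are hypothetical, and the pattern deliberately accepts only a 2 or 3 letter language subtag with an optional 4 letter script subtag, which is narrower than what real language tags allow.

```python
import re

# Simplified shape: language subtag (2-3 letters), optional script subtag
# (4 letters). Region and variant subtags are NOT handled in this sketch.
TAG_RE = re.compile(r"([A-Za-z]{2,3})(?:-([A-Za-z]{4}))?")


def normalize_lang(raw, script="Cyrl"):
    """Return the bare, lower-cased language subtag, or None to drop the input.

    A script subtag matching the template's own script is stripped, so the
    template can append it once and avoid output such as "sr-Cyrl-Cyrl".
    """
    match = TAG_RE.fullmatch(raw.strip())
    if match is None:
        return None  # e.g. "NONSENSE!" or an empty string: drop silently
    language = match.group(1).lower()  # EN -> en; tags are case-insensitive
    given_script = match.group(2)
    if given_script is not None and given_script.lower() != script.lower():
        return None  # e.g. sr-Latn passed to a Cyrillic script template
    return language
```

With this, normalize_lang("EN") gives "en", normalize_lang("sr-Cyrl") gives "sr", and normalize_lang("NONSENSE!") gives None, which the caller would treat the same as a missing lang parameter.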

Complex example

Rolling {{Arab}} and its partners into a single template (including {{fa-Arab}}, {{ks-Arab}}, {{ku-Arab}}, {{ota-Arab}}, {{pa-Arab}}, {{ps-Arab}}, {{sd-Arab}}, {{ug-Arab}}, and {{ur-Arab}}). This also supports a class attribute that varies by language:

 <span dir="rtl" {{ #if: {{{lang|}}} | {{#switch: {{{lang}}}
 | ar = class="Arab" lang="ar" xml:lang="ar"
 | fa | ps | ur = class="{{{lang}}}-Arab" lang="{{{lang}}}" xml:lang="{{{lang}}}"
 | ks | ku | ota | pa | sd | ug = class="{{{lang}}}-Arab" lang="{{{lang}}}-Arab" xml:lang="{{{lang}}}-Arab"
 | az | tg | #default = class="Arab" lang="{{{lang}}}-Arab" xml:lang="{{{lang}}}-Arab"
 }} | class="Arab"}}>{{{1}}}</span>
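The branches of the combined Arabic-script template can be summarized as a small lookup. The Python sketch below is only a model for checking which class and lang value each language should receive; the names arab_attrs, OWN_CLASS_NO_SCRIPT and OWN_CLASS_WITH_SCRIPT are hypothetical, and the language groupings are copied from the #switch above.

```python
# Languages with their own CSS class whose usual script is Arabic
# (no script subtag needed).
OWN_CLASS_NO_SCRIPT = {"fa", "ps", "ur"}
# Languages with their own CSS class that also get the -Arab script subtag.
OWN_CLASS_WITH_SCRIPT = {"ks", "ku", "ota", "pa", "sd", "ug"}


def arab_attrs(lang=""):
    """Return (class value, lang value or None) for the combined template."""
    lang = lang.strip()
    if not lang:
        return "Arab", None  # no lang parameter: class only
    if lang == "ar":
        return "Arab", "ar"
    if lang in OWN_CLASS_NO_SCRIPT:
        return f"{lang}-Arab", lang
    if lang in OWN_CLASS_WITH_SCRIPT:
        return f"{lang}-Arab", f"{lang}-Arab"
    # az, tg and the #default fallback: generic class, script subtag added
    return "Arab", f"{lang}-Arab"
```

For example, arab_attrs("fa") gives ("fa-Arab", "fa"), while arab_attrs("az") gives ("Arab", "az-Arab").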


List of language subtags

Data from the IANA subtags registry (2008-11-25) has been extracted to User:Mzajac/Language attributes/IANA subtags. It needs amendments.

Language tags

w:Language tags: HTML 4.01 specifies the format for language tags to follow rfc:1766 (1995). HTML 5 specifies its replacement, rfc:3066 (2001). The latest specification is rfc:4646 (2006), and another revision is in progress.

No language

Language codes for no language.[1]

  • zxx – non-linguistic matter (e.g., type samples, part numbers, binary data streams)
  • und – for text of undetermined language

XHTML doesn't allow the empty string (xml:lang=""). HTML 5 says “Setting the attribute to the empty string indicates that the primary language is unknown”.

We should avoid setting empty language tags.

To do

  • Support for region or variant subtags [need a list of use cases: probably for Arabic script ({{Arab}} et al) and Cuneiform ({{Xsux}})]
  • Compile list of languages by script, or at least languages which conventionally use multiple scripts [started, at /IANA subtags]
  • Flag unusual language–script combinations [with a hidden category?]
  • Allow explicitly setting an empty language tag (meaning “unknown language”) [do we need this?]
  • Filter out lang="en", which needn't be included because this is the English-language Wiktionary's primary language, set in an entry's top-level HTML element [not required]
  • Generalize the code for any script template by using {{PAGENAME}} instead of “Cyrl” for both the class and lang attributes. [probably pointless]
  • Use the #language: parser function to generalize the code for any Wiktionary [probably not worth it]


External links