Module:data consistency check/documentation

This module checks the validity and internal consistency of the language, language family, and script data used on Wiktionary: the modules in Category:Language data modules as well as Module:scripts/data.

Output

Discrepancies detected:

Module:etymology languages/canonical names

Literary Chinese, the canonical name for the code lzh-lit, is wrong; it should be Literary Chinese.

Module:etymology languages/code to canonical name

Literary Chinese, the canonical name for the code lzh-lit, is wrong; it should be Literary Chinese.

Module:etymology languages/data

Literary Chinese language (lzh-lit) has a canonical name that is not unique; it is also used by the code lzh.
The data key preprocess_links for ??? (th-new) is invalid.

Module:families/canonical names

The code ira-mid and the canonical name Middle Iranian should be removed; they are not found in Module:families/data.
The code ira-old and the canonical name Old Iranian should be removed; they are not found in Module:families/data.

Module:families/code to canonical name

The code ira-mid and the canonical name Middle Iranian should be removed; they are not found in Module:families/data.
The code ira-old and the canonical name Old Iranian should be removed; they are not found in Module:families/data.

Module:families/data

Old Indo-Aryan family (inc-old) has no child families or languages.

Module:languages/data/2

Norwegian Bokmål language (nb) has Middle Norwegian language (gmq-mno) set as an ancestor, but is not in the West Scandinavian family (gmq-wes).
Norwegian Bokmål language (nb) has Danish language (da) set as an ancestor, but is not in the East Scandinavian family (gmq-eas).

Module:languages/data/3/h

Caribbean Hindustani language (hns) has Bhojpuri language (bho) set as an ancestor, but is not in the Bihari family (inc-bih).
Caribbean Hindustani language (hns) has Awadhi language (awa) set as an ancestor, but is not in the Eastern Hindi family (inc-hie).

Module:languages/data/exceptional

Proto-Central Togo language (alv-gtm-pro) does not have the expected name "Proto-Ghana-Togo Mountain", even though it is the proto-language of the Ghana-Togo Mountain languages (alv-gtm).
Proto-Arawa language (auf-pro) does not have the expected name "Proto-Arauan", even though it is the proto-language of the Arauan languages (auf).
Proto-Amuesha-Chamicuro language (awd-amc-pro) has a proto-language code associated with the invalid code awd-amc.
Proto-Kampa language (awd-kmp-pro) has a proto-language code associated with the invalid code awd-kmp.
Proto-Arawak language (awd-pro) does not have the expected name "Proto-Arawakan", even though it is the proto-language of the Arawakan languages (awd).
Proto-Paresi-Waura language (awd-prw-pro) has a proto-language code associated with the invalid code awd-prw.
Proto-Ta-Arawak language (awd-taa-pro) does not have the expected name "Proto-Ta-Arawakan", even though it is the proto-language of the Ta-Arawakan languages (awd-taa).
Proto-Rukai language (dru-pro) has a proto-language code associated with Rukai (dru), which is not a family.
Proto-Basque language (euq-pro) does not have the expected name "Proto-Vasconic", even though it is the proto-language of the Vasconic languages (euq).
Proto-Norse language (gmq-pro) does not have the expected name "Proto-North Germanic", even though it is the proto-language of the North Germanic languages (gmq).
Proto-Kamta language (inc-krn-pro) does not have the expected name "Proto-KRNB lects", even though it is the proto-language of the KRNB lects (inc-krn).
Proto-Chumash language (nai-chu-pro) does not have the expected name "Proto-Chumashan", even though it is the proto-language of the Chumashan languages (nai-chu).
Proto-Maidun language (nai-mdu-pro) does not have the expected name "Proto-Maiduan", even though it is the proto-language of the Maiduan languages (nai-mdu).
Proto-Mixe-Zoque language (nai-miz-pro) does not have the expected name "Proto-Mixe-Zoquean", even though it is the proto-language of the Mixe-Zoquean languages (nai-miz).
Proto-Pomo language (nai-pom-pro) does not have the expected name "Proto-Pomoan", even though it is the proto-language of the Pomoan languages (nai-pom).
Proto-Mazatec language (omq-maz-pro) does not have the expected name "Proto-Mazatecan", even though it is the proto-language of the Mazatecan languages (omq-maz).
Proto-Ossetic language (os-pro) has a proto-language code associated with Ossetian (os), which is not a family.
Proto-North Sarawak language (poz-swa-pro) does not have the expected name "Proto-North Sarawakan", even though it is the proto-language of the North Sarawakan languages (poz-swa).
Proto-Salish language (sal-pro) does not have the expected name "Proto-Salishan", even though it is the proto-language of the Salishan languages (sal).
Proto-Samic language (smi-pro) does not have the expected name "Proto-Sami", even though it is the proto-language of the Sami languages (smi).
Proto-Kuki-Chin language (tbq-kuk-pro) does not have the expected name "Proto-Kukish", even though it is the proto-language of the Kukish languages (tbq-kuk).
Proto-Saka language (xsc-sak-pro) does not have the expected name "Proto-Sakan", even though it is the proto-language of the Sakan languages (xsc-sak).

Module:languages/data/wikidata.json

apc is set as an ISO 639-3 code on multiple items: Q56593 and Q22809485.
kjv is set as an ISO 639-3 code on multiple items: Q838165 and Q31199873.
msn is set as an ISO 639-3 code on multiple items: Q3331111 and Q3563857.
ttt is set as an ISO 639-3 code on multiple items: Q56489 and Q123964178.

Module:scripts/data

Blissymbols script (Blis) is not used by any language and has no characters listed for auto-detection.
Cypro-Minoan script (Cpmn) is not used by any language.
Hiragana script (Hira) is not used by any language.
Kana script (Hrkt) is not used by any language.
Image-rendered script (Image) is not used by any language and has no characters listed for auto-detection.
International Phonetic Alphabet script (Ipach) is not used by any language and has no characters listed for auto-detection.
Moon script (Moon) is not used by any language and has no characters listed for auto-detection.
Morse code (Morse) is not used by any language and has no characters listed for auto-detection.
Musical notation script (Music) is not used by any language.
Unspecified script (None) is not used by any language and has no characters listed for auto-detection.
Rongorongo script (Roro) is not used by any language and has no characters listed for auto-detection.
Rumi numerals script (Rumin) is not used by any language.
flag semaphore (Semap) is not used by any language and has no characters listed for auto-detection.
Visible Speech script (Visp) is not used by any language and has no characters listed for auto-detection.
mathematical notation script (Zmth) is not used by any language.
symbol script (Zsym) is not used by any language.
undetermined script (Zyyy) is not used by any language and has no characters listed for auto-detection.
uncoded script (Zzzz) is not used by any language and has no characters listed for auto-detection.
The codes fa-Arab, ug-Arab, ks-Arab, ps-Arab, ur-Arab, tt-Arab, ota-Arab, ku-Arab, mzn-Arab and sd-Arab are currently alias codes. Only one code should be used in the data.
The codes ms-Arab and kk-Arab are currently alias codes. Only one code should be used in the data.
The data key sort_by_scraping for Japanese script (Jpan) is invalid.

Checks performed

For multiple data modules:

Codes for languages, families and etymology-only languages must be unique and cannot clash with one another.
Canonical names for languages, families, and etymology-only languages must not be found in the list of other names.
Each name in the list of other names must appear only once.
otherNames, if present, must be an array.
Wikidata item IDs must be a positive integer or a string starting with Q and ending with decimal digits.
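The last three checks above can be sketched in Python as follows. This is only an illustration, not the module's own code: the function names are hypothetical, "a string starting with Q and ending with decimal digits" is read here as Q followed only by digits, and the canonical name is compared against the other names of the same entry.

```python
import re


def is_valid_wikidata_id(value):
    """Return True for a positive integer or a string such as "Q1860"."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value > 0
    if isinstance(value, str):
        # Assumption: nothing but decimal digits may follow the leading Q.
        return re.fullmatch(r"Q[0-9]+", value) is not None
    return False


def other_names_problems(canonical_name, other_names):
    """List problems with an otherNames value (hypothetical helper)."""
    if not isinstance(other_names, list):
        return ["otherNames must be an array"]
    problems = []
    seen = set()
    for name in other_names:
        if name == canonical_name:
            problems.append(f"canonical name {name!r} is repeated in otherNames")
        if name in seen:
            problems.append(f"{name!r} appears more than once in otherNames")
        seen.add(name)
    return problems
```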

The following must be true of the data used by Module:languages:

Each code must be defined in the correct submodule according to whether it is two-letter, three-letter or exceptional.
The canonical name (field 1) must be present and must not be the same as the canonical name of another language.
If field 2 is not nil, it must be a valid Wikidata item ID.
If field 3 or family is given and not nil, it must be a valid family code.
If field 4 or scripts is given and not nil, it must be an array, and each string in the array must be a valid script code.
If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code.
If family is given, it must be a valid family code.
If type is given, it must be one of the recognised values (regular, reconstructed, appendix-constructed).
If entry_name is given, it must be a table that contains either two arrays (from and to) or a string (remove_diacritics) or both.
If sort_key is given, it may either be a string, or a table that in turn contains either two arrays (from and to) or a string (remove_diacritics).
If entry_name or sort_key is given, the from array must be at least as long as the to array.
If standardChars is given, it must form a valid Lua string pattern when placed between square brackets with ^ before it ("[^...]"). (It should match all characters regularly used in the language, but that cannot be tested.)
If override_translit is set, translit must also be set, because there must be a transliteration module that can override manual transliteration.
If link_tr is present, it must be true.
There must be no data keys besides these: 1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases", "varieties", "type", "scripts", "ancestors", "wikimedia_codes", "wikipedia_article", "standardChars", "translit", "override_translit", "link_tr".
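A few of the per-language checks above are sketched below in Python, with one language's data table modelled as a dict. The function name check_language_entry and the message wording are hypothetical; the key list is copied from the last item above, and the Lua-pattern check for standardChars is left out because it cannot be reproduced faithfully outside Lua.

```python
# Key list copied from the documentation above.
ALLOWED_KEYS = {
    1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases",
    "varieties", "type", "scripts", "ancestors", "wikimedia_codes",
    "wikipedia_article", "standardChars", "translit", "override_translit",
    "link_tr",
}


def check_language_entry(code, data):
    """Return a list of problem messages for one language's data table."""
    problems = []
    for key in data:
        if key not in ALLOWED_KEYS:
            problems.append(f"The data key {key} for {code} is invalid.")
    if not data.get(1):
        problems.append(f"{code} has no canonical name.")
    if data.get("override_translit") and not data.get("translit"):
        problems.append(f"{code} sets override_translit without translit.")
    if "link_tr" in data and data["link_tr"] is not True:
        problems.append(f"{code} has link_tr set to something other than true.")
    for key in ("entry_name", "sort_key"):
        value = data.get(key)
        if isinstance(value, dict):
            src, dst = value.get("from"), value.get("to")
            # The from array must be at least as long as the to array.
            if src is not None and dst is not None and len(src) < len(dst):
                problems.append(f"{code}: {key}.from is shorter than {key}.to.")
    return problems
```

For instance, an entry with the key preprocess_links would be reported as having an invalid data key, in the style of the message shown for th-new in the output.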

Checks not performed:

If translit is present, it should be the name of a module, and this module should contain a tr function that takes a pagename (and optionally a language code and script code) as arguments.
If sort_key is a string, it should be the name of a module, and this module should contain a makeSortKey function that takes a pagename (and optionally a language code and script code) as arguments.
If entry_name or sort_key is a table and contains a field remove_diacritics, the value of the field should be a string that forms a valid Lua pattern when it is placed inside negated set notation ([^...]).

These are not checked here, because module errors will quickly crop up in entries if these conditions are not met, assuming that Module:utilities attempts to generate a sortkey for a category pertaining to the language in question, or full_link attempts to use the transliteration module.

Module:languages/code to canonical name and Module:languages/canonical names must contain all the codes and canonical names found in the data submodules of Module:languages, and no more.
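The "and no more" requirement amounts to comparing two mappings in both directions. A minimal Python sketch, assuming both sides have already been reduced to code-to-name dicts; compare_lookup is a hypothetical name, the first two messages are modelled on the output shown earlier, and the "should be added" wording is invented for illustration:

```python
def compare_lookup(data, lookup):
    """Compare a lookup table (code -> name) against the data (code -> name)."""
    problems = []
    for code, name in lookup.items():
        if code not in data:
            problems.append(
                f"The code {code} and the canonical name {name} should be "
                "removed; they are not found in the data."
            )
        elif data[code] != name:
            problems.append(
                f"{name}, the canonical name for the code {code}, is wrong; "
                f"it should be {data[code]}."
            )
    for code, name in data.items():
        if code not in lookup:
            # Hypothetical wording; no such message appears in the output above.
            problems.append(
                f"The code {code} and the canonical name {name} should be added."
            )
    return problems
```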

The following must be true of the data used by Module:etymology languages:

canonicalName must be given.
parent must be given and must be a valid language, family or etymology-only language code.
If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code. The etymology language should also be listed as the ancestor of a regular language.
There must be no data keys besides these: "canonicalName", "otherNames", "parent", "ancestors", "wikipedia_article", "wikidata_item".
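As a rough Python illustration of the etymology-language rules above, the sketch below takes the three sets of known codes as arguments. All names and message texts are hypothetical, and the check that the etymology language is also listed as the ancestor of a regular language is omitted.

```python
ETYMOLOGY_KEYS = {
    "canonicalName", "otherNames", "parent", "ancestors",
    "wikipedia_article", "wikidata_item",
}


def check_etymology_language(code, data, lang_codes, family_codes, etym_codes):
    """Return problem messages for one etymology-only language."""
    problems = []
    for key in data:
        if key not in ETYMOLOGY_KEYS:
            problems.append(f"The data key {key} for {code} is invalid.")
    if not data.get("canonicalName"):
        problems.append(f"{code} has no canonicalName.")
    parent = data.get("parent")
    if parent is None:
        problems.append(f"{code} has no parent.")
    elif parent not in lang_codes | family_codes | etym_codes:
        problems.append(f"{code} has the invalid parent {parent}.")
    ancestors = data.get("ancestors")
    if ancestors is not None:
        if not isinstance(ancestors, list):
            problems.append(f"{code}: ancestors must be an array.")
        else:
            # Ancestors may be languages or etymology languages, not families.
            for ancestor in ancestors:
                if ancestor not in lang_codes | etym_codes:
                    problems.append(f"{code} has the invalid ancestor {ancestor}.")
    return problems
```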

Codes in Module:families data must:

Have canonicalName, which must not be the same as the canonical name of another family.
If family is given, it must be a valid family code.
Have at least one language or subfamily belonging to it.
Have no data keys besides these: "canonicalName", "otherNames", "family", "protoLanguage", "wikidata_item".
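The requirement that every family has at least one language or subfamily can be checked by collecting every code that is named as a parent. A Python sketch under simplified assumptions: families maps each family code to its data table, languages maps each language code directly to its family code, and childless_families is a hypothetical name.

```python
def childless_families(families, languages):
    """Return the sorted family codes that no language or family belongs to."""
    has_child = set()
    for code, data in families.items():
        parent = data.get("family")
        # Assumption: a family that names itself as its own parent
        # does not count as its own child.
        if parent is not None and parent != code:
            has_child.add(parent)
    for family in languages.values():
        if family:
            has_child.add(family)
    return sorted(code for code in families if code not in has_child)
```

A family returned by this function corresponds to the "has no child families or languages" message reported for inc-old in the output.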

Codes in Module:scripts data must:

Have canonicalName.
Have at least one language that lists it as one of its scripts.
Have a characters pattern for script autodetection, and this must form a valid Lua string pattern when placed between square brackets ("[...]"). (It should match all characters in the script, but that cannot be tested.)
Have no data keys besides these: "canonicalName", "otherNames", "parent", "systems", "wikipedia_article", "characters", "direction".
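The two script checks that produce most of the messages in the output can be sketched in Python as follows. The data shapes and the name script_problems are assumptions made for illustration, and the validity of the characters value as a Lua pattern is not tested here.

```python
def script_problems(scripts, languages):
    """Map each problematic script code to a description of its problems.

    scripts: script code -> data table (dict)
    languages: language code -> list of script codes
    """
    used = {sc for script_list in languages.values() for sc in script_list}
    problems = {}
    for code, data in scripts.items():
        issues = []
        if code not in used:
            issues.append("is not used by any language")
        if not data.get("characters"):
            issues.append("has no characters listed for auto-detection")
        if issues:
            problems[code] = " and ".join(issues)
    return problems
```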
