Wiktionary talk:Votes/2012-04/Languages with limited documentation



According to the Linguistic Society of America, as of 2009, at least a portion of the Bible had been translated into 2,508 languages by SIL, an organization that seeks to bring Christianity to all cultures of the world. SIL provides the most comprehensive listing of languages at its Ethnologue site, which currently provides records for more than 6,900 languages.

It is difficult to enumerate how many languages are "unwritten" or do not have a strong written tradition (UNESCO), but there are many such languages (LSA, Wycliffe Last Languages Campaign). Prince Kassad (i.e., Liliana-60) cites 5,000 languages without a written tradition. Out of 6,900 languages, if 2,500 have at least a portion of the Bible translated, then 5,000 is an overstatement; however, if "without a written tradition" refers to situations where writing is not a common practice, this may very well be a reasonable number. In any case, the number of languages without a strong written record is significant, and they should be included in the English Wiktionary (and other Wiktionaries).

Attestation criterion 3 of the Criteria for inclusion requires that a word be attested in at least three independent sources to be included in Wiktionary. Since at least December 2007, there has been discussion on lowering the number of citations required in certain instances. In May 2011, a vote was passed to allow inclusion of words in extinct languages with only one usage.


For languages with limited written sources, the minimum requirement of three citations cannot be fulfilled to attest words. Working within Wiktionary's spirit of including all words in all languages, some Wiktionarians are currently in the practice of adding words they know cannot meet the three-citation requirement.

Basic proposed solution

(Italics indicates text added on April 27)

  • Lower the minimal requirement of citations to one for languages with limited online durably recorded written sources

In the recently failed vote for sparsely documented languages, Prosfilaes argues that allowing zero permanently durable citations is unacceptable, with sign languages being a special exception that should not apply to oral languages. Prosfilaes also notes that "...publishing texts on Usenet would be a step above just writing definitions here"; that is, at least one durably archived source should be required for attestation of every word on the English Wiktionary.

Community rules

Lowering the number of required sources to only one creates a much greater potential for abuse. Someone wanting to include a word on Wiktionary can merely make a post on Usenet and then use that as the source for inclusion.

To counterbalance that potential, this proposal includes a provision for each language community to maintain a list of sources that are not deemed acceptable for attestation.

(boldface added on 27 April) To help ensure the integrity of content on Wiktionary, this proposal includes the following provisions:

  • the language community should maintain a list of materials deemed appropriate as the sole source for entries based on a single mention,
  • each entry should have its source(s) listed on the entry or citation page, and
  • a box explaining that a low number of citations were used should be included on the entry page.

Language community rules currently exist. For example, Middle French is defined by that community as falling between the "somewhat arbitrary" years of 1400 and 1600. Old Saxon and Old High German have suggested guidelines for spelling in addition to spans of years that words must be attested in. To handle the complex orthography, Japanese has a set of guidelines as well.

It should be noted that Dan Polansky dislikes the idea of allowing CFI criteria to be language-specified because "...the pages are not locked from editing without voting, and this method allows proliferation of disparate language-specific CFIs when a more compact CFI ranging across languages is possible." This concern is also discussed in the November 2010 conversation on Ancient Greek attestation.

Usage and mention

  • Mentions versus usage - i.e., whether a word that is merely mentioned in a text can be used for attestation as opposed to actual usage in a sentence.
  • In the abstain section of the May 2011 vote on extinct language attestation, Dan Polansky is concerned that allowing mentions is too wide of a scope.

This proposal allows both uses and mentions.

Elasticity of the CFI

As SemperBlotto pointed out last September, the highest principle of Wiktionary is "all words in all languages." In that same conversation, Mglovesfun points out that the Criteria for inclusion (CFI) starts out with that principle, but then the rest of the document is spent discussing restrictions. The CFI creates a balance between including every single written word and excluding one-time misspellings and coinages not likely to be of use to anyone.

Liliana (i.e., Liliana-60) states that "some rules are best left unwritten", which appears to be in reference to the elasticity of the CFI. See examples 1 and 2 in particular for words already in Wiktionary that cannot meet the CFI.

This proposal makes the rules explicit while allowing some elasticity through the requirement of language communities maintaining a list of which sources are acceptable as single sources for words. It makes it possible to better meet the goal of "all words in all languages" while striving to maintain the integrity of content by requiring that all terms in limited documentation languages have citations.

Further reading

BenjaminBarrett12 (talk) 21:28, 11 April 2012 (UTC)

Other languages without a strong written tradition

I'm seriously impressed with this proposal. The single question I have is about "Other languages without a strong written tradition". How do we define this? If you can find an acceptable way to define this, you have my support. It could simply be via discussion or votes as opposed to some sort of external citation. It's not necessarily as difficult to define as you might think. Mglovesfun (talk) 11:27, 12 April 2012 (UTC)

I'm glad it looks better :) You mention both defining "a strong written tradition" and using a process. In the discussion for the predecessor for this vote, Metaknowledge says that Tok Pisin should qualify. Although I actually agree this is probably a good candidate, I noted there that there are newspapers and a written grammar for the language. The way it is written, the community is responsible for demonstrating how it qualifies as not having "adequate durably archived sources." The process is that the community comes up with justification that is used if someone makes a challenge, similar to how the RFV process works for words. For Tok Pisin, the newspaper coverage might be so scant as compared to the language overall that it might qualify. Or maybe that community might decide that a minimum of two sources is their solution. Each language situation is unique, so I don't want to impose a number like "no more than three newspapers" or something like that, but I agree that making this part of the vote stronger would be desirable, and I welcome alternative wording. BenjaminBarrett12 (talk) 18:33, 12 April 2012 (UTC)
I agree with Mglovesfun, "Other languages without a strong written tradition" is a weak point of this otherwise well-crafted vote; I worry that people may vote against this because of the vagueness of that phrase. Of course, it would be very difficult to compile a list of such languages, especially to try to account for all ‘sparsely-written’ languages at one go. It may be necessary to do what I did with the second vote on romanisations of certain languages (after the vaguer first vote failed) and vote in a small number of the languages which we expect to see contributors of in the nearest future, and vote in new languages later as needed (like we are currently discussing in the BP voting to allow romanisations of Egyptian). - -sche (discuss) 19:30, 12 April 2012 (UTC)
Would it work to require languages get a vote in advance of being allowed to use the "without a strong written tradition" clause? That would not tighten the language, but perhaps it would address concerns of people running amok. I didn't do it that way because I thought it would be a burden, but I have no problem with it in general. I'm not personally prepared to create any sort of list. Although I know there are likely others, Tok Pisin is the only language I can think of that might fit into this category. Most others fall into the endangered language category. BenjaminBarrett12 (talk) 19:56, 12 April 2012 (UTC)
The fact that you (and I) can only think of one language that isn't endangered but doesn't have a strong written tradition may be a good reason not to institute a general rule allowing all such languages(!); compare the concerns expressed about the vagueness of the first romanisation vote. Perhaps the text could be:
These other languages without a strong written tradition: Tok Pisin.
(and any others that can be thought of before the vote starts)
No text needs to be included saying that languages can be added to that list by votes, because languages can always be added and removed via votes. - -sche (discuss) 20:10, 12 April 2012 (UTC)
LOL. That solves it, thank you! I'm going to search for other languages. I'm very certain a number exist in Africa, languages with hundreds of thousands to millions of people but with little written tradition, but I'm not very familiar with the situation there. If I get a large number, we can decide whether this method burdens the CFI page too much.
I'm adding Tok Pisin as you recommend and deleting the sentence "The community of each language falling under the category of not having a strong written tradition is responsible for providing justification for qualification" as it no longer applies. This can be re-thought if a lot of languages turn up. BenjaminBarrett12 (talk) 21:59, 12 April 2012 (UTC)
As far as I can tell, every pidgin or creole language in existence, in the entire world, has limited documentation. The reason for this is best illustrated by Tok Pisin, which is a creole. Although a few million people speak it, it is not an official language, and is considered to be lower status than English. With members of one's own specific ethnic group, the native language would be used, and in published materials coming from urban centers, English would be used. This makes a pretty big list of non-endangered languages with limited documentation.--Μετάknowledgediscuss/deeds 00:57, 13 April 2012 (UTC)
I still have to do the work, but I now have a lead that will likely yield at least a dozen languages. Also, Jamaican English should probably be added. I don't think listing languages is going to be reasonable because of the quantity, including pidgins and creoles, but I would like to get a list before making a determination of how to resolve this. BenjaminBarrett12 (talk) 01:16, 13 April 2012 (UTC)
I'm thinking of supporting the vote, even if it's vague. I'm confused about the list of languages though. I'd like languages such as Sinhalese or Lao to be included (+Khmer, Burmese). Today I struggled to find the translation of New Delhi into Lao. If I find it, I'd like it to be allowed even if it's just one source. Can we add the four languages I listed? ---Anatoli (обсудить) 00:31, 13 April 2012 (UTC)
I see that the Laotian government has a website in Lao [1], and Sri Lanka, Cambodia and Myanmar probably have the same, which indicates that there is a written tradition. But perhaps this is a good opportunity to test out the "justification" clause I just deleted. Can you think of strong reasons for including all (or any) of those four languages as not having adequate durably archived sources? BenjaminBarrett12 (talk) 01:16, 13 April 2012 (UTC)
The internet penetration is very low, even if there is some, so finding a durable (internet) source is either too difficult or impossible. As for the transliteration, as all these languages are not Roman based - the transliteration may also be whatever (non-standard or substandard). I'm not saying these languages have NO documentation at all but it's very limited. Sinhalese and Burmese scripts only recently got support by software producers. Still, with even Windows 7 you need to download Burmese fonts to be able to see Burmese. Default Khmer fonts on any version of Windows are tiny. --Anatoli (обсудить) 01:59, 13 April 2012 (UTC)
I agree with you, but how can we make a convincing case (and then generalize for other languages)? Can you provide a quantity of use of those languages on Usenet? And how many books written in those languages are on Google Books? Are there lots of books and newspapers written in those languages in those countries? Are those books and newspapers available online or in major libraries? BenjaminBarrett12 (talk) 02:04, 13 April 2012 (UTC)
The languages of South Asia and Indo-China have a very long history of writing, but their alphabets/abjads/syllabaries are very complex and until just the past ten years almost nobody could type them or typeset them. Almost everything had to be handwritten. The first computer fonts for some of these languages such as Khmer only appeared around 1990, and they were very difficult to use. Native Khmer scholars who were happy to type in English would flatly refuse to attempt typing in Khmer. My translating company used to charge 25 cents per word to translate Khmer, but the result would be handwritten. If anybody insisted on typed copy, we charged an additional $10 per word. Even for Vietnamese, the big Vietnamese newspapers (such as the Saigon Times) would print the entire newspaper without any diacritics (which makes Vietnamese almost impossible to read), after which a team of editors would go through and add the diacritics by hand with a fountain pen.
We sent font specialists to India, Cambodia, and other nations in the 1990s to try to develop a system that would allow those languages to be typed reasonably conveniently, and so now, only in the past decade, most of these languages can be typed intuitively by native speakers. Khmer is still quite difficult, but is vastly improved. Well, it was a similar story with Chinese and Japanese. Until just the past decade, almost everything in those countries was handwritten. All business cards were handwritten. Only a handful of extreme specialists could typeset Chinese or Japanese. The typesetting machines used huge glass plates with the character images in a vast grid, and they would search for and find one character amid the 10,000 or so on the plates, line it up under the lens, then flash a light through it onto film beneath the plate. Our company charged 25 cents a word to translate Chinese or Japanese, but $1.00 per character to type it (approximately $2.00 per word). All these great keyboards for Japanese and Chinese, such as Pinyin or Cangjie, are very recent. Until the past 20 years, it took a single Chinese typographer his entire lifetime to produce a single Chinese font, which is why so few fonts even existed for those languages.
So even though all those languages have had many centuries of written literature, their writing generally did not become mechanized or computerized until just the past decade. —Stephen (Talk) 11:16, 14 April 2012 (UTC)
Thank you for the input, Stephen. The difficulty with the input and attitudes of governments and some individuals made online contents very scarce (though there are many printed books in them). Online dictionaries and textbooks are also rare or poor. So, the issue at hand is about relaxing some CFI rules for a certain group of languages. Do you support the idea that South Eastern languages with complex scripts need special treatment? (I think there's significant improvement with Thai and Vietnamese, so I didn't mention these but I'd also like to add Sinhalese, the situation with which is actually worse than with Khmer, Lao and Burmese). --Anatoli (обсудить) 11:38, 14 April 2012 (UTC)
At the moment it's hard for me to make a convincing case, although the internet statistics must be somewhere (which may prove me wrong in some cases). I occasionally try to fill in gaps (or ask others to do it) in state languages with millions of speakers with low contents in Wiktionary and these 4 give me most headaches. I only have limited interest in them but I'd like to have some framework for adding languages with low contents. Low contents is due not just to the lack of enthusiasts but to the lack of good and comprehensive online dictionaries. I might drop the request if it's too hard to include in this vote. --Anatoli (обсудить) 02:30, 13 April 2012 (UTC)
Spitballing: if we anticipate that there will be many languages without written traditions (an odd phrase, if you think about it...it sounds like they don't write down their traditions, not that they lack a tradition of writing, but at this point I'm just thinking aloud), it might be preferable to establish a subpage of CFI to list them all, and instead of adding These other languages without a strong written tradition: Tok Pisin., add something like The sparsely attested languages listed here. This could also allow us to hold separate votes on different languages or batches of languages, so that no-one would be wary of supporting, or be motivated to oppose, this big vote just because they disagreed that a particular language lacked a tradition of writing. Your thoughts, everyone? - -sche (discuss) 02:41, 13 April 2012 (UTC)
So basically just saying the vote has to be held before a language can qualify for not having a strong written tradition. That seems fine. I can reword that criterion so nothing is included right now and then hold a vote on a list of languages if this vote goes through. My proposed wording would be something like: Other languages without a strong written tradition - languages voted as qualifying and listed at (link). Does that work? BenjaminBarrett12 (talk) 03:46, 13 April 2012 (UTC)
Ha, I realize that does bring us (almost) back to your earlier point of modifying CFI to say that other languages could be included/excluded based on votes. (I hope this circuity isn't making you want to tear your hair out.) Hm, fine by me; what does anyone else think? I'll post a pointer to this discussion in the BP. - -sche (discuss) 03:57, 13 April 2012 (UTC)
LOL. Not at all. Trying to find that sweet gray spot where we allow languages in while still maintaining a hold on abuse is really hard. Thanks for the pointer. That will hopefully show people the reasoning behind putting languages to vote. BenjaminBarrett12 (talk) 04:02, 13 April 2012 (UTC)
Must we list them all? I understand that some, like Yoruba, perhaps, need a vote, but I think we can just say "all pidgins, creoles, and patois with ISO codes" and not bother listing them all, which is a sizable task. --Μετάknowledgediscuss/deeds 04:06, 13 April 2012 (UTC)
Yoruba is a great addition! In addition to Tok Pisin, I was thinking the Nilo-Saharan and Niger-Congo languages must have a lot of languages that should qualify. How about this: "Other languages without a strong written tradition - languages voted as qualifying and listed at WT:CFI/languages with limited documentation; all pidgin, creole, koiné and patois languages with an ISO code are assumed to meet this criterion." Is there a way to work in something like "languages of Africa, Southeast Asia and Oceania," too?
BTW, I just received an e-mail saying that Tok Pisin is very well described, has regular newspapers and very active social media groups. So it needs to be disqualified. Can that be worked in, too, or put in a separate vote...? BenjaminBarrett12 (talk) 05:17, 13 April 2012 (UTC)
I have no access to the physical newspapers and have not found them online yet (I searched under niuspepa). Social media groups do not constitute durable citations. I'm not saying that Tok Pisin terms aren't citable, but I can't cite most of them by Wiktionary's standards. --Μετάknowledgediscuss/deeds 05:29, 13 April 2012 (UTC)
I think this goes into new territory by saying, "Those of us currently contributing to Wiktionary do not have ready access to durably archived sources and therefore the three citation criterion should be waived." That's an area I purposefully avoided, but let's discuss that, too :) Perhaps in a new section on this page? BenjaminBarrett12 (talk) 05:38, 13 April 2012 (UTC)
Yes, some generic conditions are fine but let's vote on other additional languages in need of more focus. Any relaxation of conditions would be good for poorly documented languages or languages with few resources. --Anatoli (обсудить) 06:04, 13 April 2012 (UTC)

I must admit I have not bothered to read the discussion, but having to vote on every single sparsely populated language is going to be very cumbersome. Consider that water alone covers 1,500 languages, and even if 33% of these were common, that'd still leave 1,000 languages you'd need to vote for. -- Liliana 22:59, 13 April 2012 (UTC)

I'm not sure what the water example illustrates, but there are probably 1000 languages or more that have a weak written tradition without being endangered. To alleviate that problem, Μετάknowledge has suggested using the categories of creoles, pidgins and patois as prequalified. Building on that, I suggested geographical regions such as Africa, Oceania and Southeast Asia, but nobody has commented on that idea. It's difficult to be inclusive without going too far. Any ideas? BenjaminBarrett12 (talk) 01:07, 14 April 2012 (UTC)

(@Liliana) I don't mean to suggest we vote on each language separately from every other language, just that we vote on languages separately from the general rule. I was thinking we could put entire slates of languages up for discussion, see if anyone objected to specific ones, and then vote on the noncontroversial ones in batches of whatever size is appropriate. - -sche (discuss) 02:54, 14 April 2012 (UTC)

inappropriate as a sole source

"The community of each language qualifying as having limited documentation should maintain a list of durably archived sources deemed inappropriate as a sole source of attestation." What is the rationale for this sentence? Communities of speakers of languages which few books (etc) are written in should decide that some of those books aren't acceptable? Why? If this is designed to allow the exclusion of unreliable books, like the one which was the source of many recently-deleted Aleut words, I don't think the sentence is necessary; we exclude unreliable sources on general principle (as we did in the case of Aleut). - -sche (discuss) 19:11, 12 April 2012 (UTC)

Aha, re-reading this page, I see you've addressed this above. - -sche (discuss) 19:19, 12 April 2012 (UTC)
However, the sentence worries me and will probably worry other descriptivists: the sentence allows editors to decide that some attested uses of their language are sub-par and don't count. - -sche (discuss) 19:25, 12 April 2012 (UTC)
The list is so people know what needs cross-verification with other sources and brings accountability to balance the loose one-source criterion. Is this list somehow different from this "general principle" you mention? I would think that a general principle like that is also prescriptivist. BenjaminBarrett12 (talk) 19:49, 12 April 2012 (UTC)
Ah, it's good that you press me on that: the Aleut discussion is here (archive-proof permalink [2]), and the distinction I neglected is that the deleted Aleut words were mentions in secondary sources (words in dictionaries), not uses in primary sources. I presume we would have allowed the Aleut words in question if we had been able to find them used in primary sources, even if we had tagged each as an {{obsolete spelling of}}. The "inappropriate as a sole source" sentence seems to disallow both mentions (to the extent that this vote otherwise allows mentions) and uses in primary sources. Perhaps it would be better not to try to disallow questionable words from Usenet, but just to tag them as {{nonstandard}} or worse, when appropriate? - -sche (discuss) 20:05, 12 April 2012 (UTC)
Thank you, thank you! I had done a quick search for the Aleut issue, but didn't dig deeper when it did not turn up. Separately, I changed the word "inappropriate" to "appropriate." As it was, I would have voted against the proposal because it could have been used to ostracize someone who had written something that the others consider inappropriate.
But now there's the problem that Ancient Greek and other extinct languages need to create a list. Many words are mentions, not usages, such as ἀΐσσω, making a lot of violations of the current CFI rule for extinct languages. I see that Wiktionary:About_Ancient_Greek#Attestation says only one citation is needed, but nothing is mentioned about the distinction between usage and mention.
As for the tag, I'm thinking along the same lines. The Ancient Greek words, like ἀΐσσω provide careful documentation showing their source. That's something that would be great. My thought is to change the list requirement to requiring the sources be made explicit when coming from only one or two sources; in my ideal world, it would be in the form of a tag like the 1913 Webster's template. That would take care of Ancient Greek (and probably other extinct languages) and still provide for the integrity of Wiktionary. BenjaminBarrett12 (talk) 22:29, 12 April 2012 (UTC)
Yeah, I think "The community of each language qualifying as having limited documentation should maintain a list of durably archived sources deemed appropriate as a sole source of attestation." can be removed; it would, as you note, burden Ancient Greek, Latin, etc. I'm not sure whether we need to explicitly add to CFI a requirement that sources be cited for these languages: we could just (leave it out, and) RFV (and courteously ask editors about) entries which are entered without citations. - -sche (discuss) 00:19, 13 April 2012 (UTC)
My inclination is that when less than three sources are used, each must be made explicit. That's already being done for Ancient Greek as it should be, and it provides a relatively easy way to delete all words from a source that has been found to be faulty (such as cases like Aleut). Do you think rewording it like that would be acceptable? BenjaminBarrett12 (talk) 01:42, 13 April 2012 (UTC)
Many Latin entries (and, IIRC, also Ancient Greek entries) are created without citations — citations of them are available somewhere, but the creators of the entries don't bother to add the citations into the entries. (horreo had no citations from 2008 until I RFVed it and SemperBlotto cited it, today.) I'd suggest that any requirement that entries be created with citations in them apply only to endangered and 'not-traditionally-written' languages (the categories of entries that we are, with this vote, 'lowering the bar' for), not to extinct languages... Perhaps our Ancient Greek and Latin editors (Atelaes, SemperBlotto, EncycloPetey, et al.) can chime in, but I figure there's no need to change the current criteria for extinct languages, given that those criteria haven't been shown to need changing. Note that requiring the creators of Latin and Ancient Greek entries to add citations when they created entries would actually be making it more bother to add those entries (whereas the rest of this vote is making it easier / more possible to add entries in endangered and sparsely-written languages). - -sche (discuss) 02:26, 13 April 2012 (UTC)
Horreo appears to still not satisfy the CFI because the citation is one mention, not one usage, and the citations are of different conjugations. Of course, this vote would make one usage acceptable. As noted at Wiktionary feedback, single sources (and derivative works) can cause problems. We saw that with Aleut as well. My inclination is to keep the requirement for single usages and mentions for extinct languages as well to maintain integrity of the data. BenjaminBarrett12 (talk) 05:33, 13 April 2012 (UTC)

FWIW, I looked and most Ancient Greek words cite at least LSJ, which provides actual citations. Old English and Latin words, however, have very few citations. Combined, I looked at probably more than 50 words and found only two to three citations (and those were usages). I looked at just a few Sanskrit words, and found they had one usage citation. A requirement of citation would therefore burden OE and Latin quite a bit. BenjaminBarrett12 (talk) 16:52, 13 April 2012 (UTC)

If every Latin and OE word were sent to RFV, then citations would have to be provided for each word. (I assume that citations will be added at some point.) How about a grandfather clause that says new entries with only one source have to make that explicit? BenjaminBarrett12 (talk) 17:51, 13 April 2012 (UTC)
I'd rather not burden the extinct languages at all. I would have expected some of our Latin and Old English editors to speak up in opposition to such a burden, and from the comments above, I'd sooner guess silence reflects a lack of attention to this discussion than that it reflects consent, but silence is hard to interpret. - -sche (discuss) 04:33, 14 April 2012 (UTC)

Limited availability of sources

Μετάknowledge says above that while Tok Pisin might have newspapers and books available, they are not readily available for people who do not live in Papua New Guinea. I think the same situation probably exists for many languages in Africa, for example, where languages are spoken by hundreds of thousands of people but with few written sources outside of the geographical area. I admit I do not have any other specific languages in mind, but if we can address both situations, that would save trouble for later. Is this an adequate reason for lowering the required number of citations to one? BenjaminBarrett12 (talk) 16:49, 13 April 2012 (UTC)


The current text seems to have been written with the assumption that {extinct languages} is a subset both of {languages with limited documentation} and of {languages without a strong written tradition}. That's not an assumption I share, and it doesn't seem to be essential to the intent of the proposal; I think the text should be changed so as not to depend on it. —RuakhTALK 00:18, 14 April 2012 (UTC)

Extinct languages (category 1) necessarily have a limited amount of documentation, whether large or small, because their written material is finite. "Languages without a strong written tradition" is category 3; while an extinct language might fall under category 3, that is not assumed. BenjaminBarrett12 (talk) 01:12, 14 April 2012 (UTC)
In ordinary usage, "limited" implies "small"; and I would definitely oppose this proposal if I thought that it would be applied in such a broad way as to take "limited" to mean simply "finite". (After all, there's a finite amount of documentation of English, too: it's a continually-growing amount, but it will never be infinite.) Fortunately, no matter what you say here, I doubt it would be applied in the way that you suggest.
Category 3 is not "languages without a strong written tradition": it's "other languages without a strong written tradition". Hence the title of this section. :-)   This tends to imply that extinct and endangered languages also don't have strong written traditions. (Consider the phrase "my brother Ted, and other idiots"; would you agree that it presupposes that the speaker's brother Ted is an idiot? This presupposition has finite scope — "me, my brother Ted, and other idiots" does not necessarily presuppose that the speaker is an idiot — but I think the natural reading of the proposed text is as assuming that all extinct languages have "weak written traditions" — for example, that Ancient Greek and Middle English have "weak written traditions" — and therefore that the rule would, a fortiori, apply to any language whose written tradition is weaker than that of Ancient Greek or weaker than that of Middle English or weaker than that of some other extinct language.)
RuakhTALK 02:35, 14 April 2012 (UTC)
I dropped the word "other" like a hot potato! I appreciate the explanation; that was certainly not intended.
As for "limited," do you have a suggestion for a different word? I used "sparsely" for the last vote, but that didn't work, either. All three language groups lack, in some sense, an abundance of written material; even extinct languages with a lot of written material have hapax legomena and other situations where words that should be included just don't have adequate documentation. In this respect, they need special consideration that more robustly documented languages do not need. "Languages with special documentation needs"? BenjaminBarrett12 (talk) 03:02, 14 April 2012 (UTC)
Re: "other": Thanks!   Re: "As for 'limited,' do you have a suggestion for a different word?": I'll think on it, and comment back. —RuakhTALK 03:33, 14 April 2012 (UTC)
"Languages with limited documentation" only really distinguishes the last category, namely "Languages without a strong written tradition" (the other categories can be more directly distinguished as "extinct" and "endangered"). Rather than trying to replace the extinct-languages criterion with a broad "languages with limited documentation" criterion that would have "extinct languages", "endangered languages" and "languages with limited documentation / without a strong written tradition" as 'subsenses', we could keep the extinct-language rule intact and add the other two, like:
“Attested” means verified through
  1. Clearly widespread use, or
  2. Usage in a well-known work, or
  3. Usage in permanently recorded media, conveying meaning, in at least three independent instances spanning at least a year, or
  4. For terms in extinct languages: usage in at least one contemporaneous source, or
  5. For terms in endangered languages — those in danger of becoming extinct, such as those listed by an institution such as UNESCO (Interactive Atlas of the World’s Languages in Danger) or the Living Tongues Institute for Endangered Languages, and dialects of those languages — usage or mention in at least one appropriate durably archived source, or
  6. For terms in languages without a strong written tradition [] , usage or mention in at least one appropriate durably archived source.
(or we could update the extinct language criterion from "usage in at least one contemporary source" to "usage or mention in at least one appropriate durably archived source", but you get the idea: we end up with six rules, rather than four, and we can add what's left of the "Languages with limited documentation" section, namely "Languages falling under [] sole source of attestation.", below "please include the ISBN.", with or — my own preference — without its own section header) - -sche (discuss) 04:54, 14 April 2012 (UTC)
PS, if the vote were to pass as it is currently written, where would you put the "Languages with limited documentation" section? and what part of "For terms ... archived source" will be hyperlinked? Those minor details aren't specified, AFAICT. - -sche (discuss) 04:54, 14 April 2012 (UTC)

This is excellent. I have minor objections, but I have minor objections with my own proposal, LOL. I was about two days from abandoning the non-endangered languages with limited written documentation as Anatoli so generously suggested, but I think this revision will pull everything together. Among other things, I think it neatly eliminates the problem above of what to use instead of "limited."

I still have an inclination to include the requirement for providing the citation for all words based on a single mention. That seems like a reasonable compromise because words in extinct languages with only one mention are not currently allowed. There is no way to truly ascertain which are based on a single mention and which are based on more than that, but it probably applies to some words currently listed in Ancient Greek, Latin and other languages. Does including this requirement seem acceptable?

(Although it now seems moot, re the PS, my intention had been to hyperlink the words "Languages with limited documentation" and put the section under Languages to include, probably before "sign languages." FWIW, I left that specific language out for simplicity's sake as it seemed fairly straightforward, but maybe that was naive.) BenjaminBarrett12 (talk) 07:08, 14 April 2012 (UTC)

Re "no way to truly ascertain...": correct, there is no way, so I question whether or not such a requirement is 'meaningful'/'sensible' in the philosophical sense; I've commented about that in a section below.
General comment: I made some changes here (and undid them, pending discussion). In making those changes, I came up with what I think is a compact way to grant a general dispensation to creoles, pidgins etc, but provide for revocation of the dispensation in the case of specific creoles (and I shunted Tok Pisin and all other specific languages onto that subpage). - -sche (discuss) 04:10, 15 April 2012 (UTC)

Extinct languages and mention

This proposal enables one mention to be sufficient for extinct languages. The previous consensus was that we do not want to allow a single mention for extinct languages: Wiktionary:Votes/pl-2011-05/Attestation_of_extinct_languages. No need to allow mention for extinct languages has been demonstrated. As soon as we allow a single mention for attestation, all dictionary-only words for Latin and Ancient Greek will be included in Wiktionary mainspace, including those present in only a single dictionary. Furthermore, the keyword "contemporaneous" has been omitted from this proposal, replaced with the unspecific "appropriate", which only says that there is some further requirement a source has to meet, but the requirement has been left unspecified.

As someone pointed out in the previous vote (Wiktionary:Votes/2012-04/Languages with limited documentation), endangered languages are probably better treated separately from extinct languages such as Latin and Ancient Greek, as the written corpus of Latin and Ancient Greek is still quite good for use-attestation. The current regulation for extinct languages does not seem to need any fixing, as far as we know, so should be better left alone.

I think mention should be allowed only--if at all--for languages that are mostly attested only using mentions. As soon as a sizable part of a language's vocabulary can be attested in use, mention should be disallowed for that language.

However, any attestation by mention abandons in part Wiktionary's focus on use. When mentions are deemed sufficient for languages with a poor or missing written corpus, this is just one step away from deeming them sufficient for PIE: PIE terms can be attested using mentions from a single source, and PIE has a poor written corpus (in fact, none at all). I do understand that PIE is an academic invention, unlike endangered languages with no written corpus. In any case, any consideration of mention, especially of merely a single one, threatens to open the gates for unreal things to enter Wiktionary mainspace.--Dan Polansky (talk) 07:49, 14 April 2012 (UTC)

Thank you for this feedback. All of these issues have been on my mind as well, and a lot of this proposal was intended to address them. As seen under 'Mentions such as the following would be allowed,' there are words from Ancient Greek, Latin and Sanskrit that violate the "no one mention" rule. In particular, many Sanskrit words appear to rest on single mentions, so I am doubtful that 'No need to allow mention for extinct languages has been demonstrated.' I think that excluding mentions is also a major problem for Dacian (use or mention in May 2011 vote).
To summarize, prohibiting single mentions excludes languages such as Dacian and Nauru Pacific Pidgin, Makah and Zarma (see the project page), and allowing single mentions without restriction opens a Pandora's box of undesirable words. FWIW, we see that despite the CFI, Ancient Greek and Sanskrit, and probably Latin and Old English have words based on single mentions. My solution is the requirement 'The community of each language qualifying as having limited documentation should maintain a list of durably archived sources deemed appropriate as a sole source of attestation.'
The word "appropriate" was used in place of "contemporaneous" to allow mentions, and the meaning of "appropriate" is left to the community to determine. (I really like the word "contemporaneous," and one possibility is to use language along the lines of "contemporaneous usage or appropriate mention." Would that be better?)
With respect to separating endangered languages and extinct languages, I think that will happen. sche has proposed making separate criteria for: extinct, endangered and limited documentation languages. See the conversation above beginning at 'Rather than trying to replace the extinct-languages'. Also in that conversation is the proposal to add a requirement that all entries based on single mentions include the citation, which will affect extinct languages in cases where there is only one mention. (From the small sample I looked at, Sanskrit in particular seems to include the citation.)
As for 'As soon as a sizable part of a language's vocabulary can be attested in use, mention should be disallowed for that language,' that is the purpose of 'Languages falling under these categories may nevertheless be excluded by vote if judged as having adequate appropriate durably archived sources.' (There is some discussion of this above under 'inappropriate as a sole source.')
I hope this explanation alleviates your concerns for this proposal. A perfect solution is not possible, but I hope this proposal (with some further modifications) addresses the issues well enough to make Wiktionary a more inclusive place without adding a lot of chaos. BenjaminBarrett12 (talk) 09:21, 14 April 2012 (UTC)
Re: "I hope this explanation alleviates your concerns for this proposal." It does not. I am going to oppose. I think that dictionary-only would-be words of Latin and Ancient Greek should be excluded. Thus, θεπτάνων, Φασηλίτης, and Χάλυψ should be excluded unless attested in use; ditto for Latin absisto and adamas; I am referring to examples that you have mentioned in the vote. However, attested in use is not the same as having the use citations available in Wiktionary. Thus, many attested words do not have the requisite quotations in Wiktionary. Some of the examples that you have mentioned are probably attested; see e.g. google books:"Χάλυψ" and [3] in particular.
On mentions, see also Wiktionary_talk:Votes/pl-2011-05/Attestation_of_extinct_languages#Use or mention thread in particular, with its 1295 words. --Dan Polansky (talk) 11:00, 14 April 2012 (UTC)
If that's your only concern, I think that sounds great, then! I was trying to accommodate all those mentions, particularly for Sanskrit, but if people are willing to exclude those, the mention provision for extinct languages can be dropped :) Dacian can be voted in as a language without a strong written tradition at a later point if people are interested. BenjaminBarrett12 (talk) 16:32, 14 April 2012 (UTC)
I am not qualified to say, but perhaps θεπτάνων and Sanskrit can be permitted with the second criterion "Usage in a well-known work" (and perhaps that is a decision already made). That would put to rest my last concerns about extinct languages. BenjaminBarrett12 (talk) 18:13, 14 April 2012 (UTC)
I've pointed the only editor I know (think?) favours including mentions, Atelaes (talkcontribs), in this direction... but it is probably best to make changes to the extinct-language criterion a separate vote, because the main point (as I understand it) of this vote is to relax the criteria for languages that haven't written much or may soon write no more, and adding in changes to the extinct languages criterion risks trying to do too many different things at once, and garnering the opposition of sōlō ūsū editors. PS, I don't think θεπτάνων can be permitted per "Usage in a well-known work" if it isn't used in that work... ;) - -sche (discuss) 03:10, 15 April 2012 (UTC)
Oh, right. Sanskrit doesn't work, either, for usages. Of the first eight words in the middle column at Category:Sanskrit_nouns, three cite usage in the same dictionary, and the other five have no citation. That's not a statistically valid sample, but it seems likely that there is a major problem. However, I am more concerned about trying to usher this proposal to a successful vote and would rather drop extinct languages (in general) for the reason you cite. BTW, I said above that Dacian could be voted in at a later point in time, but I think I will add it to this vote along with an assortment of other languages and categories currently being drafted. I don't think mentions for Dacian will be particularly controversial. BenjaminBarrett12 (talk) 05:30, 15 April 2012 (UTC)

To begin with, I think we have to think about the purpose of including usages, and excluding mentions. For starters, let me just clarify that I support excluding mentions for English terms, but I also support including *some* mentions of Ancient Greek. I think that our views on usages and mentions are intended to support our ultimate goal of providing as accurate a description as possible of some element of some language to anyone who's interested in knowing. When we exclude English dictionary-only terms, we are (in my opinion) doing our readers a service, by not incorrectly asserting some word is a part of the English language when it, in fact, is not. If such words actually were a part of the English language, they would probably be possible to cite. However, it should be borne in mind that we could be making a mistake. Some of these dictionary-only terms might actually be a real part of the English language, or they might have been at some time, or in some place. We're taking our best guess, based on the expansive corpus of English, that a word not showing up in such a corpus is a good indicator that it's not really in use. This is a very good guess.

Ancient Greek has a very large corpus, especially compared to other languages of similar age. We can cite a great many words based on usage from this corpus. However, it really needs to be stated, as some editors seem to be missing this point, that it is infinitesimal compared to the English corpus, or the corpus of any modern language used in a first-world country. Seriously, the entire Ancient Greek corpus would not even be a drop in the bucket of today's English corpus (i.e. the addition to the English corpus produced only today). While we should thank our lucky stars that we have as many Ancient Greek works as we do, we would be incredibly naive to think that it provides us with a more or less comprehensive account of the Ancient Greek language, like the English corpus does for English.

So... if we see a word in a modern English dictionary, but not in its general corpus, we are reasonable in thinking that the dictionary got it wrong. When we see a word in an Ancient Greek dictionary, but not in its general corpus, we are unreasonable in thinking that the dictionary got it wrong. Most likely, words mentioned in an Ancient Greek dictionary (to clarify, that is an Ancient Greek dictionary written in Ancient Greece, not an Ancient Greek dictionary written now) were actually a genuine part of the language, and simply not attested. To take a parallel, if I claimed the existence of a 30-foot animal living in North America 70 million years ago, based on genetic evidence, it would be a reasonable assertion (not necessarily true, but reasonable), even if we did not find it in the fossil record. However, if I made the claim for a 30-foot animal living in North America now, but no one had ever seen it, that would be unreasonable.

To finish this all off, I'm not writing entries up for any Ancient Greek word that anyone anywhere mentions. I am allowing mentions from a sole work, the dictionary written by Hesychius, for the reasons I've previously stated. I should also mention that I'm not doing this of my own accord. Quite frankly, writing these mention-only entries generally takes a lot more work than regular entries, and I don't really like doing them. I'm doing them because other editors are requesting them, because they are important words. -Atelaes λάλει ἐμοί 02:32, 19 April 2012 (UTC)

requiring citations

If you want to require citations be provided at the time of entry creation (postscript: as suggested above, but now discussed in this dedicated section) for some categories of languages, I note that there is currently no language in the vote which requires such a thing. - -sche (discuss) 03:34, 15 April 2012 (UTC) To rehash the preceding sections for those just coming to this one: it's been suggested that we require pre-citing not only of entries in the endangered and sparsely-written languages for which we are otherwise relaxing our CFI, but also of entries in extinct languages. - -sche (discuss) 04:58, 15 April 2012 (UTC)

My thought is to make it obligatory for all entries based on a single mention or usage. Perhaps even a disclaimer like the 1913 Webster's template can be added so people know the term does not have a solid basis. This would also make it easy to find terms based on work X if it turns out work X is unreliable (such as happened in the Aleut case). --BenjaminBarrett12 (talk) 03:46, 15 April 2012 (UTC)
Right, but do you want to make that obligatory by adding a sentence to CFI, or by RFVing entries that are created without citations? I'd actually tend to favour the second option, because it's the only way a CFI requirement like "All entries attested in only one work must cite that work" could be enforced, anyway. Think about it: someone could create a word with no citations, and argue that the word was used in two (or three, or four, or fifty) works, and thus wasn't required to cite a single source. The only way to attempt to prove that the word was used in only one work would be to RFV the term: and if RFV found only one citation, the term would pass as cited... so why not omit the requirement from CFI, and just use RFV (and existing policy) from the start as the mechanism for requiring entries to have citations? That is definitely compatible with having a different template for each work: Ancient Greek already has templates like {{grc-cite|Homer|Iliad}}, which we could easily turn into e.g. {{grc-cite-Homer-Iliad}} to make it easy to use Whatlinkshere to find the words based on each work. - -sche (discuss) 03:55, 15 April 2012 (UTC)
I considered that quite a bit, too. The reason for including it is to put people in the practice of doing so. It is not possible to prevent people from breaking the rules if they don't want to follow them, but most people want to be good netizens, and this sets a standard for them to follow. Moreover, the amount of work involved in doing this is minuscule compared to the benefit: When someone is entering (with proper permission) Dr. Livingston's Swadesh list for Zarma, they start by creating a template and then just paste it in each time with the rest of the entry structure. My concern is two people entering words from multiple vocabulary lists and not leaving a trace of where they come from. Then if a case like Aleut occurs, it's really difficult to resolve the situation.
As to existing policy, is there something that requires citations? I see a lot of words without them. --BenjaminBarrett12 (talk) 04:13, 15 April 2012 (UTC)
There have been a few discussions of requiring citations be provided at the time of entry creation for a few narrow categories of entries, but I don't think any have passed(?) and in general, terms are certainly only required to be citable, not cited: we assume good faith and our faith is borne out by the experience that most of the terms people enter, in extinct languages and in general, are valid (citable).
Hm, perhaps the requirement could simply be that words in whatever categories (endangered languages, sparsely-written languages, perhaps extinct languages) are required to cite at least one work. That says nothing about how many works the words are used in (could be just the one, could be two—whether the creator of the entry knows about the second or not), and thus elegantly avoids adding a philosophically non-sensible rule to CFI. - -sche (discuss) 04:35, 15 April 2012 (UTC)
At this point, I'm basically considering the inclusion of extinct languages to be, err, dead. The requirement of _at least_ one citation is elegant, thank you. I'll make this change, the deletion of the extinct language category and the addition of specific languages and language categories in the next day or two. BenjaminBarrett12 (talk) 05:34, 15 April 2012 (UTC)
This presents a real problem for Latin. Webster's often gives Latin etymologies, but the 1913 edition includes mention of Latin terms and forms that aren't actually attested in any Latin source. In some cases, the Latin is a reconstruction, and in other cases it's an unattested form of a word known only in a different form (e.g., an infinitive inferred from a 3rd-person form). So the first problem I see with applying the "one use or mention" criterion is that we'll get "citations" copied from sources now known to be inadequate for forms that we do not know existed in the language. So, as before, a broad concept covering all extinct languages is probably too much for what this vote is trying to achieve. If extinct languages can be left out of the current process, then the suggestions look promising. --EncycloPetey (talk) 07:41, 15 April 2012 (UTC)

Proposed languages for inclusion

Above, sche made some temporary changes as a proposal, saying: "In making those changes, I came up with what I think is a compact way to grant a general dispensation to creoles, pidgins etc, but provide for revocation of the dispensation in the case of specific creoles (and I shunted Tok Pisin and all other specific languages onto that subpage)."

User:Atitarev, User:Metaknowledge and I have compiled a list of languages that should be included on that subpage. The languages of Africa are not well represented - can anyone familiar with African languages provide some guidance? Feedback on the rest is welcome as well, of course :)

The following natural languages:

  1. endangered languages - languages in danger of becoming extinct such as those listed by an institution such as UNESCO Interactive Atlas of the World’s Languages in Danger, the Living Tongues Institute for Endangered Languages or the Australian Indigenous Languages Database and dialects of those languages;
  2. non-Indo-European languages of the Americas, Australia and Oceania, excluding Guaraní;
  3. pidgins and creoles;
  4. the following languages of Africa: Khoisan languages, Wide Grassfields languages, Zarma;
  5. Dravidian languages, excluding Malayalam, Tamil and Telugu;
  6. Tibetan languages;
  7. North Caucasian languages;
  8. languages of Southeast Asia and the Formosan languages, excluding Cantonese, Indonesian, Malay, Standard Mandarin, Thai and Vietnamese; and
  9. Äynu (aib), Andamanese languages, Assamese, Dacian, Dhivehi, Guernésiais, Hunsrik, Jèrriais, Kartvelian languages, Kokborok, Kven Finnish, Lepcha, Meänkieli, Meitei, Mizo, Sercquiais and Sinhalese.


  • The last category is just a catch-all for languages not fitting in elsewhere.
  • BTW, I dropped patois because it has a number of meanings, and the relevant meanings appear to be covered adequately by pidgin and creole.
  • Although I specifically excluded Arabic, I did not exclude European languages because it got messy: Guernésiais, Jèrriais and Sercquiais are European. I think it's reasonable to expect the "languages of the Americas" to be interpreted as not including English and French, for example.

--BenjaminBarrett12 (talk) 05:41, 15 April 2012 (UTC)

You needn't list any endangered languages on the subpage, because they're already covered by the dedicated rule that all endangered languages have relaxed CFI (the same relaxed CFI as the languages on the subpage). :) For the same reason, you needn't put the extinct languages of America and Australia on the list... unless you're trying to work them into the "can be attested by one mention" category. And you needn't list any pidgins, creoles or koines on the subpage, because they're already blanketly-included by the text of the rule... the subpage should just catch everything those rules can't. (Alternatively, take the blanket inclusion of creoles out of the rule, and move it to the subpage.) As for the rest: my suggestion is to propose in this vote only specific, named languages that you can defend the sparse attestation of (Tok Pisin, Sinhalese, etc), and try to vote in big categories like "the languages of Oceania" later. Remember, there's a mechanism for adding to the list, so there's no pressure to come up with a complete list before this vote, and vagueness (as I learnt in the romanisation votes) tends to make people uneasy. - -sche (discuss) 06:39, 15 April 2012 (UTC)
That's a really good idea. The lines for Africa through Southeast Asia can each have individual votes (they may need them; I've noticed a lot of linguistic nationalism in relation to this sort of thing). After all, we can always add more... --Μετάknowledgediscuss/deeds 06:48, 15 April 2012 (UTC)
I keep going back and forth as to whether to put the entire list on the subpage. It seems like extra work to go from the CFI list to the "limited languages section" to the subpage; everything should just be grouped together in one place. But if nobody thinks it's a big deal, it is fine. The reason for extinct languages in the Americas and Australia is, yes, for single mentions for languages like w:Eyak_language that recently went silent. I still need to look more closely at sche's proposal for voting languages as having a strong written tradition, but any rule like that should apply to all languages. Any language on the list is capable of gaining a strong written tradition; Hawaiian, Navajo and Cherokee are three in particular that are candidates now or in the near future. Please note I made changes to the African languages so that only the Khoisan languages and Zarma are included now. --BenjaminBarrett12 (talk) 07:52, 15 April 2012 (UTC)
My reasoning (for listing languages on a subpage) is: rules like "For terms in endangered languages — those in danger of becoming extinct, such as those listed by an institution such as UNESCO (Interactive Atlas of the World’s Languages in Danger) or the Living Tongues Institute for Endangered Languages, and dialects of those languages — usage or mention in at least one appropriate durably archived source" and "For terms in languages without a strong written tradition — languages named in the list of languages without a strong written tradition, and all pidgin, creole, koiné and patois languages with ISO codes, except those deemed by vote to have a strong written tradition — usage or mention in at least one appropriate durably archived source" are reasonably compact, so we can go ahead and list them on the main page. The list you have above, on the other hand, necessarily takes up a lot of space. It will take up even more space as it becomes more complete. To make it straightforward to look at the list and determine whether or not a language is on it, and so everyone knows precisely which languages they're voting in, I would even prefer that the list name every language it affects, rather than trying to save space by referring to categories. - -sche (discuss) 05:04, 16 April 2012 (UTC)
That makes a lot of sense for the sub-page. Excluding the endangered languages, referring to all the languages would still likely be a momentous task, probably numbering in the hundreds if not a thousand or more. But it might be possible to create a registry of sorts so everyone can see which languages are using the one-citation criterion. My impression is that when someone wants to add words in a new language, an administrator adds the namespace and there is no formal process. If a process were created, then it would be fairly easy to track. Alternatively, a requirement could be made that people sign their language up before using the criterion. --BenjaminBarrett12 (talk) 06:26, 16 April 2012 (UTC)
Please remove Kannada from Dravidian languages as well. It's even on Google Translate with a fairly good transliteration. kn:wiki is also quite developed. --Anatoli (обсудить) 02:54, 26 April 2012 (UTC)
Also, Sinhalese appears on the list as if it's Dravidian. It's Indo-Aryan like Hindi but the documentation/resources are very limited, so it should be listed but in a different place. --Anatoli (обсудить) 02:57, 26 April 2012 (UTC)
The intention was geographical region, but you're right: It does look like that. Do you have a suggestion? --BenjaminBarrett12 (talk) 03:32, 26 April 2012 (UTC)
I didn't make any lists of languages requiring more attention; I just mentioned some languages. I'll leave the wording to you. I personally don't see the need for grouping them by language families; it's unrelated to the issue at hand. An endangered language can belong to any group. Where you place them in the page is not critical, IMHO. --Anatoli (обсудить) 03:55, 26 April 2012 (UTC)

Presence of durably archived content online.

Anatoli's comments above leave me thinking that although the total amount of documentation for a language — number of speakers, strength of written tradition — is logically relevant to our attestation criteria, it's not really what's pragmatically relevant. I think that the main factor in citeability is really the amount of durably archived content that is available online. A language that has a very small written corpus, all of it online, is really in no worse a position, as far as WT:CFI#Attestation and WT:RFV are concerned, than a language that has a moderately large written corpus, none of it online.

Would it be too drastic a change in scope to suggest we focus on this instead? :-/

(To be sure: I've often added citations from print sources that aren't available online, and I'm sure I'm not alone, but there the sequence of steps is always the wrong direction from what RFV needs: if I come across an interesting usage in something I'm reading, I might look up our entry and see if it needs quotes, but if I come across an entry and see that it needs quotes, I could never find those quotes in something not available online.) —RuakhTALK 03:36, 16 April 2012 (UTC)

Yes, practically, it's much harder to provide citations for such languages, let alone prove with a dictionary or grammar reference in English. If the situation changes or a language is proved to no longer be difficult to provide citations for, then we could review. Like Stephen mentioned, Vietnamese used to be very cumbersome only a decade ago. It would have been too hard to find enough online material for it. That's no longer the case in my observation. --Anatoli (обсудить) 04:46, 16 April 2012 (UTC)
This is very much a problem for languages written in non-Latin scripts. These aren't usually picked up by engines like Google Books, so it becomes really hard to find durably archived citations. -- Liliana 04:34, 16 April 2012 (UTC)
It works for Russian, Arabic, Chinese, Japanese, Korean, Thai, Persian, Hindi, etc. (with some limitations). Note that Google is not so flexible in finding Russian inflected forms. Yandex is better adjusted to Russian grammar. Perhaps the way Google searches work for some languages needs improvement. It works quite well with Mandarin, Korean, Japanese and Arabic: it ignores spaces in Korean quoted strings, for example, so words with and without spaces can be equally picked up; it handles alternative spellings in Japanese well (you can type a word in Hiragana to find a word in Kanji); and it searches both the traditional and simplified Chinese character sets. It handles Persian words with the w:zero-width non-joiner and knows both strict and relaxed Arabic spelling. It doesn't understand alternative Hindi spellings (गरीब and ग़रीब are the same word in Hindi). --Anatoli (обсудить) 04:46, 16 April 2012 (UTC)
I think this is a very important point because it goes to the focus of Wiktionary. As for the list, that actually is an important part of how it came together. For example, the language selections of Southeast Asia and the Americas include languages with plenty of documentation, but not a lot online. (I am pretty worried about the wide scope of the languages of the Americas as I'm afraid people might object that the scope is too wide.) --BenjaminBarrett12 (talk) 05:38, 16 April 2012 (UTC)
Do I understand correctly that you've already started to take "languages without a strong written tradition" as meaning "languages without a strong tradition of digitizing written works"? In that case, I definitely think the proposal needs to be explicit about that! —RuakhTALK 20:25, 16 April 2012 (UTC)
Do you oppose such a definition? If the general mood is the same, I personally see no benefit of the vote to my efforts. I don't have access to libraries in Colombo or Vientiane. --Anatoli (обсудить) 01:46, 17 April 2012 (UTC)
Sorry, I don't really understand your comment . . . I'll clarify my views, and hopefully that will answer your question. Firstly: As I said above, I think it's more practical to use "how hard is it for Wiktionarians to find valid cites in the language?" as a criterion than "how many valid cites in the language theoretically exist?". Secondly: I would oppose having WT:CFI claim that our criterion is "how many valid cites in the language theoretically exist?" if our criterion is actually "how hard is it for Wiktionarians to find valid cites in the language?". That is: I oppose lying. If we want a certain criterion, then we shouldn't modify WT:CFI to list a different criterion that we don't actually intend to abide by. —RuakhTALK 02:11, 17 April 2012 (UTC)
I also find it difficult to understand you but it must be my English. I don't suggest lying. I suggest the online availability for a language should be one of criteria for relaxed treatment of some languages. If we have editors equipped with printed books, it's fine but most editors use the most convenient sources, the web. The Wiktionary content for languages with low internet penetration and learning resources could be boosted if we allow to treat them as languages in danger (even if they are not really in danger) as a temporary measure. I haven't created this vote, so it's only a wish, if people oppose this, I will drop my request. --Anatoli (обсудить) 02:24, 17 April 2012 (UTC)
Re: "I suggest the online availability for a language should be one of criteria for relaxed treatment of some languages": Yes, so we're in agreement. —RuakhTALK 02:44, 17 April 2012 (UTC)
Ruakh, would you support this proposal if it were worded as proposed below? Do others also support such wording? --BenjaminBarrett12 (talk) 02:16, 17 April 2012 (UTC)
Sorry, I see a bunch of wordings proposed below; which one, specifically, are you referring to? Or do you just mean, more generally, would I support with a wording along the lines of one of those? If the latter, then — yes, I think I would. —RuakhTALK 02:44, 17 April 2012 (UTC)
Thank you for the response. If others voice approval (or are silent), I will redraft the proposal with revised wording and see if it looks like that will work. Hopefully just one more round or so before actually voting :) --BenjaminBarrett12 (talk) 02:51, 17 April 2012 (UTC)
I included most Southeast Asian languages and Sinhalese as per Anatoli and creoles that include Tok Pisin as per Μετάknowledge. I'm not completely comfortable with those changes because, as you basically say, they are not "languages without a strong written tradition." My plan is to rewrite the proposal to incorporate all of the discussions above, but I would like to wait until consensus is reached on this issue.
Either the wording "languages with limited documentation" (and "strong written tradition") should be changed, or the Southeast Asian languages, Sinhalese and Tok Pisin should be dropped (and added later in a different vote). Either way, the wording needs to be modified (or an additional criterion added) because it won't be appropriate to add those languages to the sub-page as currently described. Therefore, I think there are two questions that need to be addressed, one following from the other:
  1. Is there consensus that online documentation is really what the bottom line is for inclusion?
  2. If so, what wording would be acceptable? Here are some options:
  • languages without strong online durably archived content,
  • languages with limited online documentation,
  • languages with limited readily available documentation,
  • languages not easily documented.
The last is my favorite, in part because I think the name of the proposal would not have to be changed. Of course, the sub-page with the list of languages and language categories will still be kept. --BenjaminBarrett12 (talk) 21:26, 16 April 2012 (UTC)
Are you still intending to have a list of which languages fit the above descriptions? If not, they're all far too vague. And are you intending that a descriptor like "not easily documented" replace "without a strong written tradition" (i.e., that we'll vote in two new rules, one relaxing CFI for endangered languages and one relaxing CFI for those which are not easily documented), or are you intending that we vote in only the broad relaxation of CFI for languages which are not easily documented (given that endangered languages are a subcategory of languages which are hard to document)? Also, I think "languages which are not easily documented" is perhaps too broad a phrase. Several of the languages Stephen mentioned are very well documented (there are hundreds of thousands of manuscripts and handwritten books, etc), and very easily documented (it is easy to write them down by hand, thus documenting them); the key is that there is little information available in them online. - -sche (discuss) 03:11, 17 April 2012 (UTC)
My hope is to wrap endangered and limited documented languages in one line in the CFI criteria, right after the extinct languages criterion. I'm pretty sure that everything except the description of the languages is the same, so just one line would be the most efficient. (Then the additional section with the explanation and the sub-page with language categories and specific languages.)
You are right about "ease of documentation." I suppose the description needed is "languages with limited online documentation."
If that sounds like it will earn support, then I will redraft the proposal. --BenjaminBarrett12 (talk) 03:31, 17 April 2012 (UTC)

Adding the word "online"

I haven't addressed all the issues raised on this page, yet, but I have added the word "online" and deleted the extinct category in a draft available for viewing/commenting at User:BenjaminBarrett12/scratch2.

So the wording is now "languages with limited online documentation."

Does the title of this vote need to be reworded to include "online"?

Also, I plan to rewrite the first few sections on this page that lay out the background of this proposal. When I do that, I plan to write something like, "The sections Background through Further reading below have been rewritten. Please see (earlier version link) for the earlier version." Is that acceptable? --BenjaminBarrett12 (talk) 18:24, 18 April 2012 (UTC)

It's not so much a matter of whether there's documentation online, IMHO, as of whether there's "durably archived" or "permanently recorded" content online. (And searchably so.) I'm starting to wonder if instead of adding a new attestation mechanism to the beginning of ===Attestation===, it might work better to add a new subsection ====At least three==== in the right position. That subsection could then clarify that "at least three" is designed for languages that have significant amounts of permanently recorded content available and searchable online, and then go on to describe the only-one-usage criterion (using the same sort of logic as is currently in the vote). I dunno. What do you think? —RuakhTALK 19:45, 18 April 2012 (UTC)
FWIW, the wording of the proposal is: "For terms in languages with limited online documentation: usage or mention in at least one appropriate durably archived source." Perhaps the "durably archived source" part needs to be in the first clause. As to the rest, I am creating a new section with a new name. --BenjaminBarrett12 (talk) 21:46, 18 April 2012 (UTC)
Re: "I am creating a new section with a new name": Yes, that new section is what I was referring to when I wrote of "the same sort of logic as is currently in the vote". —RuakhTALK 13:24, 19 April 2012 (UTC)
Okay. I'll draft the rest and see if that's acceptable. --BenjaminBarrett12 (talk) 16:27, 19 April 2012 (UTC)

The number of usages

In the section above, Ruakh discusses creating a new section that specifies how many citations are required. Before writing that, I would like to discuss how the four criteria would be changed.

Using boldfacing to highlight the key points, the attestation section currently says:

“Attested” means verified through
  1. Clearly widespread use, or
  2. Usage in a well-known work, or
  3. Usage in permanently recorded media, conveying meaning, in at least three independent instances spanning at least a year, or
  4. For terms in extinct languages: usage in at least one contemporaneous source.

Some proposed changes:

  • The number of usages could be dropped in 3 (the number in 4 is somewhat redundant)
  • Also, if this section is to be changed, I would like to address the wording of "verified through usage..." It is not verification through usage, but of usage.
  • Usually, I see people talking about "durably recorded" or "durably archived," so that should perhaps be altered.
  • Some of the instances of "or" could be dropped.

That results in

“Attested” means verification of:
  1. Clearly widespread use,
  2. Usage in a well-known work,
  3. Usage in durably recorded media, or
  4. For terms in extinct languages: usage in a contemporaneous source.

Would this, followed by the section of how many usages for languages with and without adequate available documentation, work? (That section would link to the list of languages, including endangered languages, creoles, etc.) --BenjaminBarrett12 (talk) 22:09, 18 April 2012 (UTC)

Text for number of citations

Here is the text I drafted for a section discussing how many usages or mentions are needed, and what special provisions apply to terms with fewer than three usages:

In general, three citations in which a term is used are considered adequate for inclusion on Wiktionary. For languages with limited documentation and languages with limited documentation available online, however, only one usage or mention is adequate.

For languages with limited (online) documentation, the following provisions should be observed to ensure the accuracy and integrity of Wiktionary content:

  • the language community should maintain a list of materials deemed appropriate as single sources or requiring only one other source,
  • the sources should be listed on the entry or citation page, and
  • a box explaining that fewer citations were used should be included on the entry page.

Does this sound reasonable? For the last provision, I'm thinking that it might be nice for endangered language communities to actually have multiple boxes, saying things like "This term has been verified by two native speakers" to provide an indication of how likely the entry is to be accurate. --BenjaminBarrett12 (talk) 00:43, 22 April 2012 (UTC)

Re "the language community should maintain a list of materials deemed appropriate as single sources or requiring only one other source": does this mean a community could decide that Book X was not, in and of itself, appropriate as a single source, but that words in Book X required only one other source, e.g. Book Y? If so, given that the rule states that one citation is sufficient to verify a term, why not list Book Y (and any other valid source) as "materials appropriate as single sources", and simply not list Book X? That would make the underlined bit unnecessary. - -sche (discuss) 04:13, 26 April 2012 (UTC)
In a situation where there are only two resources for a language, but both of them have known errors, the community can say both are required for attestation. I agree the wording is a little unwieldy, but I couldn't think of a better way to put it. --BenjaminBarrett12 (talk) 04:57, 26 April 2012 (UTC)
Postscript: The wording has become more complex, so I reduced it back to one. If the community wants to make a two-source requirement, I don't see why they can't do that. --BenjaminBarrett12 (talk) 06:05, 27 April 2012 (UTC)
Re "French in the United States": allowing not-widely-used regional varieties of widely-used languages is different from allowing not-widely-used languages. In the past, several County Durham English regionalisms by Top Cat 14 (talkcontribs) were deleted because they did not meet CFI: we like to include regionalisms, but if they're so infrequently recorded that they do not meet their languages' general CFI, I argue they're not important enough parts of their languages' lexicon to merit inclusion. If Louisiana French were so different from general French that this would exclude a majority of its lexicon, we would need to consider whether or not to make it a separate language. (Wiktionary does consider Louisiana Creole French a separate language, {{lou}} — but Louisiana French is just {{fr}}.) - -sche (discuss) 06:17, 26 April 2012 (UTC)
I am not knowledgeable about French, but I do know there have been attempts in recent years to revitalize French in the US South. I am happy to change "Quebec French" back to "French" if anyone can say with reasonable certainty that nothing is being lost by doing so. I did the undo just to make sure that point is covered adequately :) --BenjaminBarrett12 (talk) 06:46, 26 April 2012 (UTC)
Because of Haitian French and perhaps other varieties (and the possible confusion in the vote), I changed it back to excluding French. If Missouri French or some other variety needs to be included, that decision can be made at some later time. --BB12 (talk) 18:06, 28 April 2012 (UTC)
I was just about to comment (literally, I got an edit conflict, lol): After a bit more digging, I see that Wiktionary does have a separate code for Cajun French, {{frc}}. But I will seek clarification in the BP of whether it is for etymologies only, or whether it is intended to have L2 header sections etc, and if the latter, I will suggest combining it and {{fr}}. (And as you say, we can add regional Frenches later if the community decides to keep them as separate languages and accord them undocumented status.) - -sche (discuss) 18:12, 28 April 2012 (UTC)
Thank you for the follow-up on French. I think that with the current wording, if it's decided that Cajun French or Missouri French is a language, it would qualify even if not listed as endangered in the UNESCO Atlas or the like. The other languages I was trying to address with the current wording are Pennsylvania Dutch and w:Plautdietsch, the latter of which is spoken in Europe. --BB12 (talk) 20:39, 28 April 2012 (UTC)
I think the various Low Germans, including Penn Dutch (pdc) and Plautdietsch (pdt), should be handled at once, and probably in a later vote, after we've clarified them: in the BP and on RFDO we're still discussing combining some Dutch Low Germans, and considering splitting some German Low Germans, and sorting out the Frisian might-be-Low-Germans. - -sche (discuss) 00:09, 1 May 2012 (UTC)

Making sure -sche's changes are covered

(Once again,) I think the wording is nearly ready to go to a vote. I want to make sure the changes -sche suggested on 15 April are addressed:

  1. languages voted as qualifying and named in the list of languages without a strong written tradition, and all pidgin, creole, koiné and patois languages with ISO codes
  2. Pidgin, creole, koiné and patois languages with ISO codes may be deemed by vote to have a strong written tradition.
  3. It is to be understood that languages will not be added to the list at WT:CFI/languages without a strong written tradition without a vote, and that a section of that page or another page will record any pidgin, creole, koiné and patois languages with ISO codes which are deemed by vote to have a strong written tradition. (It is also to be understood that languages need not be voted upon individually; entire slates of dozens or hundreds of languages can be voted upon at once.)
  4. The community of each language deemed not to have a strong written tradition must maintain a list of sources deemed appropriate as sources of attestation.
  5. Entries for terms in endangered languages and languages without a strong written tradition must contain at least one citation at the time they are created.

For number 1, I added creoles and pidgins as a basic category, but did not include the ISO code requirement. My thought is that because languages can be added or excluded by general consensus, the ISO code requirement is not necessary. For numbers 2 and 3, I've written the language change to be made by general consensus. For numbers 4 and 5, there are now the following requirements:

  • the language community should maintain a list of materials deemed appropriate as sources for entries based on a single mention,
  • each entry should have its source(s) listed on the entry or citation page, and
  • a box explaining that a low number of citations were used should be included on the entry page.

I think the current wording adequately covers these changes. --BenjaminBarrett12 (talk) 16:22, 27 April 2012 (UTC)

Mention of ldl template

I added the suggestion of using the "ldl" template (not yet written), so that people would not be lost on what to do about:

  • a box explaining that a low number of citations were used should be included on the entry page (such as by using the Template:ldl template).

I don't think that's controversial. The wording is still being worked out, but of course, it can be modified at any time. I've delayed the vote until 1 May just in case. --BB12 (talk) 21:02, 29 April 2012 (UTC)

For the community: a prototype (mostly by BB) is at User:Metaknowledge/ldl. --Μετάknowledgediscuss/deeds 22:27, 29 April 2012 (UTC)
No longer mainly by me and it looks much better now :) The "edit this page" link is external. I tried to make it internal, but couldn't get it to work. Is it a temporary thing? --BB12 (talk) 22:59, 29 April 2012 (UTC)
Oh yeah, sorry about that. Making an external link was really just a hack on my part (inserting a magic word into a URL). I'll see if I can make it look like an internal link by means of a subtemplate. --Μετάknowledgediscuss/deeds 23:01, 29 April 2012 (UTC)
Problem solved. For some reason, {{edit}} is different from WP's vastly superior version. Due to this lack, I copied the code to User:Metaknowledge/edit, and now it correctly displays as an internal link in the LDL template. --Μετάknowledgediscuss/deeds 04:51, 1 May 2012 (UTC)
Thanks to Yair rand for an infinitely more elegant solution. --Μετάknowledgediscuss/deeds 04:29, 3 May 2012 (UTC)

Not how I would do it, I'm afraid

I'd much prefer to name languages individually (with an ISO 639 or Wiktionary-only code, for extra clarity) than name them by language family or geographical location. While such a list would be large and take time to write, so what? We can start with undisputed cases and then add more on a case-by-case basis. Mglovesfun (talk) 14:32, 30 April 2012 (UTC)

I agree with both parts of your comment: that we should name each covered language explicitly, not just say "the languages in X family", and that we should add only the clearly defensible cases in this vote, and add more languages later, rather than trying to cover everything from the start in this vote. I did suggest both of these things earlier. I understand BB's desire to cover as many languages as possible from the start, but I continue to caution that expansive vagueness is what caused the first romanization-of-extinct-languages vote to be rejected. - -sche (discuss) 23:55, 30 April 2012 (UTC)
Although that makes sense, too, my concern is that the goal of a complete list would eventually comprise more than 6,000 languages, not merely time-consuming, but a mammoth undertaking to create and maintain. (FWIW, the language families are very specific because the Ethnologue lists the exact languages in them.) I don't know if it matters, but Metaknowledge and I in particular spent hours drafting and editing and rewriting the language list; as you can see from the African languages, we tried to be very careful in not listing anything we were not sure of.
Also, under this proposal, the language list is easy to amend as it requires only consensus, not a vote, so I did not think this was an extremely important issue.--BB12 (talk) 01:18, 1 May 2012 (UTC)
If this seems controversial, the first two items (Americas, Australia, Europe and Oceania) can be deleted and discussed in the BP if this vote passes (I would add Hunsrik to the current list in that case). The Southeast Asian item would have to be deleted as well, but I'm less sure about what to add other than Lao. Does that seem significantly better? --BB12 (talk) 02:46, 1 May 2012 (UTC)
Re Gloves: So what? So we'd be wasting the time of a valuable contributor like you who could be doing something of much greater importance here :-).
Re Benjamin: Please, let's not delete any geographical categories. I have been having unpleasant experiences with language nationalism recently, and this may be our best chance to get it in without unpleasantness. --Μετάknowledgediscuss/deeds 03:38, 1 May 2012 (UTC)
Re "mammoth undertaking": surely no more mammoth than trying to define words in all six thousand of those languages... ;)
We could (odd thought though this may be) only add languages as they come up, i.e. as people start adding words in them.
But alright, there's no harm in opening the vote with the familial/geographical categories as-is, and then if it doesn't pass because of concerns like those above, having a third vote on specific languages. (Don't worry, it won't be much work to draft a third vote: just re-use this vote's text, but name the languages.) - -sche (discuss) 04:21, 1 May 2012 (UTC)
Given Metaknowledge's feedback, I'm inclined to agree with -sche and let the vote run as it is. --BB12 (talk) 12:21, 1 May 2012 (UTC)


This change heavily complicates CFI for most languages, as the basics can't be understood from just the top bit since the number of necessary cites for most languages has been removed. I recommend changing the third bullet point to include something like "(usually three, with some exceptions)". The whole sentence "use that conveys meaning or, for certain languages, mention in an appropriate number of independent permanently recorded media spanning at least a year." is also rather difficult to understand. --Yair rand (talk) 09:46, 2 May 2012 (UTC)

I wanted to say something like "usually three," but since there are probably about 6800 languages that this doesn't apply to and only about 100 it does, it seemed misleading (even if true in terms of volume at present). I also dislike the wording in the third bullet point (and didn't like it when I tried rewriting it). How about this "use that conveys meaning (or mention for certain languages) in an appropriate number of independent permanently recorded media spanning at least a year"? FWIW, I think the part beginning "independent" is a little too much, but doing anything with it requires huge restructuring that cannot be handled under this proposal. --BB12 (talk) 18:08, 2 May 2012 (UTC)
How about keeping the original language and just adding a parenthetical: "uses in permanently recorded media, conveying meaning, in at least three independent instances spanning at least a year (different requirements apply for certain languages)". That's much simpler. --BB12 (talk) 19:12, 2 May 2012 (UTC)

WT:CFI/ > WT:Criteria for inclusion/

The vote content currently says that a page will be created at Wiktionary:CFI/Languages with limited online documentation. This is contrary to our usual practice of having CFI pages as subpages of Wiktionary:Criteria for inclusion. Can the link be changed to direct to Wiktionary:Criteria for inclusion/Languages with limited online documentation instead, or is it too late now that the vote is in progress already? --Yair rand (talk) 21:26, 6 May 2012 (UTC)

Sorry about not getting that correct. When the voting period is nearly over, we can start a conversation in the Beer Parlour about changing the directory. We need only consensus to change things like that :) --BB12 (talk) 03:19, 7 May 2012 (UTC)
I see no reason why we can't start that conversation now, but I will leave it to your discretion. --Μετάknowledgediscuss/deeds 18:29, 12 May 2012 (UTC)
It seems reasonable that the vote will likely pass now, so I added that conversation: Wiktionary:Beer_parlour#Hyperlink_change. --BB12 (talk) 05:47, 13 May 2012 (UTC)