Wiktionary:Criteria for inclusion/attestation

This page is intended for debate concerning the use of criteria such as Google hits, blog entries, IRC logs, emails and other online sources in supporting an entry. There is currently no strong consensus on this, but there appear to be two schools of thought (proponents of any position are encouraged to refine these characterizations, or add their own if no existing characterization is completely adequate):

  1. Only widely-published usages which can be verifiably attributed to known authors should count. A usage (presumably?) need not be in print to be considered widely published: online editions of major news sources, for example, should generally count.
  2. The underlying problem is to demonstrate that a term is (or has been) used and understood by independent speakers. Any available means may be used to this end, so long as independent use is demonstrated.

Position 1 (published usages only)

To some, "published uses only" may seem descriptive enough. However, some clarification has been requested. By published source I generally mean:

  1. Something that can be checked out of a public library, or
  2. Something that can be purchased at a local book store.

This, of course, does include internet searches that turn up passages from such books/magazines/movies/newspapers, etc. But as a conceivable, comprehensible rule of thumb, "what you can get from a library" should cover it.

Putting this all together: Any word that appears even once in any book that anyone happens to find in any library in the world is fine?

There are several reasons for desiring physical manifestations.

  1. Physical printings carry a cost for mistakes, and are therefore often subjected to finer review.
  2. Most items have a tracking authority making it possible to obtain copies of the original if a replacement is needed.
  3. Filtering of nonsense is done (in a very general sense) by the various staff (library staff, book store staff, consumers.)
  4. Publishing is expensive, mandating some editorial review (as opposed to zero review before posting on a blog, or a Wiktionary talk page.)
  5. Sources are easier to date.
  6. Attributions of coinages are easier, as books (unlike the internet) are generally not anonymous.
Even if we agree that published sources are inherently more reliable — and I don't grant that — there is a major practical problem: I claim that the word "frobulative" appears in "The Handbook of Rotative Flocculation", by Edmund G. Brewster, and I found this in my local library. Prove me wrong. -dmh 05:38, 22 Jun 2005 (UTC)
I found my original attempt at this from June 7th...incendiary even by my standards.
As to your contrived example: different cases require different amounts of research. Above I referenced a rule of thumb. Not a strict rule as you made it out to be. In the 'frobulative' example, I would not /expect/ that to be in my local library. If any of my handy dictionaries show that I was wrong, well then, your dubious term would have an additional citation. If not, then I'd be in for an extra trip to the library...something I've been known to do for truly questionable terms here. --Connel MacKenzie 06:25, 22 Jun 2005 (UTC)

By now we've tried a couple of experiments along these lines. For example, I went to a local bookshop and picked up a lovely tome called You are Becoming a Galactic Human, which contains all sorts of interesting material, only some of which I've taken the time to enter.

So far.

This book was physically printed, with the concomitant expense. In fact, a great many copies have been published. It has an ISBN. It has, according to its own acknowledgements section, been proofread by two separate copy editors. It is dated. It gives the names of two authors (though I suspect at least one is a pseudonym, which is by no means rare for published works).

In short, it is an impeccable source by the criteria above, certainly far better than the infamous "blogs, emails and chatrooms." There is no shortage of other books like it, not to mention magazines, newspapers, newsletters, posters and who knows what else. Yet, it seems somehow lacking as a respectable source for words or for that matter, other information. In the space of two pages, it defines photon belt as a wholly fantastic object unlike anything currently known to astronomers, misspells toroid and the name of the eminent physicist Paul Dirac, and for good measure predicts an event to take place in 1995 or 1996, which manifestly hasn't happened (or has it ...?)

Clearly, appearance in print does not lend any particular credibility per se.

Naturally, we could limit ourselves to more "respectable sources" (major newspapers, refereed journals, books published by major publishers and so forth), but fortunately no one has yet put forth a concrete proposal for that. Even if this could be done, I for one would not want to limit my involvement in Wiktionary to digging through official sources. In any case, even official sources can make mistakes and may also include combinations of letters that aren't actually words. These would have to be sorted out by criteria very much like those currently on CFI.

In short, adding a requirement for published sources does not inherently make for more reliable citations or eliminate the need for the more generally applicable criteria in CFI, but it does open up room for further dispute over which published sources to refer to, and it neutralizes many of the advantages of working online. It's not clear what as-yet-unproven improvement would be worth this. If the aim is not to give undue status to terms ("I saw pr0n in Wiktionary so I'm going to use it in my term paper"), there are better ways to accomplish the same end.

I suppose I should also state explicitly that I'm not against including published citations. -dmh 5 July 2005 20:07 (UTC)

Position 2 (demonstration of independent use)

First, let me try to describe this approach. Judging from discussion on RFD, there is apparently an impression that position 2 means "google the term and accept it if there are a few hits." This is certainly not the approach I use, and there is quite a bit of text in CFI aimed specifically at disallowing such a simplistic rule.

If I'm trying to support a term, I will very often start with a Google search (often along with a BNC search and sometimes a quick look at dictionary.com, for attestation only, of course). If it turns up nothing, or only Wiktionary and its mirrors, that is a reasonable indication that no one is using the term (but there are occasional exceptions). If all (non-Wiktionary) usages are linked to a single site or to clearly related sites, then there is no evidence of independence. In such cases, there will generally only be a few dozen hits to sort through.

On the other hand, if there are dozens or hundreds of hits, it's time to start looking for running text. Often there is quite a bit of noise, but it's generally not hard to find a few uses on different sites on fairly widely separated dates, in different contexts. I will only consider a term supported if this is true. If all I find are repeats of the same quote (and this sometimes happens), I don't consider the usage to be attested.

If there are thousands of hits, I still check to see if there are enough quotes to support each sense given in the definition, but there is obviously no need to check every single hit. At worst, you miss senses you otherwise would have found, and this can be rectified later.

All of the above applies per sense, not per term. If there are a dozen uses of a term, but no two appear to use it in the same sense, there is no support for any single sense and the term as a whole is not supported.
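
As a rough illustration of the per-sense check described above, here is a minimal Python sketch. The `Citation` record, the `site` label (clearly related sites are assumed to have been grouped by hand under one label), and the 365-day threshold are all assumptions made for the example; none of them comes from CFI or from any Wiktionary tooling, and judging which sense a quote illustrates still has to be done by a person.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Citation:
    """Hypothetical record for one quotation found in running text."""
    sense: str   # the sense of the term that this quote illustrates
    site: str    # source label; clearly related sites share one label
    when: date   # date of the quotation
    text: str    # the quoted passage


def _same_quote(a, b):
    # Repeats of one quote on several sites are not independent uses.
    return a.text.strip().lower() == b.text.strip().lower()


def sense_is_supported(cites, min_span_days=365):
    """True if two distinct quotes from different sites are far enough apart.

    min_span_days is an assumed threshold, not an official figure.
    """
    return any(
        a.site != b.site
        and not _same_quote(a, b)
        and abs((a.when - b.when).days) >= min_span_days
        for a, b in combinations(cites, 2)
    )


def supported_senses(citations, min_span_days=365):
    """Group citations by sense and return the set of senses that pass."""
    by_sense = defaultdict(list)
    for c in citations:
        by_sense[c.sense].append(c)
    return {s for s, cs in by_sense.items() if sense_is_supported(cs, min_span_days)}
```

With this sketch, a term whose dozen citations each show a different sense yields an empty set, which matches the rule that support is counted per sense and not per term.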

The controversial cases tend to be ones where Google only turns up a handful of supporting quotes, and these are (almost inevitably) from blogs, personal pages or other informal sources. The controversy is over whether this is good enough, and if not, what is.

I'm generally hard-pressed to explain why a term would appear in separate places on separate dates with the same apparent meaning unless it was commonly understood, by parties not otherwise connected, to mean something. There can always be extenuating circumstances, but our stated bias is toward inclusion, and the burden of proof should be on the assertion that the sightings are coincidence or all stem from a common source. To be clear, I personally do try specifically to rule these out.

It's quite possible that someone could coin a word, insert it into a few different publicly accessible places and then create a corresponding entry in Wiktionary. In fact, it's happened at least once, and the case I'm thinking of was easily detected and was eventually rejected. Even if it hadn't been, it's hard to see the harm. There are plenty of embarrassments in Wiktionary as it is. I'd rather include interesting usage than keep it out on the off chance someone might try to sneak one by. -dmh 17:21, 26 May 2005 (UTC)


(I've included some comments inline, in italics -dmh 6 July 2005 17:26 (UTC))

  • <Jun-Dai 19:09, 26 May 2005 (UTC)> In response to:
    • I'm generally hard-pressed to explain why a term would appear in separate places on separate dates with the same apparent meaning unless it was commonly understood, by parties not otherwise connected, to mean something.
      The clearest case where this would happen is when someone coins a term that can easily be understood, but is not generally known as a term. In the discussion about tubgirl we might be seeing this phenomenon. "Everyone" knows what tubgirl refers to,
Actually, I didn't. Fortunately, Wikipedia gives a summary of the underlying image. -dmh 16:20, 14 Jun 2005 (UTC)
    • so when someone says "I tubgirled that forum" or somesuch, it is clear to the reader what is meant, even if the term is not "commonly understood to mean something" or even commonly understood to exist. I'm not myself advocating a position on this matter, it's just that it is easy to see how words can be constructed and the meaning understood, even when the word doesn't really exist in any general usage. We all have different places to draw the line for the point at which a term enters the language, and this is part of the problem we are dealing with. Let me break down a couple of points around which we are seeing controversy:
      • Terms can easily come into existence on the Internet, and they can disappear just as fast.
        This isn't just true for the Internet. Any group of friends, for example, can construct a term. I'm sure many mailing lists about a particular subject have developed special terminology that never left that mailing list in a significant way. If we searched usenet for new terms, I'm sure we could quadruple the size of the OED, and yet most of those terms would already be dead words. Is there any reason that we should be biased towards printed forms for our citations about this? It seems like it would be as easy for fans of the band Dream Theater to create some new Dream Theater-related terms and mention them in their fanzine as it would be for any online community to do the same.

Terms easily come and go everywhere, not just on the internet. The internet records this phenomenon better, which is basically a feature, not a bug. The independence and one-year timespan criteria are aimed specifically at this issue (arguably the one year limit is a special case of the independence criterion). If independent sources are found to use the same term in the same sense more than a year apart, this creates a very strong presumption that the term is in use outside those instances.

      • People can construct terms the meaning of which might be immediately apparent but which never really enter the language.
        Corollary: if the basic term and idea are popular, this can happen in a number of isolated instances. A million monkeys and a million keyboards all connected to the Internet will surely produce the same newly coined term in multiplicity. I don't know if the term whiteboard as a verb has seen much use--I've never heard of it--but I could easily say something like "I was having trouble visualizing the data model, so I spent all afternoon whiteboarding it," and I'm sure that most people in my office would understand me. Moreover, I'm sure that this verb has been constructed and just as quickly abandoned by several dozen people at the very least. This too could happen as easily in print (and in this case probably has) as on the Internet. I suspect that verbing nouns in printed form existed long before the Internet.

The idiomaticity criterion is aimed exactly at this. If you can tell by looking at a new term what it means, then you don't need a dictionary, and there is no need to include the term here. On the other hand, there is not a strong need to exclude it should someone choose to enter it.

      • A term can reach tremendous popularity fairly quickly and still die off in a short time, and still deserve to be in a dictionary.
        I suggest google (as a verb) to be an example of this. The word has been used in journalism, all over the Internet, in commonplace speech, and probably even in narrative fiction. If, however, google.com went belly up next year, and disappeared from the search engine industry (hard to imagine, but I'm trying to condense the scale of this a little), the term would--most likely--die off in a few years. It might not (I doubt people would start saying "I MSNed it, and couldn't find much"), but yet it could fairly easily--social patterns are not always easy to predict. Other terms see temporary use, often regionally. We used to say moded in the 80s here in California a lot, and now people would look at you funny if you said that as slang (context: "oh, he's hella moded.").

To me, that's not a problem. We record all kinds of dated and outright obsolete usages. In terms of the basic guideline, if a word was widely used at some point, it's likely that someone will run across it (reading older material, or hearing one's elders, for example) and — particularly since the term is no longer current — want to know what it means. My personal opinion, by the way, is that words don't generally disappear as rapidly as they appear. They may be used orders of magnitude less frequently than in their heyday, but they probably don't disappear entirely for a good long time.

      • Whom are we trying to serve?
        I.e., should we be serving everyone that has a term they would like to look up, or only people that are looking up terms in printed form, or very general usage, etc.? This is part of a larger problem that I mention below.
      • Why should we impose limits?
        We are, after all, trying to avoid the notion that we have finite data-storing resources. We are not a paper volume, and we don't have the same pressing reasons for limiting our wordset. If we have different ideas of what words are "proper", is there really a good reason for those of us with more restrictive views to impose our views on those with less restrictive ones? Does it really hurt our project if we have obscure, obscene, tech-geeky, and/or improper words in the project? Sure, I might not want to enter teh as an alternate spelling or common misspelling of the (my perennial example), but why should I erase the work of someone who has added that? It certainly is true, and if anyone ever came here looking it up, that might be useful information (though in that particular case, it would be pretty strange). More realistically, someone might want to look up separate, and they might type in the common misspelling seperate. We can give them a useful service by showing them a link that says "common misspelling of separate". I don't feel that we need to provide that service, but I see no reason to prevent someone from including it.
    • In summary: it's hard to define where we should, as a group, draw these lines. We can probably agree that tubgirl, as a verb, has not really entered the language in a meaningful sense (as a noun, it is more controversial); it is probably an example of the second phenomenon I mentioned above. Does that mean we shouldn't have it? This sort of depends on what purpose the Wiktionary is supposed to serve. If we are to be like the OED, a source of primary research on language, then yes, it should be documented. I'm sure there have been hundreds of people that have encountered the term tubgirl and not known what it meant--are we aiming to serve them? Should they be able to come to the Wiktionary with the expectation of learning what the term means? Or do we aim to be more "proper" than that? I think we need a clearer vision for this project. Trying to determine the vision based on our criteria for inclusion is sort of backwards--we should determine the vision, and then settle on what the criteria for inclusion are based on that.

Hear hear. To me, CFI is just a means of recording current practice and whatever consensus and disagreements there may be. My more recent efforts have been toward that end.

I tend to agree with all this, particularly the "Why should we impose limits?" section. I'm frankly puzzled by the notion that "neologisms" or "protologisms" (in the expanded, and itself protologistic, sense of "narrowly used new words") are some sort of blemish and need to be kept out of Wiktionary.
I would put verbed nouns in general in the same class as non-idiomatic inflections, except that there doesn't seem to be a precise rule for determining which possible meaning a verbed noun will have. Perhaps X as a verb generally means "use X as an agent for some context-specific task", as
  • I'll whiteboard it (i.e., I'll use a whiteboard to help design it.)
  • I'll email you. (i.e., I'll use email to contact you)
  • I'll email the images. (i.e., I'll use email to send the images).
On the other hand, if we're trying to define terms in isolation, we would probably have to call out both "send email to" and "use email to send" as transitive senses of the verb "email".
Back to CFI. My experience with potentially productive constructions — verbing nouns, tacking on affixes like e- and -illion, using Pig-Latin or funny spellings, abbreviating a long word with the first and last letters and a number in between, and so forth — is that the actual usage is not what one would expect by sheer chance. Some nouns are verbed more than others, and in particular ways. Affixes are not applied to just any word, only a few Pig-Latin constructions or other funny spellings are in common use, only a few long words are abbreviated with numbers, and so forth. There is something linguistic going on here, and it would be worthwhile to record the raw data, particularly since there don't seem to be clear and universally accepted explanations for all of these. By contrast, inflections like -ed and -s are well-known and uncontroversial. They also apply nearly universally (in fact, -ing does apply universally — not even be is irregular in this respect).
In short, I would tend to include nonce-like usages like whiteboarding, on these grounds. Again, there seems to be little harm, and we don't seem to have a problem with regular inflections. In other words, the "idiomatic" clause seems to apply more to roots than to inflected forms, but the "attested" clause applies to both. -dmh 16:20, 14 Jun 2005 (UTC)