Wiktionary talk:Public domain sources

WordNet

Have you Wiktionarians given any thought to using a public domain dictionary and thesaurus as a base for the Wiktionary? The 1913 Webster's Dictionary is available here, and you can get an old copy of Roget's Thesaurus from Project Gutenberg. — This unsigned comment was added by 207.179.133.40 (talk) at 12 December 2002.

The GNU Project has GCIDE, the GNU version of the Collaborative International Dictionary of English, which is available under the GNU GPL.

There is also the far more modern, and therefore more useful, WordNet dictionary created by Princeton University. Their license looks like it might be compatible with the GNU FDL, so long as we link to the text below, perhaps under a ==References== heading:

          This software and database is being provided to you, the 
          LICENSEE, by Princeton University under the following 
          license.  By obtaining, using and/or copying this 
          software and database, you agree that you have read, 
          understood, and will comply with these terms and 
          conditions. 
          Permission to use, copy, modify and distribute this 
          software and database and its documentation for any 
          purpose and without fee or royalty is hereby granted, 
          provided that you agree to comply with the following 
          copyright notice and statements, including the 
          disclaimer, and that the same appear on ALL copies of the 
          software, database and documentation, including 
          modifications that you make for internal use or for 
          distribution. 
          WordNet 1.7 Copyright 2001 by Princeton University.  All 
          rights reserved. 
          THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND 
          PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR 
          WARRANTIES, EXPRESS OR IMPLIED.  BY WAY OF EXAMPLE, BUT 
          NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO 
          REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR 
          FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE 
          LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT 
          INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS 
          OR OTHER RIGHTS. 
          The name of Princeton University or Princeton may not be 
          used in advertising or publicity pertaining to 
          distribution of the software and/or database.  Title to 
          copyright in this software, database and any associated 
          documentation shall at all times remain with Princeton 
          University and LICENSEE agrees to preserve same. 

I reformatted their entry on dog to redo our dog article. Notice that synonyms are already included for each definition. IMO the corresponding 1913 dictionary definition is far inferior and needs to be heavily edited. I've provided that on dog's talk page for comparison. If WordNet is our primary source then we could still use the older texts to selectively add other information (such as I did for the pronunciation). --Maveric149, 13 December 2002

Can we get Princeton to donate their content under the GFDL? This would dispel any remaining worries about licensing. Looking at their existing licence, they might well be willing to do this if we ask nicely. The Anome, 16 December 2002
Good idea. I've started a letter at Wiktionary:Princeton wordnet. -- Merphant, 16 December 2002

Webster 1913

See Wiktionary:Webster's Dictionary (1913) for some efforts based on the 1999 Gutenberg transcription of the public domain 1913 Webster's Dictionary. Yes, I know it's not ideal, but at least it's a start for later editing based on other sources.

See abate for an example of the derived markup. I have deliberately tried to make the output of the filter suitable for later processing, for use when a template format is agreed on. If this isn't right... at least it's wrong in a way that should drive constructive suggestions for improvements. -- The Anome 23:35 Dec 15, 2002 (UTC)

A question: would it be useful if I was to reformat the Webster entries using the format of dog as a basis? This could then act as a skeleton for later updates, both human and automatic. -- The Anome 08:43 Dec 16, 2002 (UTC)

A subsequent note: my Wikification of the 1999 Gutenberg version of the 1913 Webster's seems to be more or less equivalent to GCIDE, as it uses the same SGML-style source material, generated by the people at MICRA. However, GCIDE appears to have no current active community.

Posting the Webster stuff here would therefore have more-or-less the same effect intended by the GCIDE project, and Wiki stuff could be re-imported into the XML GCIDE later. The more structured the entries here are, the easier it will be to move material between the two formats. The Anome, 16 December 2002


A proposal: I can dump the whole 1913 Webster's in here as articles named in a form such as Abate (Webster 1913), Webster 1913:Abate, or Webster 1913/Abate. I can do this with the Websterbot. This would allow later additions of, for example (subject to permission), WordNet:Abate or WordNet/Abate, and would leave the main "namespace" unpolluted by automated article drafts. These articles would then be a resource for the manual or automatic creation of articles in the main namespace, based on the agreed final template.

What do you think? If there are no strong protests, I will fire this up later today, as this seems to have some benefits (not least, the presence of at least a poor first draft for most English words), whilst not affecting the main namespace in any significant way. The Anome 14:12 Dec 17, 2002 (UTC)

But none of the proposed "namespaces" are in fact real namespaces. They still will be "articles" as far as the software is concerned. Anybody using the random page function will land on them and they will be counted in the stats as articles. The 1913 dictionary isn't going to go anywhere. --mav

Not if we make "Webster 1913:" a real Wikipedia namespace. This should take about two lines of code (I hope) and would hide the articles completely from random inspection, whilst keeping them visible for anyone who wants them. -- The Anome, 17 December 2002

Yes, please do. In fact, some altering of the internal script to create automatic links to the included public domain dictionaries would be ideal. I think I could probably do it myself: if somebody with a mirror of the Wiktionary would allow me to use it as a testing ground, I would be able to test this without actually messing with the real codebase, which would likely cause unnecessary downtime. Please note that a namespace would be needed first, though. Tacvek, 13 June 2003


I have now started the Websterbot up. Please note that none of this is meant to subvert or pre-empt the writing of original entries in whatever format is eventually agreed by the Wiktionarians: that's why these articles are marked as "Webster 1913:". Its edits don't show in Recent Changes: to see what it is up to, click here.

The intention is to be a "pump-priming" exercise, by providing:

  • abundant source material, including quotations, in a pre-Wikied form
  • a matrix of "wanted article" links back to the main namespace

Please bear with me for now, and allow this experiment. The articles can later be reformatted, renamed, or moved into their own namespace, or deleted -- but my intention is that they should serve the purpose of being a feed-stock for the new dictionary for at least the next month or two. The Anome, 20 December 2002

Phonetic transcription

About phonetic transcription: There are two kinds of sources one could use here. On the one hand, ready-compiled word lists with transcription (pronunciation lexicons); on the other hand, grapheme-to-phoneme rules which can give a tentative transcription of an unknown letter combination (for a given language, of course). A good starting point would be http://freetts.sourceforge.net, which is about as open as source can be. I suppose their license would allow use of their data and code for Wiktionary, although someone should check this.

In a first step, one could import their pronunciation lexicon, for US English and using a phonetic alphabet which would need to be converted to IPA or SAMPA (actually, there is a conversion table towards SAMPA with the MBROLA voices in FreeTTS).
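The conversion step could look roughly like the following sketch. It assumes the lexicon uses ARPAbet-style phone symbols with optional stress digits (an assumption to be checked against the actual FreeTTS data), and the table ARPABET_TO_IPA and the function to_ipa are hypothetical names covering only a handful of phones for illustration.

```python
# A minimal sketch, not the FreeTTS format: we ASSUME ARPAbet-style symbols
# such as "ao1" (phone plus optional stress digit). The table is deliberately
# partial; a real table would need every phone in the lexicon.
ARPABET_TO_IPA = {
    "ae": "æ", "ao": "ɔ", "ax": "ə", "eh": "ɛ", "ih": "ɪ", "iy": "i",
    "b": "b", "d": "d", "g": "ɡ", "k": "k", "l": "l", "m": "m",
    "n": "n", "p": "p", "s": "s", "t": "t", "z": "z",
}


def to_ipa(phones):
    """Convert a list of ARPAbet-style phone symbols to one IPA string."""
    pieces = []
    for phone in phones:
        # Drop a trailing stress digit (0, 1 or 2) and normalise case.
        base = phone.rstrip("012").lower()
        if base not in ARPABET_TO_IPA:
            # Fail loudly so missing table entries are noticed, not skipped.
            raise KeyError(f"no IPA mapping for phone {phone!r}")
        pieces.append(ARPABET_TO_IPA[base])
    return "".join(pieces)


if __name__ == "__main__":
    print(to_ipa(["k", "ae1", "t"]))
```

Stress marks are simply discarded here; keeping them would mean inserting the IPA stress sign before the stressed syllable, which requires syllable boundaries that a flat phone list does not provide.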

In the longer term, it would be cool to enhance the Wiki software to add an automatic transcription functionality to the Template for new entries. This assumes the software can identify which word to pronounce, but that may simply be the entry title. So this could actually work. The transcription software would have to run on the server and be called by the code generating the template for a new entry. Probably it would be necessary to dig into FreeTTS code a bit to stop once the word is transcribed (we don't need it spoken, do we? -- actually, even that could be cool!)

Anyway, so much for this. Unfortunately, I don't see when I could do this myself within the next few months. Any opinions on this? Does anyone feel this could be worth trying?

MarcS 22:13 Feb 14, 2003 (UTC)

Oxford English Dictionary

The first fascicle of the Oxford English Dictionary was published in 1884, and it was published in fascicles until completion in 1928. I don't know if anyone has produced a publicly available electronic version of the portions of the dictionary currently in the public domain, but they are a great source of word history, if Wiktionary wants any of that. Dfeuer 20:42, 27 January 2006 (UTC)

Someone is scanning/photographing the fascicles. See his letter of intent and his progress as of March 16, 2006, where he's uploaded some sections. He's scanning those published in the US before 1923 (maybe because in the UK the copyright is extended to author's life + 70 years). It doesn't look like there are any plain text files converted using OCR. --Bequw¢τ 02:23, 23 September 2008 (UTC)

anachronism detector

Meanwhile on English Wikisource: Correcting an 1883 'done' that the OCR had seen as the 1903-or-later 'clone', I got to thinking about https://merriam-webster.com/time-traveler and how something similar might be done using the combined corpuses of Wiktionary and Wikisource. A positive will either indicate a new terminus post quem or a misread. But is it feasible? Arlo Barnes (talk) 09:23, 22 January 2024 (UTC)
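
As a rough feasibility sketch of the core check (not an existing tool): given a table of earliest-attestation years per word, flag any word in a dated text whose first attestation is later than the text itself. The table FIRST_ATTESTED and the function find_anachronisms are hypothetical; only the 1903 figure for 'clone' comes from the comment above, and the other year is a placeholder. In practice the table would have to be built from dated quotations.

```python
import re

# Hypothetical earliest-attestation years. "clone": 1903 is taken from the
# comment above; the year for "done" is a PLACEHOLDER, not real data.
FIRST_ATTESTED = {"clone": 1903, "done": 1000}


def find_anachronisms(text, text_year, first_attested):
    """Return (word, first_year) pairs for words attested later than text_year."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group(0).lower()
        first_year = first_attested.get(word)
        # Words missing from the table are skipped rather than flagged.
        if first_year is not None and first_year > text_year:
            hits.append((word, first_year))
    return hits


if __name__ == "__main__":
    print(find_anachronisms("The work was clone in haste.", 1883, FIRST_ATTESTED))
```

Each hit still needs a human decision: either the scan is misread, or the word is older than the table says and the attestation date should be revised.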