Concordance talk:Sherlock Holmes
Interesting. How about renaming this page as "Concordance of Sherlock Holmes", as that is what it is.
- Sounds fine with me. —Długosz
How are "trap" or "puritanical" rare or obsolete? These are everyday words. "Moor" is also well known, although possibly not outside the UK.
- I think the usage was strange, but I don't remember. I made a list of words to look up as I was reading, and that was on it. —Długosz
Might this appendix also be a "subappendix" of an appendix "Appendix:Concordances" which could include concordances of the Bible and the works of major authors such as Shakespeare, Dickens, Milton, etc? -- Paul G 09:49, 18 Mar 2004 (UTC)
- I do plan on adding others. —Długosz
By the way, what is "the Canon", and where does this list come from? Has it been copied from somewhere with permission? -- Paul G 09:51, 18 Mar 2004 (UTC)
- The Canon is the official list of stories by the author, as opposed to those written later by others. I made the concordance myself, based on the full texts found at Project Gutenberg. —Długosz
I admit that my immediate reaction when I first saw this was negative. Further reflection at least led me to believe that there is a place for this kind of thing. I can't yet be sure that this is the right place. I do have a few observations.
- Concordance would be a more appropriate name, but a name change can wait until we have decided where this is going.
- I agree that some of the words that were cited as being rare are not really rare, but this is a fairly minor point in the context of this initiative.
- If there is to be a "Canon" it should be very clear what is on that canon, and it should not depend only on what is included in Project Gutenberg. A little work has already been done to include some of Doyle's works on Wikisource. I normally try to discourage putting things on Wikisource when a suitable version is easily available, but establishing a concordance may be cause for reviewing that opinion. Eclecticology 19:27, 29 Mar 2004 (UTC)
- I like the name Concordance. If someone with suitable powers would rename it (and the one that points to it), that's great with me.
- There are numerous word lists on Wiktionary already, and at the very least they serve a purpose of collecting words that need to be in a dictionary. Meanwhile, I find myself needing to look up words in a dictionary when I read something that's from another country and another time, so even rare and obsolete words need to be part of a dictionary if they are part of popular literature. Lastly, I find that using quotations from Holmes to illustrate words is more interesting than using Shakespeare or the King James Bible or poetry, as found in the 1913 Webster's. I just can't relate to the old quotes, and there is no context other than the part of speech! As for leaving proper names in the Lexicon, it's for completeness and I'm hoping that having a Concordance on line (rather than just a germination for my to-do list) will attract the Holmes on-line community to start adding definitions as well. I'm not making entries for fictional proper names. —Długosz
We are basically in agreement on your main points. If I hesitate to make the needed move it's because I want to look at this in terms that go beyond Sherlock Holmes. As things stand Wiktionary has about 1,500 pages of indexing and other meta-data; I would like this initiative to become a part of the solution, rather than a part of the problem. I agree that rare and obsolete words should be a part of Wiktionary, and I believe that they will all be added in the course of time. The 1913 Webster alone claims about 109,000 entries, so there is still a lot to be done. Shakespeare and the KJV are a part of the history of the language, so their uses are important. So too would be several other authors who represent the usage of the language in their own time; Arthur Conan Doyle is an excellent example for this kind of treatment.
Note that I say Doyle rather than Holmes. I think that some of Doyle's other works should be a part of this. This is one reason why we should not depend on Project Gutenberg to define our corpus. I would build into this initiative the capacity to create such a list for a single Holmes story as easily as for the entire body of work. I would appreciate your feedback on how your idea could fit into a bigger picture.
BTW, I glanced through your "a" list. I wouldn't be too quick to say that "agave" is a typo; the usage is unusual in that way, but a rationale in its support could be made. It is clear that he was not referring to the cactus whose name is spelled the same way. :-) Also in considering "acetones" in the plural, what you say may be very true in today's understanding of chemistry, but the proper reference point is with the state of popular chemical understanding when the story was written. Eclecticology 19:21, 31 Mar 2004 (UTC)
Would someone please define Concordance !
What on earth is a Concordance ??? No definition in Wiktionary. Before you try to decide if this is or is not a concordance, perhaps someone can do the ground work of defining what a concordance is ?--Richardb 12:57, 14 Nov 2004 (UTC)
- Good point. I'll put that higher on my priority list. For now, a concordance is an alphabetical list of words found in a text or group of texts. These often (but not necessarily) include citations of the places where they occur. The most famous concordance is that of the Bible. Eclecticology 17:26, 15 Nov 2004 (UTC)
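The definition above can be illustrated with a minimal sketch. It assumes that a "citation" is simply the line number where a word occurs, and the function name `build_concordance` and the sample text are made up for illustration; this is not how the existing concordance pages were produced.

```python
import re
from collections import defaultdict

def build_concordance(text):
    """Map each lower-cased word to the 1-based line numbers where it occurs."""
    entries = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Keep apostrophes and hyphens that sit inside a word.
        for word in re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", line):
            key = word.lower()
            if line_no not in entries[key]:
                entries[key].append(line_no)
    # Alphabetical order, as in a printed concordance.
    return dict(sorted(entries.items()))

# Hypothetical two-line sample, not a quotation from the stories.
sample = "The moor was dark.\nHolmes crossed the moor betimes."
concordance = build_concordance(sample)
for word, lines in concordance.items():
    print(word, lines)
```

A fuller version would presumably cite story and paragraph rather than a bare line number.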
Did Conan Doyle invent / bring into being any new words ?
A number of authors (well, maybe only Lewis Carroll) around the time of Arthur Conan Doyle invented or brought to the public new or otherwise unknown words. If Conan Doyle did this, it would be interesting to list those words here as a specific section.--Richardb 12:57, 14 Nov 2004 (UTC)
- One reason for Doyle's continued popularity is his easy writing style. I can't back up my statement right now, but it would come as a surprise to discover that any writer of stature would not have added any new words to the language. Eclecticology 17:26, 15 Nov 2004 (UTC)
What is the point of a Concordance in Wiktionary ?
The reader could just directly look the word up. Why do we need a list of all words used in Sherlock Holmes stories, in a dictionary?--Richardb 15:10, 19 Nov 2004 (UTC)
- The point that seems to be missing is the purpose of a dictionary. A descriptive dictionary will chronicle a word's usage over time. The list of quotations that it contains will illustrate those usages. A prescriptive dictionary merely gives you definitions of words, and tells you how to use that word. If all we were doing were the latter your point would be well taken. Eclecticology 00:17, 20 Nov 2004 (UTC)
- Fair enough. Don't think it would be my personal choice, but then it wouldn't be a Wiki would it if people couldn't adapt it to what they want it to be --Richardb 05:53, 20 Nov 2004 (UTC)
Very cool stuff here!
Dlugosz, excellently done. I just tripped over this this evening. Is there any effort to take *all* the books on Project Gutenberg, form a word-list of the whole mess, then perhaps even remove defined terms? Perhaps even remove plurals and other senses of known words? Hmmmm. Maybe it is time I got an off line copy of Wiktionary to play with. --Connel MacKenzie 07:50, 14 Jan 2005 (UTC)
- I really liked the idea when he started this, and was sad to see him abandon the project. The complete set of the Sherlock Holmes stories is now in Wikisource, and I've helped to add several more Conan Doyle works over there. It would be nice to take this list and break it down by story. (MS Word allows a text to be sorted alphabetically.) The usages could then be traced through the stories and cross-referenced. This would take a lot of work, but if we ever had the manpower ... Eclecticology 00:52, 19 Jan 2005 (UTC)
- Yes, when I realized what I was looking at, the wheels started turning. Wiki is far too addictive as it is. But it sure would be neat to have a list of undefined words RANKED by number of uses, of the combined mess. I'm a MUMPS database programmer by trade and choice. It shouldn't be too hard to index the words to my liking, but that is a lot of data no matter how you look at it. Perhaps if I buy myself a new disk drive as a belated self Christmas present...
- I think an amalgamation of as much Conan Doyle as I can find would probably approximate most common words in the whole language; having some precedence for definitions would be encouraging, (much like the top 2000 linked words at Special:Wantedpages.)
- A cruder approach would be to wikify every single word then post the result to my talk page...wait a week, then check back into the aforementioned automatic list. I think that would be borderline evil, though. Can Wiktionary handle many-megabytes of wikified links at a time? Maybe broken onto separate pages by chapter? --Connel MacKenzie 05:14, 19 Jan 2005 (UTC)
A whopping 7 lines of code, executed on the 71KB file, and voila; User talk:Connel MacKenzie/doyle. The Adventure of the Bruce-Partington Plans by Sir Arthur Conan Doyle (first line got cut from my debugging.) --Connel MacKenzie 09:40, 19 Jan 2005 (UTC)
- It's nice to see that your week only lasted 4 hours and 26 minutes. :-) Simply Wikifying every word in an article underestimates the power of this tool. The difficulty that such a process causes is that when anyone looks up the article the server needs to check every wikified word just to determine whether it should be written in red or blue.
- Ahem, I *did* say it was a crude approach.
- I'm not familiar with MUMPS, so I can't comment on its potential in this context. Still I don't see disk storage as the major impediment. Enough manpower to develop some of the ideas is a much bigger problem. This kind of tool can be used to trace the meaning and usage of a word over the centuries if we do this with a series of authors as representative of their own time period. The 71kb for your sample story seems about the right size. This could allow for separate analyses of each story as it relates statistically to the entire corpus of Doyle's work. There might be material here for a Wikiversity doctoral thesis! Eclecticology 20:24, 19 Jan 2005 (UTC)
- I think I need some kind of reality check. Wikiversity? Thesis? Doctoral Thesis! Oh my. I'm not driven by any of that, I'm just quite curious.
- My initial test of doing a story was both encouraging and disappointing. I am delighted to see that the majority of the words are blue. The reds are mostly proper names or plurals or other adjective senses (for defined words that include them in their definition.) I had hoped that perhaps an entire section would stand out. But on reflection, it seems the crude approach has some tremendous value.
- Project Gutenberg has about 1.2 GB of zip files of text waiting for me. The next thing I'd like to try is to build an index of words, with only a count of occurrences. (That was an extra 2 lines to the above snippet of code.) I'd like to then subtract words that are defined in Wiktionary. Then subtract other senses of words that are defined (i.e. anything in bold or wikified on the same line that the entry text appears.) E.g. refrigerate would yield refrigerated and refrigerating. Subtract those (that should knock out the obvious plurals and adjectives.) Then spit the remaining words back out in order of occurrences.
- For such an exercise to have a significant (er, overwhelming?) effect on Special:Wantedpages, I'd have to re-list them all at least twenty times. I'm sure there's a better way, but at least this is dynamic (for newly added words) as well as tried and true. Obviously, I'm not worried about disk space on wikt., but the rendering CPU cycles are a significant factor. But these are not pages that *anyone* really should ever load. They are just being stuffed to skew the results of Special:Wantedpages more towards the English language.
- Of all the disks I have laying around, I don't think a single one has 1.2GB free. If I take the time to archive a plethora of stuff off my laptop this weekend, I should be able to start my 'wget' Saturday evening. I'm sure I can unzip and process the files as they come in, freeing more space as I go. (The MUMPS database will not expand in any significant amount...key compression, only saving counts, dynamically extensible, load balancing, balanced b-tree, etc.) so that megabyte (or two) won't be a factor.
- But I'm not real good at parsing raw SQL dumps. I saw Hippietrail's code around somewhere, and it looked simple enough. But then I'll need disk space for the raw dump as well. :-(
- Yeah, the proof of concept I did a couple hours later, but the result of my next exercise will take two to three weeks - most of that time freeing up disk space. And my employer wants me to do real work (imagine that!) during that time, so I probably won't finish this before Easter (looking at it realistically.)
- I think I better post a question on Hippietrail's talk page, and see if he has a list of defined terms. Even better, would be a list of defined terms that have more than, say 50 characters, in their article entry. (100 characters?) Then, even weak definitions could be included in the resulting list.
- Hasn't anyone tried this before? --Connel MacKenzie 03:22, 20 Jan 2005 (UTC)
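The count-subtract-rank plan described in this comment can be sketched as follows. This is only an illustration in Python, not the MUMPS code mentioned above; the function name `rank_undefined` and both word sets are hypothetical stand-ins for lists that would really come from a Wiktionary dump.

```python
import re
from collections import Counter

def rank_undefined(text, defined, inflections):
    """Count words, drop those already defined or listed as inflected forms,
    and return the rest ordered by number of occurrences (most frequent first)."""
    counts = Counter(w.lower() for w in re.findall(r"[A-Za-z]+", text))
    skip = {w.lower() for w in defined} | {w.lower() for w in inflections}
    remaining = {w: n for w, n in counts.items() if w not in skip}
    # Sort by descending count, then alphabetically so ties are stable.
    return sorted(remaining.items(), key=lambda item: (-item[1], item[0]))

text = ("Suppose we refrigerate it. Suppose it was refrigerated. "
        "I suppose so, betimes.")
# Hypothetical list of words that already have entries.
defined = {"we", "it", "was", "i", "so", "refrigerate"}
# Hypothetical forms found in bold or wikified inside those entries.
inflections = {"refrigerated", "refrigerating"}

ranked = rank_undefined(text, defined, inflections)
print(ranked)
```

Only the counts are kept, so memory use grows with the number of distinct words, not with the size of the text.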
- I'm still baffled that this exercise isn't done frequently. I got the information I needed from Hippietrail's user pages (I wonder if he knows how much that stuff helps, randomly?) downloaded and imported all the ASCII (big sacrifice - I know) English pages of Wiktionary, and built a table of all words that don't reference an existing article title. (I also excluded plurals and other senses in a separate list.) I'll soon be able to take that and plop the top thousand (or two or three) words into one of my usertalk subpages, in order of references. Perhaps I should import all of WIKIPEDIA doing the same thing; list all words that have no wiktionary definition. I'll also need to work on some kind of automation for propagating redirects for the plurals and other senses. Fully automated won't work, as there is plenty of cruft there (and who knows how many have been defined since 7-JAN-2005.) But I'm trying to think of ways to get it down to just a single click to submit the appropriate form with the redirect already stuffed.
- I've made some decent progress at work (on work, not wiki) which always seems to raise my spirits. But I haven't made any progress at all on clearing space off this laptop for the concordance data-slam. Also, I'm less impressed now at that first attempt...removing ALL punctuation is not a good long-term approach. Since I preserve the strings I'm traversing, it isn't too hard to reinsert the punctuation after wikifying the individual words. But that does make it possible to break wikifying of some oddly punctuated text. I'll have to try it, to see if it's worth the trade off (all punctuation, some broken links, vs. no punctuation, guaranteed no syntactically broken links.)
- Must sleep now. --Connel MacKenzie 09:02, 27 Jan 2005 (UTC)
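One way to keep the punctuation while wikifying is to wrap only the word characters and leave everything between words untouched. The sketch below is an assumption about how that could be done in Python, not the code actually used for the test pages; the name `wikify` and the word pattern are illustrative.

```python
import re

# A word is a run of letters, optionally joined by internal apostrophes or hyphens.
WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def wikify(text):
    """Wrap each word in [[...]] and leave the punctuation between words as it is."""
    return WORD.sub(lambda m: "[[" + m.group(0) + "]]", text)

wikified = wikify('"Well, Watson," said he.')
print(wikified)  # "[[Well]], [[Watson]]," [[said]] [[he]].
```

Text that already contains square brackets would still produce broken links with this approach, so those characters would need separate handling.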
What can be done here extends far beyond simply finding a listing of words to include. Hippietrail's algorithms have been fruitful in finding things that really should be fixed. The problem is in all of us finding time to fix everything there that needs it. The results that some of these bots produced can be discouraging, and running them more frequently can only make that sentiment worse. Applying these techniques to the whole of Wikipedia!!! seems to be a task on a par with using a drinking straw to suck up the entire ocean. We would do better to both work out the bugs and explore the potential in the context of the single Doyle story. Once that is done a second text can be used for hypothesis testing.
I agree that stripping out all the punctuation may not be the best course. Apostrophes and hyphens especially are usually best left in. Even segregating plurals and other inflections may not be a good thing. While I still consider the creation of Wiktionary articles on most plural words to be a harmless exercise of dubious value, I attach some importance to the integrity of the original text. Studies in the statistical analysis of an author's texts (helpful in establishing the authenticity of a text when authorship is disputed) may depend on such data as how often the author uses the plural. Eclecticology 10:02, 28 Jan 2005 (UTC)
- Perhaps you misunderstood the warm-up sub-task I'm trying now. Composing a list of words that "should" be redirected, along with the word they should be redirected to, seems a useful exercise, especially if I can clean up the exceptions during a manual phase. Then to pass the list off to Hippietrail to automate? Perhaps even as a direct SQL load? If not, then as a manual list on my talk page to let people cut-n-paste the redirects. I completely understand and sympathize with the negative sentiment about bots.
- Now, the only technique I was talking about doing was running all of wikipedia on MY LOCAL COPY to generate a word frequency count (dropping the actual text after counting.) By ensuring that common words are defined first, wiktionary ends up looking better to an outside observer. (That's my theory, and I'm sticking to it!)
- I think you are right. I'll try this again, filtering only "[", "_" and "]". Pasting that test in the above page in just a minute from now...
- Back to the concordances...what a cool idea! List the top 100 words (or 1000, 2k, whatever) per book. Sorted by date, yes, I do think you are right; that will be a telling indication of language trends! Even just one author, showing how language use morphed (um, what's the word I'm looking for) during a single author's lifetime...comparing overlaps. I agree that this seems quite worthwhile. Right now, I still need some local diskspace. --Connel MacKenzie 11:36, 28 Jan 2005 (UTC)
OK, as you requested then: a frequency count of just that first story. User talk:Connel MacKenzie/test2. Clearly my simple parsing is too simple. But is this what you were getting at? I think it would be more worthwhile to exclude entries that occur three times or less. Doing the same for only the top 500 words yields User talk:Connel MacKenzie/test3.
Hmmmmm. Maybe this would be a more interesting study to look at duplets - do the same analysis on word-pairs? Excluding all ones of course...again, results for only the first (little over) 200 word pairs can now be found at User talk:Connel MacKenzie/doyle signature. That looks like the signature you were asking for. Hmmm, I wonder how many of these word-pairs are wikified?
--Connel MacKenzie 09:09, 30 Jan 2005 (UTC)
heh. Only 6 of the wikified duplets were defined. --Connel MacKenzie 09:12, 30 Jan 2005 (UTC)
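The word-pair ("duplet") count with the single occurrences excluded can be sketched like this. It is a simplified illustration under stated assumptions: pairs are formed from adjacent words after lower-casing, sentence boundaries are ignored, and the name `pair_counts` and the sample sentence are made up.

```python
import re
from collections import Counter

def pair_counts(text, min_count=2):
    """Count adjacent word pairs and keep those seen at least `min_count` times."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)]
    pairs = Counter(zip(words, words[1:]))
    return [(a + " " + b, n) for (a, b), n in pairs.most_common() if n >= min_count]

# Hypothetical sample, not a quotation from the story.
sample = "said he. I have no doubt, said he, that I have seen it."
top_pairs = pair_counts(sample)
print(top_pairs)
```

Because punctuation is dropped before pairing, a pair such as "he i" can straddle two sentences; a stricter version would split on sentence ends first.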
- There's lots of interesting material in what you've done. I see that you already noticed about the separate listings for "he", "He" and "he.". This suggests that a case insensitive analysis might be useful. The "... said he." construction can be a useful indicator of an author's style, so it may be necessary to deal with some of these manually.
- I've also drawn a couple of other observations from the data. The most common word for which Wiktionary would need a definition was "suppose", which occurred 10 times. There were three different explanations for words in red that occurred more frequently: simple plurals of words that we have already defined, words that included punctuation, and proper nouns.
- There were 10,818 words in the text. The 88 most common words accounted for half of that number. Each of these appeared at least 17 times.
- The words that occur less frequently would give us fertile sources of quotes to illustrate the use of these words. In addition to Doyle I think that this project in co-operation with Wikisource should be building a series of reference authors. Wikisource would strive to include everything available by these authors within its database. A possible set of first entries, chosen to represent their time periods might be Francis Bacon (1600), Richard Steele (1700), Walter Scott (1800) and Doyle (1900). Shakespeare was specifically out of consideration because he wrote in verse. Others, of course, would be added later. Eclecticology 21:41, 30 Jan 2005 (UTC)
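The case-insensitive count suggested above, and the observation that a small number of common words accounts for half the text, can both be checked with a short sketch. The function name `words_covering` and the sample text are assumptions for illustration; the figures quoted in the comments (10,818 words, 88 words) come from the real story, not from this toy input.

```python
import re
from collections import Counter

def words_covering(text, share=0.5):
    """Return (rank, total): how many of the most frequent words are needed
    to account for `share` of all word tokens, and the total token count."""
    # Lower-casing merges "He" and "he"; stripping punctuation merges "he." too.
    tokens = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    counts = Counter(tokens)
    target = share * len(tokens)
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= target:
            return rank, len(tokens)
    return 0, len(tokens)

# Hypothetical sample text.
sample = "He said he would. He did. The dog said nothing."
needed, total = words_covering(sample)
print(needed, total)  # 2 10
```

Constructions such as "... said he." are lost in this merge, so they would still need to be examined separately, as noted above.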
OK, should I just pick the earliest of each, the one published closest to the century mark each, or do you have a favorite for each? Please give me a list of title requests, and I'll knock a 1/2 dozen off in short order... --Connel MacKenzie 04:15, 31 Jan 2005 (UTC)
OK, I'm pulling:
VALERIUS TERMINUS: OF THE INTERPRETATION OF NATURE (by Sir Francis Bacon)
ISAAC BICKERSTAFF, PHYSICIAN AND ASTROLOGER. by Richard Steele.
LETTERS ON DEMONOLOGY AND WITCHCRAFT BY SIR WALTER SCOTT, BART.
....--Connel MacKenzie 08:09, 31 Jan 2005 (UTC)
Well, DEMONS must possess my wiki'd text of Demonology...it's almost 1MB of wiki'd text, and for some reason, the server choked when I tried to submit it. (Stupid me, I tried a second time.) Perhaps I'll split it up tomorrow, perhaps not. As you pointed out earlier, the straight wiki'd text is not of tremendous value. Some yes, but not so important. Good night! --Connel MacKenzie 09:49, 31 Jan 2005 (UTC)
- I agree about the edit button (but it's still not as long as some sections of the "prescriptivism" debate). :-) You've made an interesting selection from Bacon, Steele and Scott -- It makes me wonder whether we might have better chosen one of Doyle's pieces about the spirit world. In any event I should look at the possibility of including the cited works in Wikisource. The difficulty that you had with the Scott text is no doubt a function of its size. Comparing texts may require that we define a standard size, and choose chapters and stories that come within range of that. At this early stage what work we choose from a representative author should not matter, but I foresee that the differences may become more apparent as our system becomes more refined. Eclecticology 20:06, 31 Jan 2005 (UTC)
- The selections I made were rather odd, but dictated mostly by what I saw was available on P.G. (Steele's only solo work was the one above...the rest of his on P.G. were co-authored.) I wouldn't recommend these as a reading list! I think I'll take the first 100Kb of the Scott text...
- So far: #1, #2, #3, in each of these lists show surprising similarity
- --Connel MacKenzie 04:09, 1 Feb 2005 (UTC)
- Interesting. For comparison purposes I would recommend that the data be normalized to show occurrences per 10,000 words. This would make the results independent of the text length. Eclecticology 17:56, 1 Feb 2005 (UTC)
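The normalization is a single ratio: occurrences divided by the total word count, multiplied by 10,000. A minimal sketch follows, with a hypothetical function name, using the figures reported earlier in this thread for the Doyle story ("suppose" occurring 10 times in 10,818 words).

```python
def per_ten_thousand(count, total_words):
    """Scale a raw count to occurrences per 10,000 words of text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count * 10_000 / total_words

# "suppose": 10 occurrences in a text of 10,818 words.
rate = per_ten_thousand(10, 10818)
print(round(rate, 2))  # 9.24
```

With rates on this common scale, a 100 KB excerpt of one text can be compared against a shorter or longer text from another author.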
Apparently I was too optimistic in estimating Easter as a milestone date. Work has recently picked up in intensity by an order of magnitude. I'd like to remember to get back to this when things calm down. --Connel MacKenzie 06:17, 20 Mar 2005 (UTC)
- No problem. If you aren't kept busy with work, there's always Mr.A.P. to waste your time. :-) Eclecticology 07:00, 20 Mar 2005 (UTC)
- I don't recall if there was more I wanted to do with these or not. --Connel MacKenzie T C 01:06, 29 June 2006 (UTC)
Text moved from article
The text (below) moved from the main article, reasons:
Either these words are obscure so there must be many more in "Holmes"
or they are not obscure.
These are some rare or obsolete words that even a native English speaker might need to look up when reading Doyle's Sherlock Holmes stories. * moor * cobs (description of horse) * balustraded * puritanical * tor * ecarte * trap (kind of vehicle) * betimes * dewlap
Saltmarsh 12:37, 30 December 2005 (UTC)
The following discussion has been moved from Wiktionary:Requests for moves, mergers and splits.
This discussion is no longer live and is left here as an archive. Please do not modify this conversation, but feel free to discuss its conclusions.
I propose moving Concordance:Holmes A to Concordance:Sherlock Holmes/A. By extension, Concordance:Holmes B would be moved to Concordance:Sherlock Holmes/B, and all other concordances from Sherlock Holmes would follow the same naming system. If this comes into effect, these pages will be subpages of Sherlock Holmes, which I believe is very desirable. --Daniel. 12:42, 14 October 2010 (UTC)
- Support. --Dan Polansky 09:56, 19 October 2010 (UTC)