User talk:PolyBot

From Wiktionary, the free dictionary

Conversation with Allan[edit]

Hi Allan,

In which programming language are you writing those bots? I'm doing something similar to what your first bot does. I think I'm taking it a bit further, though: I'm disassembling an entry entirely into its most basic form, then building it up again in the proper format. Anyway, maybe we can work together on this. Or maybe you can concentrate on the Webster's bot. If you want to see what I'm doing, you can have a look here: User:PolyBot. There is a link to the code on that page. Polyglot 23:20, 7 September 2005 (UTC)[reply]

Polyglot: Your Wiktionary 'bot is taking exactly the same approach as I planned. I also use a grammar to parse the entry, but mine goes deeper (picking out plurals and participles). I prefer your approach, since mine is very sensitive to formatting inconsistencies.
I have a slightly different purpose from yours. Apart from the little formatting irregularities (most of which can't be fixed automatically anyway), my main concern is with links between pages in the same language. When I get around to it, I want to make sure that my 'bot creates new articles where necessary for plural and inflected forms (as is the convention in en), and also deals with all those other relationships (if A is a synonym/rhyme/antonym of B, B should also have the same link to A).
Then I want to make a start on the categories, which are generally a mess.
I'm quite happy to work from the same codebase as you. Do you ever visit the IRC channel? I tend to be there from about 7-9am, UK time. Allan 17:39, 8 September 2005 (UTC)[reply]


Hi Allan,
Glad to see your comment on my talk page. In fact, correcting links between entries is something I also want to do eventually. Writing the parser is proving hard enough, though, and it's a prerequisite before we can take on the rest. When a page is read and parsed, many links are encountered: links for synonyms, links for translations. If we manage to interpret them correctly and store them in objects, then it should become feasible to create a new page object with exactly the information you're after: a page that points back to the entry we were processing, containing all the relevant information. At the same time the current page can be cleaned up. If the secondary page already exists, we can start processing it the same way as the first, merging the information that was already there with the information we found on the primary page. If there are conflicts, the operator of the bot is notified about them.
Of course this process is recursive and we have to find a way to decide when a page is 'ready' to be written back. There are many problems to solve, and it will be necessary to do it all in small incremental steps.
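A minimal Python sketch of the object model being described, assuming hypothetical `Page` and `reverse_links` names (this is not the actual PolyBot code): each parsed entry becomes an object holding its outgoing relation links, from which the back-references the target pages should carry can be derived for a later merge step.

```python
# Illustrative sketch: a parsed entry as an object with relation links,
# plus derivation of the reverse links its targets should carry.
# Class and function names are assumptions, not the real bot's code.

class Page:
    def __init__(self, title):
        self.title = title
        self.relations = {}          # relation name -> set of target titles

    def add_relation(self, kind, target):
        self.relations.setdefault(kind, set()).add(target)

# Symmetric relations: if A lists B, B should list A under the same header.
SYMMETRIC = {"synonym", "antonym", "rhyme", "see also"}

def reverse_links(page):
    """Yield (target_title, kind, source_title) triples the target pages
    should contain; a merge step (or the operator) decides whether to
    actually write them."""
    for kind, targets in page.relations.items():
        if kind in SYMMETRIC:
            for target in targets:
                yield (target, kind, page.title)

p = Page("big")
p.add_relation("synonym", "large")
p.add_relation("antonym", "small")
links = set(reverse_links(p))
# links contains ("large", "synonym", "big") and ("small", "antonym", "big")
```

The recursion described above would then apply: each derived link names a secondary page, which gets parsed and merged the same way before anything is written back.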
Do you know something of Python? It's a great programming language for this kind of work. It has a lot of possibilities and doesn't get in the way. Eventually it will become possible to write some kind of GUI to get a better overview of all that is going on (which entries are being worked on, what information has already been retrieved, etc.).
I don't want to limit the gathering of data to one Wiktionary either. That's part of the reason why the parsing takes such a liberal approach. Because information is duplicated across the different language Wiktionaries, other-language Wiktionaries will always contain different kinds of information than this one. Say the gender of a French word hasn't been added here. The bot can look the word up on the French Wiktionary and find it there. In case of a conflict, this can be resolved. The information gets more accurate, more complete, and more homogeneous across projects.
Of course, getting the parser to actually correctly interpret the different styles used across projects is going to be a serious challenge.
But I think it's feasible, going one step at a time. For now I'm concentrating on the English Wiktionary format; then I'll tackle nl, followed by de, fr and es. That's where I'll probably stop. Other languages' formats are going to have to be implemented by contributors of those Wiktionaries. I believe, though, that the foundation will be laid by then and that it will mostly be a matter of adding the language-specific terminology.
I haven't been working with regular expressions so far. I'm sure some things would be more efficient with them, but I don't have a lot of experience with them yet.
Finally, since the information is all broken down and stored in objects, it will become possible to export to many other formats, be it XML or Ultimate Wiktionary or any of the many Wiktionary formats in existence. That, though, is really the easy part. Parsing the information and getting it into those object structures correctly is the challenge. Also, programming it in such a way that it backs off from information that really can't be parsed is tough.
What I do for the moment is break down an entry at the header level. Some headers/sections will be left alone for the moment. It allows us to work in a modular way and to concentrate on what I really want to do: definitions, synonyms and translations. Later on, parsing of the pronunciation section etc. can be added. I don't think I will tackle the etymology section anytime soon. So those sections have to pass through untouched, and their order shouldn't be changed, so as not to disrupt the pages too much.
Anyway, I'm glad you're interested and I hope we can collaborate, each concentrating on our own purpose. I think our noses are pointing mostly in the same direction... Polyglot 18:53, 8 September 2005 (UTC)[reply]
Here's my list of rules that a bot can easily apply, ignoring all the inter-language stuff you are currently working on:
  • Formatting (brackets and bolding) of quotations and inflected forms (e.g. putting the article name in bold, restrictive labels in brackets, interwiki links at the top)
  • Putting the parts-of-speech in alphabetical order
I'm not sure we want that. It was something I was planning to leave alone. I even store the order in which the parts of speech are encountered. In fact usually they are in a specific order for a reason.
Probably so. I tend to use WT:ELE as political insulation when considering what to change - the bot should move articles towards this document, which is the nearest we have to consensus in most areas. However, on this point I think we should probably discuss modification of ELE rather than modification of articles.
  • If A has a plural or inflected form of B, make sure that B references A. In the case of plurals, you can create a new article for B if one doesn't exist. This is harder for inflected forms, because it's not the done thing for 'bots to create incomplete entries (I see 'hang', the bot creates 'hanging = verb, participle of hang' but omits the noun meaning).
Incomplete entries are the way the wiki process works. It's not something to worry about. When a human contributor comes by, notices something is missing and actually feels like adding it, it will get added. It may take 20 years or more for some entries, but when there is something missing it will get added eventually.
I'm very nervous on this point - this issue was what killed the attempt to import Webster. I'd like to consult more widely. (Maybe we could use content from Webster to fill in the gaps.)
  • If A is a synonym, antonym, rhyme or "See also" for B, B should have the same relationship to A
I think this kind of thing should be detected, and then the operator of the bot should be asked whether a change should be made or not.
Or maybe the operator reviews this list in bulk ahead of time, rather than spending time waiting for wiktionary servers.
  • Set phrases and idioms (e.g. come a cropper), should automatically be linked to from come and cropper
  • Where we can automatically generate categories (irregular verbs, irregular plurals, entries with usage notes), we can automate the creation of category links
  • Adding a {{rfc}} tag to stuff it really can't understand
I feel like leaving alone whatever the bot cannot parse with relative surety. Or, ask the operator. In that sense it isn't really a bot; it's more like an automated cleanup tool - an assistant that takes away the grunt work and delegates the interesting work to the bot's operator.
Connel suggested a new 'not parseable' category. I think there is a service we can provide in identifying non-conforming articles (e.g. those that don't start with "==English=="), but we'd probably need to make this decision once we see the number of "false positives" the bot throws out.
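The non-conformance check mentioned above could start as small as this sketch (the function name and the exact rule - "first non-blank line must be a level-2 header" - are assumptions for illustration): anything failing the check gets listed, or tagged {{rfc}}, for human review.

```python
# Illustrative check for Connel's 'not parseable' idea: flag entries whose
# first non-blank line is not a level-2 language header like ==English==.
# The rule here is a deliberately simple assumption, not settled policy.
import re

def nonconforming(wikitext):
    """True if the entry does not open with a level-2 language header."""
    for line in wikitext.splitlines():
        if line.strip():                           # skip leading blank lines
            return not re.match(r"^==[^=].*==\s*$", line)
    return True                                    # empty page: flag it

nonconforming("==English==\n'''word'''")   # → False (conforming)
nonconforming("'''word'''\n# definition")  # → True (flag for review)
```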
  • At some point, standardize the way we deal with countable/uncountable and transitive/intransitive when a consensus for this is reached.
Yep
  • (I missed one). We can use stemming to interlink related words (e.g. accumulate to accumulation), possibly with a review step by a human.
Allan 19:22, 8 September 2005 (UTC)[reply]
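The stemming idea in the last bullet could be prototyped along these lines. A real bot would want a proper stemmer (e.g. Porter's algorithm); the cut-down suffix list and the `link_candidates` helper here are purely illustrative assumptions, with the human review step the bullet calls for kept in mind:

```python
# Crude suffix-stripping stemmer, enough to group accumulate/accumulation.
# The suffix list and grouping are demo assumptions; candidate groups are
# meant for a human to review, not for blind cross-linking.

SUFFIXES = ["ation", "ate", "ion", "ing", "ed", "s"]

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def link_candidates(words):
    """Group words by shared stem; each group of two or more is a candidate
    'Related terms' cross-link set for the operator to confirm."""
    groups = {}
    for w in words:
        groups.setdefault(stem(w), []).append(w)
    return [g for g in groups.values() if len(g) > 1]

link_candidates(["accumulate", "accumulation", "cropper"])
# → [['accumulate', 'accumulation']]
```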
I only added comments where I wanted to make a remark. Where there is no remark, I agree. Polyglot 20:24, 8 September 2005 (UTC)[reply]
I added comments to your comments Allan 06:37, 10 September 2005 (UTC)[reply]

I know nothing[edit]

Before doing this manually, I thought I would ask y'all if there was a bot that could replace {-en-} with ==English==. I found the following list: Special:Whatlinkshere/Template:-en-

Thanks,

--Stranger 03:19, 12 September 2005 (UTC)[reply]
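For what it's worth, the text transform itself is a one-liner with Python's standard `re` module; a script looping over that Whatlinkshere list could apply it page by page (assuming the template appears in the usual `{{-en-}}` double-brace form - the page fetching and saving, e.g. via the pywikipedia framework, is left out here):

```python
# One-off transform sketch: replace the {{-en-}} language template with a
# ==English== header. Only the text substitution is shown; fetching and
# saving each page from the Whatlinkshere list is left to the bot framework.
import re

def replace_en_template(wikitext):
    """Replace every occurrence of {{-en-}} with ==English==."""
    return re.sub(r"\{\{-en-\}\}", "==English==", wikitext)

replace_en_template("{{-en-}}\n'''word'''\n# a definition")
# → "==English==\n'''word'''\n# a definition"
```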

Your account will be renamed[edit]

00:06, 18 March 2015 (UTC)

Renamed[edit]

07:23, 21 April 2015 (UTC)