Wiktionary:Votes/bt-2006-07/Bot flag for Minnan-ascii-bot

Discussion moved from Wiktionary:Beer parlour/2006/July#Bot flag for Minnan-ascii-bot.

I would like to request a bot flag for Minnan-ascii-bot because Hippietrail has blocked the bot for running without a bot flag. A yao 23:50, 28 July 2006 (UTC)

To get a bot flag we have to approve your bot. To approve your bot we need to know what you want to achieve with it. Please make a proposal and it will be discussed and voted on. -- Hippietrail 02:48, 29 July 2006 (UTC)

Examples:

a1 = a
a2 = á
a3 = à
a4 = ak
a5 = â
a7 = ā
a8 = a̍

Notes:

1. Number 4 is used only when the syllable's last letter is h, k, p or t.
2. There is no number 6.
A yao 07:23, 29 July 2006 (UTC)
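The numbering in the examples and notes above could be sketched in Python as follows. This is only an illustration, not the bot's actual code; the names `syllable_to_ascii` and `poj_to_ascii` are hypothetical, and the sketch assumes that tone 1 is always written out (as in `a1 = a`) and that syllables are separated by hyphens. It does not handle other POJ special characters.

```python
import unicodedata

# Combining marks used by POJ, mapped to the tone numbers listed above.
TONE_MARKS = {
    "\u0301": "2",  # acute, as in á
    "\u0300": "3",  # grave, as in à
    "\u0302": "5",  # circumflex, as in â
    "\u0304": "7",  # macron, as in ā
    "\u030d": "8",  # vertical line above, as in a̍
}
STOP_FINALS = ("h", "k", "p", "t")


def syllable_to_ascii(syllable):
    """Convert one POJ syllable to its tone-number form (hypothetical helper)."""
    tone = ""
    letters = []
    # NFD splits precomposed letters such as "â" into "a" plus a combining mark.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    base = unicodedata.normalize("NFC", "".join(letters))
    if not base:
        return base
    if not tone:
        # Unmarked syllables are tone 4 if they end in a stop, otherwise tone 1.
        tone = "4" if base.lower().endswith(STOP_FINALS) else "1"
    return base + tone


def poj_to_ascii(word):
    """Convert a hyphenated POJ word syllable by syllable."""
    return "-".join(syllable_to_ascii(s) for s in word.split("-"))


print(poj_to_ascii("sû-tián"))  # su5-tian2
```

Decomposing with NFD first means the same table works whether the input uses precomposed letters or separate combining marks.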

  • I read A-yao's explanation on the Minnan-ascii-bot page. It sounds like a good idea. Two comments:
  1. Could the bot be expanded to a more general special characters to ascii converter? For example, I would like the same thing done for Pinyin as well.
  2. I would rather see the letter ⁿ (ex. chhì-khòaⁿ-māi) converted to N instead of nn (ex. chhi3khoaN3mai7). I just think it looks better. This would also make it consistent with other on-line Min Nan dictionaries such as Tai-gi Hôa-gí sòaⁿ-téng sû-tián and Unicode Interface to the Holo-Mandarin Dictionary.

A-cai 07:32, 29 July 2006 (UTC)

Good idea, but there is also a conflict in ASCII. A-cai, I think there should be a different bot for Pinyin. For example, àn is an3 in Minnan ASCII but an4 in Pinyin. A yao 07:46, 29 July 2006 (UTC)
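The conflict can be made concrete with a small sketch. The POJ numbers follow the examples earlier in this discussion; the Pinyin numbers follow the usual four-tone convention (macron 1, acute 2, caron 3, grave 4). `tone_key` is a hypothetical helper for illustration, not part of any bot.

```python
import unicodedata

# Tone mark -> tone number, per romanization.
POJ_TONES = {"\u0301": "2", "\u0300": "3", "\u0302": "5", "\u0304": "7", "\u030d": "8"}
PINYIN_TONES = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def tone_key(syllable, table):
    """Strip any tone mark found in `table` and append its number."""
    tone = ""
    letters = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in table:
            tone = table[ch]
        else:
            letters.append(ch)
    return unicodedata.normalize("NFC", "".join(letters)) + tone


print(tone_key("àn", POJ_TONES))     # an3
print(tone_key("àn", PINYIN_TONES))  # an4
```

The same written form yields two different ASCII keys, so a single converter would need to know which romanization it is reading.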
I think a PinyinASCIIBot should be a separate bot, not combined with this one. --Connel MacKenzie 07:50, 29 July 2006 (UTC)
We don't normally wikify this type of entry, as there is no consistent way of entering them, is there? Aren't there several, conflicting methods of deriving these? --Connel MacKenzie 07:50, 29 July 2006 (UTC)
Not much conflict. Maybe you're describing N and nn. We can use N for ⁿ. I agree. Any more comments? A yao 07:57, 29 July 2006 (UTC)
In the (now deleted) entries that I saw, there were many hyphens ("-") in the entry headwords. Were these correct, or should they have been spaces (" ") instead? --Connel MacKenzie 07:59, 29 July 2006 (UTC)
Here is how it is done at the Holo-Mandarin dictionary: If you type each syllable without tones, and separate the syllables with a hyphen (ex. sia-hoat), it will give you all possible combinations that match that (in this case: siá-hoat 寫法 and siâ-hoat 邪法). You can also narrow your search by adding numbers for tones (ex. sia2-hoat4 gives you siá-hoat 寫法). As a general rule, we use hyphens for Wade-Giles and POJ. For Pinyin, we cram the syllables together to delineate individual words (ex. in Mandarin 公共场所, Pinyin: gōnggòng chǎngsuǒ, Wade-Giles: kung1-kung4 ch'ang3-so3, public place).

A-cai 08:02, 29 July 2006 (UTC)
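The lookup behaviour described above could be approximated as below. This is only a sketch of the matching rule, not how the Holo-Mandarin dictionary is actually implemented; the two-entry word list and the names `ENTRIES` and `lookup` are assumptions made for illustration, with entries stored under their tone-number keys.

```python
import re

# Tiny illustrative word list keyed by tone-number form (assumed data).
ENTRIES = {
    "sia2-hoat4": "siá-hoat 寫法",
    "sia5-hoat4": "siâ-hoat 邪法",
}


def lookup(query):
    """Return entries matching a query whose syllables may omit tone numbers."""
    parts = []
    for syllable in query.split("-"):
        if syllable[-1:].isdigit():
            parts.append(re.escape(syllable))          # tone given: exact match
        else:
            parts.append(re.escape(syllable) + r"\d")  # tone omitted: any tone
    pattern = re.compile("-".join(parts))
    return [value for key, value in sorted(ENTRIES.items()) if pattern.fullmatch(key)]


print(lookup("sia-hoat"))    # both entries
print(lookup("sia2-hoat4"))  # only siá-hoat 寫法
```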

Also in the now deleted entries, I saw stuff like [[Category:ASCII Pe̍h-ōe-jī]] after the redirect. I think adding anything after the redirect syntax is problematic. Does that have the desired effect? --Connel MacKenzie 08:03, 29 July 2006 (UTC)
You can add as many categories as you want after redirects without breaking the redirect. This is a very useful feature. For example, I can create an entry for an idiom in Simplified Chinese and place it in the Category:zh-cn:Idioms category. If I so desire, I can create a matching Traditional Chinese entry that redirects to its simplified equivalent, but place it in the Category:zh-tw:Idioms category.

A-cai 08:09, 29 July 2006 (UTC)

Connel Mackenzie, in my experience in the Min Nan Wiktionary, there is problem. A yao 08:14, 29 July 2006 (UTC)[reply]
You meant to say "there is no problem" perhaps? --Connel MacKenzie 09:34, 29 July 2006 (UTC)
Yes, thank you for the correction. A yao
Would a language index be a better way of recording these "words"? Entries could look something like:
bun5-gian5 - bûn-giân - 文言

but nicely formatted. People using the "Search" button for bun5 would then find the index entry. It would be a big index, but could be subdivided like all the others. SemperBlotto 08:58, 29 July 2006 (UTC)

No. It may be more complicated. A yao 09:16, 29 July 2006 (UTC)
You've already identified likely conflicts with similar Pinyin redirects. In all such cases, redirects are not the way to make unambiguous entries. That is why redirects are strongly discouraged on Wiktionaries. Also, we do not use the hyphens between syllables here. If they really are supposed to be non-word components, then they belong in the Appendix: namespace, not the Index: namespace. But the redirects would be wrong, in either case, I think. --Connel MacKenzie 09:41, 29 July 2006 (UTC)
The behavior of extra "stuff" after a redirect is undefined. The category trick may work in this iteration of software, but may break in future versions. --Connel MacKenzie 09:41, 29 July 2006 (UTC)
The extraneous stuff after the category (repeating the headword) is also incorrect, BTW. Or were they simply meant as a test to make sure the bot was uploading them to the correct headword? --Connel MacKenzie 09:41, 29 July 2006 (UTC)
I think we're getting at the crux of the problem with Wiktionary in its current state. For foreign languages that use non-alphabetic scripts, we need a way to look up words both phonetically (using some kind of "abc" romanization) and in the script of the language. For example, to properly document the word for "dictionary" in Min Nan, Mandarin, Cantonese, Korean, Japanese and Vietnamese, we will ultimately need the ability to find it via each of the following written forms:
  1. Traditional Chinese: 辭典
  2. Simplified Chinese: 辞典
  3. Japanese shinjitai: same as Simplified Chinese in this case, but not always
  4. Mandarin Pinyin with diacritics: cídiǎn
  5. Mandarin Pinyin with tone numbers: ci2dian3
  6. Mandarin Pinyin without tones: cidian
  7. Min Nan POJ with diacritics: sû-tián
  8. Min Nan POJ with tone numbers: su5-tian2
  9. Min Nan POJ without tones: su-tian
  10. Cantonese Jyutping with tones: ci4 din2
  11. Cantonese Jyutping without tones: cidin
  12. Japanese hiragana: じてん
  13. Japanese romaji: jiten
  14. Korean hangul
  15. romanized Korean
  16. Vietnamese with diacritics
  17. Vietnamese with tone numbers
  18. Vietnamese without tone numbers

A yao is proposing to create a bot so that we don't have to create su5-tian2. If we create sû-tián, the bot would create su5-tian2 automatically, then redirect su5-tian2 to sû-tián. This works great on the Min Nan wiktionary (so far) because it has far fewer words in far fewer languages than English Wiktionary. The English Wiktionarians' current approach is to create a separate entry for each of the above (despite the fact that they are all cognates). However, because we are attempting to document all languages, we run into the problem of homonyms. So this is our challenge, but what is the best solution to all of this? WiktionaryZ looks promising, but only time will tell. One thing is for sure: as one person, I simply don't have the time or energy to manually create all of the required entries even for this one group of cognates! A-cai 10:54, 29 July 2006 (UTC)

In the general case, you are not going to be able to do this with redirects. You are going to have collisions with words in other languages with the same spelling (e.g. bun). I think ASCII'ized versions, i.e. not what people write, but index keys, belong in a different namespace. So sû-tián is in the main namespace, su5-tian2 belongs in an index. (People don't write su5-tian2, do they? I could be wrong here.) Robert Ullmann 12:11, 29 July 2006 (UTC)
In general, the only time you might write su5-tian2 is if you were going to look the word up in an on-line dictionary, since it is easier to type the numbers. Incidentally, if I typed su5-tian2 in the search box, the Wiktionary software would not be able to find it if it were placed in an index. This is why A yao wants a separate entry for it.

A-cai 22:10, 29 July 2006 (UTC)

OK, that is my point. "su5-tian2" is not a word, it is an indexing key. If we need better indexing or search boxen, fine. It isn't a word. Robert Ullmann 22:41, 29 July 2006 (UTC)
If someone creates sû-tián, then Minnan-ascii-bot makes a redirect such as su5-tian2, which goes to sû-tián. This is what happens on the Min Nan Wiktionary. A yao
Computing began with ASCII, if I am correct. Then came the code pages for the Europeans. So Westerners do not have much of a problem typing words directly when searching for them.
Romanization of Asian languages is useful for those who want to learn these languages. That's where Pinyin and POJ, for example, come in. As A-cai has said, we use su5-tian2 when searching for Min Nan words. The ASCII method saves the time needed to open a Unicode editor, type the word, cut and paste it into the search bar, close the editor and then press [Go].
There are some issues. OK, let's try to find solutions.
  • Pinyin and POJ conflict, and Asian ASCII (su5-tian2) is not a word: I am in a fix if su5-tian2 is not a word, but there is still the issue of the Pinyin/POJ conflict in a3. Can we use poj:a3 and py:a3 to make the distinction clear? Or maybe poj/a3 and py/a3. (I am not clear about this, but is this what you are proposing, Mr. Ullmann?) If this is acceptable, then we will create another page on how to use this new method of searching, say something like "Wiktionary:Search Asian ASCII Method" or "Portal:Search Asian ASCII Method", as a how-to for searching with Pinyin, POJ, romaji, etc.
If the page for sû-tián had su5-tian2 on it somewhere (a very reasonable idea), then search for "su5-tian2" would do exactly what you want, without any extra entries in the main namespace. Robert Ullmann 14:48, 30 July 2006 (UTC)
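A toy model of this suggestion: if the tone-number spelling appears in the text of the entry, an ordinary full-text search finds the entry without any redirect. The page texts and the `search` helper below are made up for illustration and say nothing about how the wiki's search is really implemented.

```python
# Hypothetical page store: title -> entry text containing the ASCII spelling.
PAGES = {
    "sû-tián": "Min Nan noun. POJ with tone numbers: su5-tian2. Meaning: dictionary.",
    "lâng": "Min Nan noun. POJ with tone numbers: lang5. Meaning: person.",
}


def search(term):
    """Return titles of pages whose text contains the term as a whole token."""
    hits = []
    for title, text in PAGES.items():
        # Crude tokenizer: drop full stops and colons, then split on whitespace.
        tokens = text.replace(".", " ").replace(":", " ").split()
        if term in tokens:
            hits.append(title)
    return hits


print(search("su5-tian2"))  # ['sû-tián']
```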
  • Possible future redirect problem: With regard to the category, we can do without it. As for the '''headword''' thing, Minnan-ASCII-bot uses the general Pagefromfile.py. I wrote a crude Perl script to create the ASCII text from the diacritic entries in the language category and let Pagefromfile.py process the redirects. I do not know how to get rid of the headword, since I do not know how to change the script to suit my purpose. If anyone can recode the Python script to use the namespace so that it will create the page and then discard the headword from the text, I would be very grateful.
    Regarding the headword:
    Some twit/vandal/wikipedian keeps mangling the documentation on meta: that I've corrected once or twice; perhaps it works differently on wikipedias than it does on Wiktionaries. The '''headword''' must be on a line between the {{-end-}} and {{-start-}} instead of inside the content. I have found "pagesfromfile.py" sorely inadequate for most tasks, as multiple entries share the headword, very often. --Connel MacKenzie 06:19, 31 July 2006 (UTC)
  • Space instead of hyphen: Each single word has its own meaning; when words are connected with a hyphen, the compound becomes its own distinct word. It would be like looking at a text like "GODISNOWHERE", which could be read as either "GOD IS NOW HERE" or "GOD IS NOWHERE", depending on how you first looked at it. Er, yeah, not a good example, but it works for me ;) My point: the hyphen makes the POJ entry clearer.
  • did I miss something?
Hiòng-êng 11:15, 30 July 2006 (UTC)
In addition, this bot is here for the purpose of making it easier to search for words. If we needed an index to look for words, we would go to Category:Min Nan. I feel it is easier to type the ASCII equivalent of the word in the search box and find the word I'm looking for. Hiòng-êng 11:30, 30 July 2006 (UTC)
Regarding Hiòng-êng's suggestion, I can make it clearer. For example, suppose a person would like to look up lâng, the Min Nan word for "person". To make the search easier, the person can just type lang5 to go to the entry instead of opening a Unicode editor to type lâng. A yao 11:45, 30 July 2006 (UTC)
Continuation to some issues...
  • "The purpose -- the only purpose -- of a redirect is to connect variant forms of the same word": I think that this is not the only purpose. Redirects are also used as a way to ease typing, e.g. WT:BP. Would you please reconsider my earlier proposal?
-- Hiòng-êng 10:05, 1 August 2006 (UTC)
The shorthand WT:BP is not in the main namespace, and would not be used in the main namespace. All entries in the main namespace are words in use, written as they are written in use. So we do have initialisms, DNS for example, because, and only because, people actually write "DNS".

If you are having trouble entering diacriticals or something, get a better keyboard driver, and learn how to type them. (I know, Windows makes this painful ... if you are using the US keyboard, load "US International".) If you are using another operating system, you have lots of options. How do you write in Min Nan anyway? Surely you don't use a "Unicode editor" all the time?! Ouch. Robert Ullmann 19:49, 1 August 2006 (UTC)

I have to disagree with your above statement about diacritics. Obviously, if a person is an advanced user, your advice would be appropriate. However, we should also be trying to accommodate beginning users who may not be familiar with how to configure their computer to type with diacritics. Keep in mind that POJ Romanization is not universally used by Min Nan speakers. In fact, a large number of Min Nan speakers use Chinese characters to write Min Nan, if they write it at all (Min Nan speakers are usually bilingual). Min Nan is still primarily a spoken language. We therefore need a simple way for Min Nan speakers to look up words WITHOUT having to go to extraordinary lengths such as learning a new input method or reconfiguring their operating system. Other on-line Min Nan and Mandarin dictionaries already routinely offer this feature. If we want to be competitive with other on-line dictionaries, we need to be thinking about this.

A-cai 22:24, 1 August 2006 (UTC)

Mr. Robert Ullmann, how did you know I am always using a "Unicode Editor"? :) Yes, but take note: in Yudit, I have created an input system for my Min Nan entries; I am not using the Unicode number to enter the diacritics. Back to the topic at hand. My point is in line with A-cai's, which is to accommodate beginners. If redirects are a no-no in the main namespace, maybe we could ask for some "romanization namespace" such as poj or pinyin, just as I suggested earlier.
Hiòng-êng 00:38, 2 August 2006 (UTC)
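The "romanization namespace" idea can be sketched as prefixed redirect keys. In a flat namespace, POJ à and Pinyin ǎ both reduce to the key a3, so the second redirect overwrites the first; with a prefix such as poj: or py: (the prefixes floated earlier in this discussion), each key stays distinct. The dictionaries and the `add_redirect` helper are hypothetical and only model the collision, not the wiki software.

```python
def add_redirect(table, key, target):
    """Store a redirect and return the target it overwrote, or None."""
    overwritten = table.get(key)
    table[key] = target
    return overwritten


# Flat namespace: both romanizations compete for the same key.
flat = {}
add_redirect(flat, "a3", "à")         # POJ tone 3
lost = add_redirect(flat, "a3", "ǎ")  # Pinyin tone 3 overwrites it

# Prefixed keys keep the two romanizations apart.
prefixed = {}
add_redirect(prefixed, "poj:a3", "à")
add_redirect(prefixed, "py:a3", "ǎ")

print(lost)              # à
print(sorted(prefixed))  # ['poj:a3', 'py:a3']
```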