Module talk:zh/data/yue-word

From Wiktionary, the free dictionary

CantoDict data, which can be used to add Cantonese pronunciations (~60,000 words).

Wyang (talk) 03:11, 7 August 2014 (UTC)

Snippet view

| Word | Word (trad/simp) | Jyutping | Pinyin | Definition | Usage |
|---|---|---|---|---|---|
| 香水 | 香水 | hoeng1 seoi2 | xiang1 shui3 | perfume | Cantonese and Mandarin |
| 水星 | 水星 | seoi2 sing1 | | Mercury (Planet) | Cantonese and Mandarin |
| 大風 | 大風 | daai6 fung1 | da4 feng1 | strong wind, gale | Cantonese and Mandarin |
| 水手 | 水手 | seoi2 sau2 | shui3 shou3 | sailor, mariner | Cantonese and Mandarin |
| 菜圃 | 菜圃 | coi3 pou1 | | vegetable farm | Cantonese and Mandarin |
| 大人 | 大人 | daai6 jan4 | da4 ren5/1 | | Cantonese and Mandarin |
| 雪白 | 雪白 | syut3 baak6 | xue3 bai2 | snowy white | Cantonese and Mandarin |
| 灰白 | 灰白 | fui1 baak6 | hui1 bai2 | grey, ashen | Cantonese and Mandarin |
| 白天 | 白天 | baak6 tin1 | bai2 tian1 | daytime | Cantonese and Mandarin |
| 白白 | 白白 | baak6 baak6 | bai2 bai2 | In vain, to no purpose, for nothing | Mandarin |
| 結巴 | 結巴 / 结巴 | git3 baa1 | jie1 ba5 | [v] to stutter; to stammer [adj/v/n] stammering; stuttering [n] stutterer; stammerer | |
| 雨水 | 雨水 | jyu5 seoi2 | yu3 shui3 | rainwater; Rain Water, 2nd of the 24 solar terms 19th February-5th March | Cantonese and Mandarin |
| 死水 | 死水 | sei2 seoi1 | | stagnant water | Cantonese and Mandarin |
| 水果 | 水果 | seoi2 gwo2 | shui3 guo3 | Fruit. Cantonese speakers would normally say . | Cantonese and Mandarin |
| 大家 | 大家 | daai6 gaa1 | da4 jia1 | Everybody | Cantonese and Mandarin |
| 大小 | 大小 | daai6 siu2 | da4 xiao3 | size, proportions | Cantonese and Mandarin |
| 大話 | 大話 | daai6 waa6 | da4 hua1 | | Cantonese and Mandarin |

Cantodata


@Wyang How did you manage to extract it? Is Adam Sheik happy with it, or did he give it to you himself? --Anatoli T. (обсудить/вклад) 23:58, 7 August 2014 (UTC)

CEDict data could also be used for Mandarin. There may be some occasional terms that would not meet CFI. --Anatoli T. (обсудить/вклад) 00:01, 8 August 2014 (UTC)
I used the script at User:Wyang/code to extract it - it is still running and there are about 20,000 left (another 30 min). No, this doesn't have Adam Sheik's permission, which is why I only gave a snippet view of it. The pronunciation data (Jyutping) is uncopyrightable, and is potentially useful here. We can 1) use a bot to automatically add Cantonese pronunciations to existing Chinese entries; 2) make {{zh-new}} look up in this data whenever a new Chinese entry is created. Wyang (talk) 00:21, 8 August 2014 (UTC)
Awesome. Very good idea. Please consider extracting CEDict as well; the whole dictionary could be used to create Chinese entries here. WWWJDIC or EDICT data could similarly be used for Japanese. --Anatoli T. (обсудить/вклад) 00:31, 8 August 2014 (UTC)
They are now uploaded: Special:PrefixIndex/Module:zh/data/Jyutping_word. Wyang (talk) 03:42, 8 August 2014 (UTC)
Thank you but for me it's easier to access the site directly. :) Not sure I'll personally be able to use these files. --Anatoli T. (обсудить/вклад) 12:42, 8 August 2014 (UTC)
I haven't incorporated these in the current infrastructure yet... Let me do the integration soon... Wyang (talk) 10:25, 9 August 2014 (UTC)
Both utilities are done: [1], [2]. Wyang (talk) 01:16, 11 August 2014 (UTC)
Great stuff! However, I am a bit concerned about editors not familiar with Cantonese automatically generating both Mandarin and Cantonese, relying only on automatic Pinyin and Jyutping generation. Perhaps the automatic Jyutping should be a parameter, e.g. c=y? (Mandarin could be disabled by m=n for Cantonese-only entries?) Also, I'm not sure what you used to update Jyutping in existing entries, as you did in [3]. --Anatoli T. (обсудить/вклад) 01:34, 11 August 2014 (UTC)
I think the benefit of allowing automatic generation will be greater - it's quite hard to get it wrong. Both can be disabled by |m/c=-. I used the yue_for_bot function in Module:zh to do the latter. Wyang (talk) 01:41, 11 August 2014 (UTC)
I think I understand it now. It actually checks Jyutping for the whole word and adds it only when it can find a match? It's great then! --Anatoli T. (обсудить/вклад) 06:57, 11 August 2014 (UTC)
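
The whole-word lookup described in this thread can be sketched roughly as follows. This is a minimal Python illustration only, not the Lua code in Module:zh: the names YUE_WORD and cantonese_for are hypothetical, the three entries are copied from the snippet view above, and the only parameter behaviour modelled is the c=- switch mentioned by Wyang.

```python
from typing import Optional

# Hypothetical stand-in for the word-level data module;
# the three rows are taken from the snippet view on this page.
YUE_WORD = {
    "香水": "hoeng1 seoi2",
    "水手": "seoi2 sau2",
    "大風": "daai6 fung1",
}


def cantonese_for(word: str, c: Optional[str] = None) -> Optional[str]:
    """Return Jyutping for the whole word, or None if nothing should be added."""
    if c == "-":
        # Assumed to mirror |c=-: automatic Cantonese is switched off.
        return None
    # Whole-word lookup only: no character-by-character fallback,
    # so an unknown word yields None rather than a guessed reading.
    return YUE_WORD.get(word)
```

Because a missing word returns None instead of a reading stitched together from single characters, an editor unfamiliar with Cantonese never gets a guessed pronunciation for a word the data does not cover.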

New additions


@Justinrleung, Suzukaze-c, Atitarev, kc_kennylau (I know Justin is on (bwek)...)

I incorporated the CC-Canto and CC-CEDICT Cantonese data into our existing Cantonese data today, so that now it has increased to >134,000 entries with pronunciation. Examples of the addition are this and this.

I'd love to know what you guys think. There are some errors in their data, especially in words with 多音字, and I'm not sure whether the new changes are worthwhile. There are two options: either we keep the new data and check all the incoming entries in Category:Cantonese lemmas to make sure the Cantonese pronunciation is correct, or we revert to the old revisions because the new data is too unreliable.

Wyang (talk) 13:50, 22 October 2017 (UTC)
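
One way to make the first option more manageable is to merge conservatively and keep a list of disagreements for review. The sketch below is only an assumption about how that could look in Python; it is not how the CC-Canto and CC-CEDICT data were actually incorporated, and merge_pronunciations is a hypothetical name.

```python
from typing import Dict, Tuple


def merge_pronunciations(
    existing: Dict[str, str], incoming: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, Tuple[str, str]]]:
    """Add new words from `incoming`; keep existing readings and record conflicts."""
    merged = dict(existing)
    conflicts: Dict[str, Tuple[str, str]] = {}
    for word, jyutping in incoming.items():
        if word not in merged:
            # Word is new to the data: accept the incoming reading.
            merged[word] = jyutping
        elif merged[word] != jyutping:
            # Sources disagree (for example on a 多音字): keep the old
            # reading and queue the pair for manual checking.
            conflicts[word] = (merged[word], jyutping)
    return merged, conflicts
```

The conflict list is a natural starting point for manual review, since words where two sources disagree are more likely to contain an error than words where they agree.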

I'm not sure either. Maybe we can wait and see. —suzukaze (tc) 03:40, 23 October 2017 (UTC)
Most are good, but not 100%. The approach in both CC-Canto and CC-CEDICT seems to be that whatever works in Mandarin will work in Cantonese, even if a term is colloquial or even regional. The quality of the Jyutping transliteration sometimes depends on the contributor, who could have used the most frequent reading or a random one, and an error may go unnoticed by others (native speakers or advanced learners). On the upside, Cantonese entries get checked at Wiktionary, and it may not take very long for an error to be noticed. I don't have a strong objection, but let's see what others think. An option is to add {{attention|zh|Cantonese jyutping needs to be checked}} for each automated entry. --Anatoli T. (обсудить/вклад) 12:21, 23 October 2017 (UTC)
If we're going to tag CC-Canto data with {{attention}}, we might as well tag all uses of {{zh-new}}: sometimes I don't notice problems with pinyin, sometimes the Hokkien data has peculiarities, etc. I don't think that approach is practical. The CantoDict data isn't perfect either; corrections from y- to j- (the /j/ initial) have been made a few times. —suzukaze (tc) 21:06, 23 October 2017 (UTC)
@Wyang, Atitarev, Suzukaze-c: (Kinda late, but I'm back from the break!) I think the data should be fine. There are just some problems, such as not indicating tone change and using full-width commas. Overall, it's probably as problematic as CantoDict is. — justin(r)leung { (t...) | c=› } 04:20, 28 October 2017 (UTC)
@Justinrleung, Suzukaze-c, Atitarev: OK, thanks all. We will just keep an eye on it for now, and I will run some regular checks to try to catch the errors (like User:Wyang/yue-char-pron). (Btw, glad to see you are (baek), Justin!) Wyang (talk) 05:07, 28 October 2017 (UTC)
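
Two of the problems raised in this thread, full-width commas and y- written for the /j/ initial, are mechanical enough to flag automatically. The following Python sketch is an assumption about what such a check might look like and is unrelated to the actual contents of User:Wyang/yue-char-pron. The function name check_jyutping is hypothetical, and it assumes readings are space-separated syllables with commas between alternatives. It only flags problems and does not attempt corrections, and it cannot detect a missing tone change.

```python
import re
from typing import List

FULLWIDTH_COMMA = "\uff0c"  # the character "，"
# Loose syllable shape: lowercase letters followed by a tone digit 1-6.
SYLLABLE = re.compile(r"^[a-z]+[1-6]$")


def check_jyutping(reading: str) -> List[str]:
    """Return a list of human-readable problems found in one reading."""
    problems = []
    if FULLWIDTH_COMMA in reading:
        problems.append("full-width comma")
    # Split on whitespace and on either kind of comma.
    for syllable in re.split(r"[\s,\uff0c]+", reading.strip()):
        if not syllable:
            continue
        if syllable.startswith("y"):
            # Jyutping writes the /j/ initial as j-, so a leading y
            # suggests Yale-style input.
            problems.append(f"y- initial: {syllable}")
        elif not SYLLABLE.match(syllable):
            problems.append(f"malformed syllable: {syllable}")
    return problems
```

For example, `check_jyutping("yat1 go3")` reports the y- initial on the first syllable, while a clean reading such as `"hoeng1 seoi2"` returns an empty list.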

@Wyang Could Kaifang Cidian data be added as well? —suzukaze (tc) 04:30, 11 November 2017 (UTC)

@Suzukaze-c Yes, I will add it. Wyang (talk) 04:55, 11 November 2017 (UTC)