Module talk:zh/data/yue-word

Cantodict data, which can be used to add Cantonese pronunciations (~60,000 words)

Wyang (talk) 03:11, 7 August 2014 (UTC)

Snippet view

Word	Word (trad/simp)	Jyutping	Pinyin	Definition	Usage
香水	香水	hoeng1 seoi2	xiang1 shui3	perfume	Cantonese and Mandarin
水星	水星	seoi2 sing1		Mercury (Planet)	Cantonese and Mandarin
大風	大風	daai6 fung1	da4 feng1	strong wind, gale	Cantonese and Mandarin
水手	水手	seoi2 sau2	shui3 shou3	sailor, mariner	Cantonese and Mandarin
菜圃	菜圃	coi3 pou1		vegetable farm	Cantonese and Mandarin
大人	大人	daai6 jan4	da4 ren5/1		Cantonese and Mandarin
雪白	雪白	syut3 baak6	xue3 bai2	snowy white	Cantonese and Mandarin
灰白	灰白	fui1 baak6	hui1 bai2	grey, ashen	Cantonese and Mandarin
白天	白天	baak6 tin1	bai2 tian1	daytime	Cantonese and Mandarin
白白	白白	baak6 baak6	bai2 bai2	In vain, to no purpose, for nothing	Mandarin
結巴	結巴 / 结巴	git3 baa1	jie1 ba5	[v] to stutter; to stammer [adj/v/n] stammering; stuttering [n] stutterer; stammerer
雨水	雨水	jyu5 seoi2	yu3 shui3	rainwater; Rain Water, 2nd of the 24 solar terms 19th February-5th March	Cantonese and Mandarin
死水	死水	sei2 seoi1		stagnant water	Cantonese and Mandarin
水果	水果	seoi2 gwo2	shui3 guo3	Fruit. Cantonese speakers would normally say .	Cantonese and Mandarin
大家	大家	daai6 gaa1	da4 jia1	Everybody	Cantonese and Mandarin
大小	大小	daai6 siu2	da4 xiao3	size, proportions	Cantonese and Mandarin
大話	大話	daai6 waa6	da4 hua1		Cantonese and Mandarin
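For anyone who wants to work with rows in this shape, here is a minimal Python sketch that turns tab-separated lines into a word-to-Jyutping mapping. It assumes the column order of the header row above; the function name `parse_snippet` and the `SAMPLE` string are made up for illustration and are not part of the extraction script or of Module:zh.

```python
def parse_snippet(text):
    """Map each headword to its Jyutping from tab-separated rows.

    Assumed column order (taken from the header row of the snippet):
    Word, Word (trad/simp), Jyutping, Pinyin, Definition, Usage.
    """
    table = {}
    for line in text.splitlines():
        cols = line.split("\t")
        if len(cols) < 3 or cols[0] == "Word":
            continue  # skip the header row and rows with too few columns
        word, jyutping = cols[0].strip(), cols[2].strip()
        if word and jyutping:
            table[word] = jyutping
    return table


# Two rows copied from the snippet, plus the header (illustrative input only).
SAMPLE = (
    "Word\tWord (trad/simp)\tJyutping\tPinyin\tDefinition\tUsage\n"
    "香水\t香水\thoeng1 seoi2\txiang1 shui3\tperfume\tCantonese and Mandarin\n"
    "水星\t水星\tseoi2 sing1\t\tMercury (Planet)\tCantonese and Mandarin\n"
)

data = parse_snippet(SAMPLE)
```

Only the Jyutping column is kept, so rows with an empty Pinyin or Definition field still produce an entry.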

Cantodata


@Wyang How did you manage to extract it? Is Adam Sheik happy with it or did he give it himself to you? --Anatoli T. (обсудить/вклад) 23:58, 7 August 2014 (UTC)

CEDict data could also be used for Mandarin. There may be some occasional terms that would not meet CFI. --Anatoli T. (обсудить/вклад) 00:01, 8 August 2014 (UTC)

I used the script at User:Wyang/code to extract it - it is still running and there are about 20,000 left (another 30 min). No, this doesn't have Adam Sheik's permission, which is why I only gave a snippet view of it. The pronunciation data (Jyutping) is uncopyrightable, and is potentially useful here. We can 1) use a bot to automatically add Cantonese pronunciations to existing Chinese entries; 2) make {{zh-new}} look up in this data whenever a new Chinese entry is created. Wyang (talk) 00:21, 8 August 2014 (UTC)

Awesome. Very good idea. Please consider extracting CEDict as well; the whole dictionary could be used to create Chinese entries here. WWWJDIC or EDICT data could similarly be used for Japanese. --Anatoli T. (обсудить/вклад) 00:31, 8 August 2014 (UTC)

They are now uploaded: Special:PrefixIndex/Module:zh/data/Jyutping_word. Wyang (talk) 03:42, 8 August 2014 (UTC)

Thank you, but for me it's easier to access the site directly. :) I'm not sure I'll personally be able to use these files. --Anatoli T. (обсудить/вклад) 12:42, 8 August 2014 (UTC)

I haven't incorporated these in the current infrastructure yet... Let me do the integration soon... Wyang (talk) 10:25, 9 August 2014 (UTC)

Both utilities are done: [1], [2]. Wyang (talk) 01:16, 11 August 2014 (UTC)

Great stuff! However, I am a bit concerned about editors not familiar with Cantonese automatically generating both Mandarin and Cantonese, relying only on automatic Pinyin and Jyutping generation. Perhaps the automatic Jyutping should be a parameter, e.g. c=y? (Mandarin could be disabled by m=n for Cantonese-only entries?) Also, I'm not sure what you used to update Jyutping in existing entries, as you did in [3]. --Anatoli T. (обсудить/вклад) 01:34, 11 August 2014 (UTC)

I think the benefit of allowing automatic generation will be greater - it's quite hard to get it wrong. Both can be disabled by |m/c=-. I used the yue_for_bot function in Module:zh to do the latter. Wyang (talk) 01:41, 11 August 2014 (UTC)

I think I understand it now. It actually checks Jyutping for the whole word and adds it only when it can find a match? It's great then! --Anatoli T. (обсудить/вклад) 06:57, 11 August 2014 (UTC)
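To make the behaviour discussed above concrete, here is a rough Python sketch of a whole-word lookup that can be switched off with `-`. It is not the actual `yue_for_bot` code in Module:zh (which is a Lua module); the function name `cantonese_reading`, its parameter `c`, and the one-entry `data` table are assumptions for illustration only.

```python
def cantonese_reading(word, data, c=None):
    """Return the Jyutping to use for `word`, or None if none should be added.

    c=None -> look the whole word up in `data`
    c="-"  -> Cantonese explicitly disabled by the editor
    other  -> editor-supplied reading, used as given
    """
    if c == "-":
        return None
    if c:
        return c
    # Whole-word match only: there is no character-by-character fallback,
    # so a word missing from the data yields None rather than a guess.
    return data.get(word)


# One entry taken from the snippet view; a real table would be far larger.
data = {"香水": "hoeng1 seoi2"}

found = cantonese_reading("香水", data)            # whole word is listed
missing = cantonese_reading("香", data)            # single character is not
disabled = cantonese_reading("香水", data, c="-")  # switched off by the editor
```

Returning nothing for an unlisted word, instead of stitching a reading together from single characters, is what keeps automatic generation from silently guessing.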

New additions


@Justinrleung, Suzukaze-c, Atitarev, kc_kennylau (I know Justin is on 뷁 (bwek)...)

I incorporated the CC-Canto and CC-CEDICT Cantonese data into our existing Cantonese data today, so that now it has increased to >134,000 entries with pronunciation. Examples of the addition are this and this.

I'd love to know what you guys think. There are some errors in their data, especially in words with 多音字 (characters with multiple readings), and I'm not sure whether the new changes are worthwhile. There are two options: either we keep the new data and check all the incoming entries in Category:Cantonese lemmas to make sure the Cantonese pronunciation is correct, or we revert to the old revisions because the new data is too unreliable.

Wyang (talk) 13:50, 22 October 2017 (UTC)
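One way to picture the first option is a merge that keeps existing readings, adds only new words, and sets aside any disagreement for manual checking. The Python sketch below is an assumption about how such a merge could look, not the procedure actually used; `merge_readings` is a made-up name, and the incoming reading for 死水 is invented purely to show a conflict.

```python
def merge_readings(existing, incoming):
    """Merge `incoming` into `existing` without overwriting anything.

    Returns (merged, conflicts): `merged` keeps every existing reading and
    adds words that were missing; `conflicts` maps each word whose two
    sources disagree to the pair (existing reading, incoming reading).
    """
    merged = dict(existing)
    conflicts = {}
    for word, reading in incoming.items():
        if word not in merged:
            merged[word] = reading
        elif merged[word] != reading:
            conflicts[word] = (merged[word], reading)
    return merged, conflicts


# Existing readings are from the snippet view; the incoming ones are illustrative.
existing = {"香水": "hoeng1 seoi2", "死水": "sei2 seoi1"}
incoming = {"死水": "sei2 seoi2", "白天": "baak6 tin1"}

merged, conflicts = merge_readings(existing, incoming)
```

The `conflicts` mapping is the part a human would review, which is where words with 多音字 are most likely to show up.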

I'm not sure either. Maybe we can wait and see. —suzukaze (t・c) 03:40, 23 October 2017 (UTC)

Most are good, but not 100%. The approach in both CC-Canto and CC-CEDICT seems to be that whatever works in Mandarin will work in Cantonese, even if a term is colloquial or even regional. The quality of the Jyutping transliteration sometimes depends on the contributor, who could have used the most frequent reading or a random one, and an error may go unnoticed by others (native speakers or advanced learners). On the upside, Cantonese entries get checked at Wiktionary, and it may not take very long for an error to be noticed. I don't have a strong objection, but let's see what others think. An option is to add {{attention|zh|Cantonese jyutping needs to be checked}} for each automated entry. --Anatoli T. (обсудить/вклад) 12:21, 23 October 2017 (UTC)

If we're going to tag CC-Canto data with {{attention}} we might as well tag all uses of {{zh-new}}—sometimes I don't notice problems with pinyin, sometimes the Hokkien data has peculiarities, etc.... I don't think that approach is practical. The CantoDict data isn't perfect either; corrections from y- to j- (the /j/ initial) have been made a few times. —suzukaze (t・c) 21:06, 23 October 2017 (UTC)

@Wyang, Atitarev, Suzukaze-c: (Kinda late, but I'm back from the break!) I think the data should be fine. There are just some problems, such as not indicating tone change and using full-width commas. Overall, it's probably about as problematic as Cantodict is. — justin(r)leung { (t...) | c=› } 04:20, 28 October 2017 (UTC)

@Justinrleung, Suzukaze-c, Atitarev: Ok, thanks all. We will just keep an eye on it for now, and I will run some regular checks to try to catch the errors (like User:Wyang/yue-char-pron). (Btw, glad to see you are 백 (baek), Justin!) Wyang (talk) 05:07, 28 October 2017 (UTC)
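Two of the problems mentioned in this thread, y- written for the /j/ initial and full-width commas, are mechanical enough to scan for. The Python sketch below is only a guess at what such a check could look like and is unrelated to User:Wyang/yue-char-pron; the name `find_issues` is made up. Missing tone changes cannot be found this way, since they need knowledge of the word itself.

```python
import re

FULLWIDTH_COMMA = "\uff0c"  # "，", as opposed to the ASCII comma


def find_issues(reading):
    """List mechanical problems found in a Jyutping string."""
    issues = []
    if FULLWIDTH_COMMA in reading:
        issues.append("full-width comma")
    # Split on whitespace and on either kind of comma to get the syllables.
    for syllable in re.split(r"[\s,\uff0c]+", reading):
        # Jyutping writes the /j/ initial as "j", so no syllable starts with "y".
        if syllable.startswith("y"):
            issues.append("y- initial: " + syllable)
    return issues


clean = find_issues("daai6 jan4")
yale_style = find_issues("daai6 yan4")
```

Each entry of the data can be passed through `find_issues`, and any non-empty result put on a list for manual review.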

@Wyang Could Kaifang Cidian data be added as well? —suzukaze (t・c) 04:30, 11 November 2017 (UTC)

@Suzukaze-c Yes, I will add it. Wyang (talk) 04:55, 11 November 2017 (UTC)
