Module talk:User:Kiril kovachev/bg-pronunciation/testcases


Syllabification failing tests


@Kiril kovachev 4 of the 5 syllabification tests currently failing should actually pass:

  • после --> по-сле, because the "сл" cluster is fricative (lower sonority) + sonorant (higher sonority). This also maximizes the number of open syllables - 2, instead of 1 if we had split it into пос-ле. The same argument holds for the "най-после" test case.
  • шам-фъстък --> the cluster is "ст", which is fricative + stop, and it's one of the "ambiguous" cases where you could either keep it or split it. Because we treat stops as having higher sonority than fricatives, we keep it. Again, this results in a syllabification with at least one open syllable - фъ-стък - which is the maximum attainable for this word.
  • късно --> the cluster has the same structure as the one in после, except the sonorant is "н".
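The open-syllable counts used in the arguments above can be checked mechanically. This is a minimal sketch, not the module's code: the function name `count_open_syllables` is hypothetical, and the vowel set adds the letters ю and я to the six vowels discussed on this page as an assumption.

```python
# Vowel letters; ю and я are an assumption added for completeness.
VOWELS = set("аъоуеиюя")

def count_open_syllables(syllabified: str, sep: str = "-") -> int:
    """Count syllables that end in a vowel (open syllables)."""
    syllables = [s for s in syllabified.split(sep) if s]
    return sum(1 for s in syllables if s[-1] in VOWELS)

# Compare the two candidate syllabifications of "после".
print(count_open_syllables("по-сле"))   # 2
print(count_open_syllables("пос-ле"))   # 1
```

For шамфъстък, the same count gives 1 for шам-фъ-стък and 0 for шам-фъс-тък.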

The only iffy syllabification is зе-ни-тно. Although it happens to maximize the number of open syllables (3), the "тн" cluster is non-existent word-initially, so the proper syllabification should, indeed, have been: зе-нит-но. I think we just need to add "тн" to the list of clusters to break, even though its consonants are in rising sonority order: sonority(т) < sonority(н).

Chernorizets (talk) 22:49, 27 August 2023 (UTC)

@Chernorizets Thanks kindly for the overview. I am unfortunately a little weak on the understanding of the sonority rules themselves, and the nuances you're evidently very well-versed in sadly evade me :'). What is good to know is that this isn't a big problem; I'll add the тн cluster to the ones to break, and then fix the test cases. This is looking bright as ever for the accuracy side of things! Kiril kovachev (talk · contribs) 23:25, 27 August 2023 (UTC)
@Kiril kovachev the sonority hierarchy presented in the textbook goes like this, from lowest to highest:
  • fricatives - с, з, ш, ж, в, ф, х
  • stops and affricates - б, п, г, к, д, т, ч, ц, дж, (дз)
  • sonorants - л, м, н, р, й, (/w/)
  • vowels - а, ъ, о, у, е, и
This isn't perfect, and in fact the Sofia University prof mentioned it's not always true, but it holds more often than not. Chernorizets (talk) 02:10, 28 August 2023 (UTC)
@Chernorizets I see, and we tend to break syllables when the sonority stops increasing, right? Kiril kovachev (talk · contribs) 00:29, 29 August 2023 (UTC)
@Kiril kovachev right. As you know, we also maintain a list of exceptional clusters where sonority does increase according to this scheme, but which we still want to break up.
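As a rough illustration of the rule just described, here is a sketch that splits a consonant cluster between two vowels into a coda and an onset. It is not the module's actual code: `split_cluster` and `BREAK_CLUSTERS` are hypothetical names, the exception set holds only the two clusters named on this page, the digraphs дж and дз are left out, and letters outside the table (such as щ) are not handled. Prefix handling is ignored entirely.

```python
# Sonority ranks from the hierarchy on this page, lowest to highest:
# fricatives, then stops and affricates, then sonorants.
SONORITY = {}
for rank, letters in enumerate(["сзшжвфх", "бпгкдтчц", "лмнрй"]):
    for letter in letters:
        SONORITY[letter] = rank

# Hypothetical excerpt of the to-break exception list.
BREAK_CLUSTERS = {"дн", "тн"}

def split_cluster(cluster: str) -> tuple[str, str]:
    """Split an intervocalic consonant cluster into (coda, onset)."""
    if not cluster:
        return "", ""
    start = len(cluster) - 1
    # Grow the onset leftwards while sonority keeps rising towards the vowel.
    while start > 0 and SONORITY[cluster[start - 1]] < SONORITY[cluster[start]]:
        start -= 1
    # Exceptional clusters are broken even though their sonority rises.
    while cluster[start:start + 2] in BREAK_CLUSTERS:
        start += 1
    return cluster[:start], cluster[start:]

print(split_cluster("сл"))  # ('', 'сл')  as in по-сле
print(split_cluster("мф"))  # ('м', 'ф')  as in шам-фъ-стък
print(split_cluster("тн"))  # ('т', 'н')  as in зе-нит-но
```

Keeping the exceptions in a separate set means a cluster such as "тн" can be added without touching the sonority table.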
The other exception is related to handling morphological prefixes, as in възможност: въз-мо-жност vs въ-змо-жност. According to the SU prof, we might be a bit overzealous with our support for prefixes at the expense of maximizing open syllables, but the "зм" cluster is rare word-initially (змей, змия and derivatives, can't think of others), and according to him that's a good reason to break up a cluster. Technically the same is true for "жн", so another syllabification could've been въз-мож-ност (no open syllables) or въ-змож-ност (one open syllable). Personally, the no-open-syllable version is what seems most natural to me, but that's irrelevant - we should just have a consistent set of rules that we apply to each word. If we say "break up clusters with rising sonority which rarely occur word-initially", then we define what "rarely" means and we do that. If we say "maximize open syllables, even if that results in syllable onsets that are rare word-initially", then we do that. The number of syllables doesn't change either way.
P.S. I wrote a little program that creates a frequency-ordered list of word-initial consonant clusters based on the Bulgarian National Corpus's frequency-ordered wordlist. The most frequent one is "пр" (42441 occurrences). "зм" has 237, "жн" has 40. Chernorizets (talk) 02:27, 29 August 2023 (UTC)
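A program of that kind could look roughly like the sketch below. It is not the program mentioned above: the function names are hypothetical, the input is assumed to be (word, frequency) pairs, and the toy wordlist is made up. The post does not say whether the counts are word types or summed corpus frequencies; this sketch sums frequencies.

```python
from collections import Counter

# Vowel letters; ю and я are included as an assumption.
VOWELS = set("аъоуеиюя")

def initial_cluster(word: str) -> str:
    """Return the consonant letters that precede the first vowel."""
    cluster = ""
    for ch in word.lower():
        if ch in VOWELS or not ch.isalpha():
            break
        cluster += ch
    return cluster

def cluster_frequencies(wordlist):
    """Sum frequencies per word-initial cluster of 2+ consonants, most frequent first."""
    counts = Counter()
    for word, freq in wordlist:
        cluster = initial_cluster(word)
        if len(cluster) >= 2:
            counts[cluster] += freq
    return counts.most_common()

# Made-up toy data, only to show the shape of the input and output.
toy = [("при", 10), ("про", 5), ("змия", 3), ("ала", 7), ("маса", 2)]
print(cluster_frequencies(toy))  # [('пр', 15), ('зм', 3)]
```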
@Chernorizets I was going to suggest that we manually override syllabifications if we deem there to be a more natural, equally-correct syllabification to replace them, like in the example you raised above, but this could also harm standardization; what do you think? Kiril kovachev (talk · contribs) 11:56, 30 August 2023 (UTC)
By the way, this is likely entirely hypothetical, but I noticed we already have the voiced version of this cluster, дн, in the to-break exceptions: does this logic of splitting depend on the voicedness of the consonants, or not? I imagine we'd also need to break фн, фм, etc., which are equivalent to вн, вм, but with unvoiced sounds. Needless to say, this is unlikely to make any big difference, given the relative scarcity of ф, and may in fact be wrong to do altogether, but what do you think? Kiril kovachev (talk · contribs) 23:30, 27 August 2023 (UTC)
@Kiril kovachev the to-break exception list is best-effort. My guiding principle there was to select commonly-occurring clusters, as checked in a representative frequency-ordered wordlist (I was using the Bulgarian National Corpus). Those were the ones I could think about at the time. There might be other ones that would need to be added, and the list allows for that. I've been adding some targeted tests recently for these clusters to see if any of them should actually be removed from the list. The ones with "в" are particularly interesting.
As for "фн" and "фм", the first several most common words appear to be: цъфна, разцъфна, релефно, нискотарифни, антропоморфни, аморфност, апокрифно and their other non-lemma forms. Outside of the first 3-4, the others aren't all that common, so use your judgment. "фм" in particular is rare - primarily personal names (Хофман, Кауфман) and loanwords (шлайфмашина, хофмаршал, цефметазол). Chernorizets (talk) 02:24, 28 August 2023 (UTC)
@Chernorizets Right, that pretty much aligns with expectations. In the case of цефметазол, is the correct syllabification це.фме.та.зол? Or цеф.ме.та.зол? I guess it's the former, as this would increase the number of open syllables? But that depends on whether фм is valid syllable-initially. What sequences are actually forbidden overall? Kiril kovachev (talk · contribs) 00:33, 29 August 2023 (UTC)
@Kiril kovachev for "forbidden" sequences, I'd refer to the syllable-breaking rules from the textbook (translated in the README file in my prototype). E.g. in a cluster of 2 consonants, if they are both stops or both sonorants, they should be split. In other cases, I wouldn't think of it in terms of "forbidden", but rather in terms of "naturalness". One way to define naturalness for a syllable - especially one in the middle of the word - is to say that its onset (the consonants before the vowel) is not "rare", and one way to define "rare" is "occurring word-initially at a low frequency". There are no Bulgarian words that start with "фм", so it's rare by that definition, and therefore you wouldn't want it to start a syllable. This would favor the syllabification цеф-ме-та-зол. Chernorizets (talk) 07:04, 29 August 2023 (UTC)
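The "rare onset" idea can be sketched as a check over a candidate syllabification. Everything here is illustrative: `has_rare_onset` and the `min_count` threshold are hypothetical, the counts are the few figures quoted on this page (with "фм" set to 0 because no word starts with it), and a cluster missing from the table is treated as count 0, so a real table would need to be complete. Single-consonant onsets are assumed to be always acceptable.

```python
# Vowel letters; ю and я are included as an assumption.
VOWELS = set("аъоуеиюя")

# Word-initial cluster counts quoted on this page; "фм" never occurs word-initially.
INITIAL_COUNTS = {"пр": 42441, "зм": 237, "жн": 40, "фм": 0}

def onset(syllable: str) -> str:
    """Return the consonants before the first vowel of a syllable."""
    result = ""
    for ch in syllable:
        if ch in VOWELS:
            break
        result += ch
    return result

def has_rare_onset(syllabified: str, counts: dict, min_count: int, sep: str = "-") -> bool:
    """True if a non-initial syllable starts with a cluster rarer than min_count."""
    for syllable in syllabified.split(sep)[1:]:
        cluster = onset(syllable)
        if len(cluster) >= 2 and counts.get(cluster, 0) < min_count:
            return True
    return False

print(has_rare_onset("це-фме-та-зол", INITIAL_COUNTS, min_count=1))   # True
print(has_rare_onset("цеф-ме-та-зол", INITIAL_COUNTS, min_count=1))   # False
```

With a stricter, equally hypothetical threshold of 100, въ-змо-жност would also be flagged, because "жн" has only 40 occurrences.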
@Chernorizets Right, that makes sense. Thanks again for the detailed explanation. :) Kiril kovachev (talk · contribs) 11:55, 30 August 2023 (UTC)