Module talk:bg-pronunciation

From Wiktionary, the free dictionary

-та and -ят

@Benwing2

  1. According to my grammar reference, the final stressed -та́ in definite forms is merged with the previous /t/ as in нощ->нощта́, ра́дост->радостта́
  2. The final -ят is pronounced as -'ът, as in ден->деня́т.

I couldn't confirm that feminine -та́ should be -тъ́ as in the forum and in the video I gave on your talk page. It could be individual or colloquial. --Anatoli T. (обсудить/вклад) 05:17, 17 March 2020 (UTC)Reply

The pronunciation of -та is historically problematic, because it doesn't reflect a singular form. It stands for both the nominative -та (-ta) /-tɐ/ and the accusative -тѫ (-tǫ) /-tɤ/. Nowadays, these are written the same in Standard Bulgarian and people no longer make grammatical distinction between the two forms. In some dialects, e.g. some archaic Danubian (Northwestern) and Šop/Torlak dialects, the two are still distinguished. Note that in these dialects -ѫ is pronounced more like -у /ʊ/, which makes the difference both grammatical and phonetic. For this reason, it has survived to the present day. Безименен 14:54, 31 March 2020 (UTC)Reply
The merging of -т- after щ, т is subjective and occasion dependent. It is certainly merged in fast or colloquial speech, but I think officially it's advisable to geminate the -t- in slow formal speech. Not sure though Безименен 15:01, 31 March 2020 (UTC)Reply

я and ю

@Benwing2, Guldrelokk: Hi, I've added a few cases of я and ю where they don't follow consonants, expecting a /j/, not /ʲ/. --Anatoli T. (обсудить/вклад) 22:36, 18 March 2020 (UTC)

fronting between palatal(ized) consonants

@Bezimenen, Bogorm, Atitarev I fixed the module to use [j] word initially for я/ю and changed the fronting logic for [a] and [u] to also apply between [j] (as it does in Russian), but maybe the latter isn't correct as the test case for яйцо́ uses [ɐ] not [æ]. What is the exact logic for fronting [a] -> [æ], [u] -> [ʉ]? Benwing2 (talk) 22:06, 29 March 2020 (UTC)Reply

@Benwing2: I think [jæjˈt͡sɛ] is good - I've changed my test case. Thanks for fixing. --Anatoli T. (обсудить/вклад) 01:06, 30 March 2020 (UTC)

магази́н (magazín)

@Bezimenen, Bogorm, Benwing2, Ted Masters: Hi. Is the pronunciation of both unstressed vowels - [məɡɐˈzin] accurate?

It looks like the module is using the Russian reduction of the unstressed "а". --Anatoli T. (обсудить/вклад) 00:30, 1 April 2020 (UTC)Reply

@Atitarev, Benwing2: The pronunciation of магази́н (magazín) is correct; it is pronounced the same way in Bulgarian as it is in Russian. I'm guessing that's because the word has entered both languages through German (or possibly French). --Ted Masters (talk) 15:36, 2 April 2020 (UTC)
@Ted Masters, Bezimenen, Bogorm, Benwing2 Thanks, @Ted Masters! My question is about the pronunciation of the unstressed "а". Russian: [məɡɐˈzʲin]. I think the Bulgarian should be [mɐɡɐˈzin], rather than [məɡɐˈzin], as it was in this revision. --Anatoli T. (обсудить/вклад) 22:53, 2 April 2020 (UTC)
@Atitarev: To my knowledge, both [ɐ] and [ə] could be used to transcribe unstressed “а” or “ъ” in IPA (a near-open central vowel?) for Bulgarian. And to me, the first unstressed “а” sounds more squeezed (flat, unrounded; kinda like the unstressed “а” in къща (kǎšta) [kɤʃtə]), so [ə] would be correct, I think, when transcribing that sound; while the second “а” sounds a bit more rounded, hence the [ɐ]. Or maybe my hearing is bad. And there is a count form for магазин — магазина, same as definite (obj.). Also it doesn't have a vocative form. --Ted Masters (talk) 13:37, 3 April 2020 (UTC)Reply
@Ted Masters: Thanks for your response! We will address the vowel reduction more seriously later. I have removed the vocative. --Anatoli T. (обсудить/вклад) 21:41, 3 April 2020 (UTC)Reply

Fixed issues

(Notifying Atitarev, Bogorm, Bezimenen, Nauka, Ted Masters): I fixed the remaining issues with the testcases; all are passing now. Please feel free to add more testcases. Benwing2 (talk) 06:15, 12 April 2020 (UTC)

Prepositions and prefixes

Inspired by Talk:свръхякане (Notifying Benwing2, Bogorm, Bezimenen, Nauka, Ted Masters): Hello all,

Based on listening to 35 Minutes of Bulgarian Listening Comprehension for Beginner and on some other reading, I think the rules are as follows. I will just use some examples:

  1. в о́фиса = фо́фиса
  2. в Укра́йна = фукра́йна
  3. във вто́рник (= въф фто́рник) = въфто́рник
  4. във Фра́нция (= въф Фра́нция) = въФра́нция
  5. в Ю́жна Бълга́рия = фЮ́жна Бълга́рия
  6. със снимка = съсни́мка
  7. в източното крило = фисточното крило

Please comment. @Kiril kovachev, you're welcome to join. We still have to resolve the question at the talk page of свръхя́кане (svrǎhjákane). --Anatoli T. (обсудить/вклад) 02:15, 18 April 2020 (UTC)Reply
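
The examples above suggest that the vowelless prepositions attach to the following word. A minimal Python sketch of that idea follows (Python rather than the module's Lua, so that it runs standalone). The function name, the vowel set, and the voiced-obstruent set are assumptions made for illustration, as is the choice to keep [v] before voiced obstruents; the examples above only show the [f] outcome.

```python
# Hypothetical sketch: attach vowelless clitics (в, с) to the next word.
# Input is a list of IPA words; nothing here is taken from the module itself.
VOWELS = set("aɤɔuɛiɐo")
VOICED_OBSTRUENTS = set("bdɡzʒ")  # assumption: [v] is kept before these


def attach_clitics(words):
    phrase = []
    prefix = ""
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        vowelless = not any(ch in VOWELS for ch in word)
        if vowelless and not is_last:
            # Look at the first sound of the next word, skipping stress marks.
            first = words[i + 1].lstrip("ˈˌ")[:1]
            if word == "v" and first not in VOICED_OBSTRUENTS:
                word = "f"
            prefix += word + "‿"
        else:
            phrase.append(prefix + word)
            prefix = ""
    return " ".join(phrase)


print(attach_clitics(["v", "ˈɔfisɐ"]))
print(attach_clitics(["s", "ˈsnimkɐ"]))
```

The sketch puts the tie before the stress mark and does not merge the doubled consonants of examples 3, 4 and 6; both would need a decision before anything like this went into the module.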

Accuracy

@Benwing2 Hi, I've been working with User:Kylebgorman on WikiPron, a multilingual pronunciation dataset that draws from Wiktionary. I am wondering, is this module reliable enough to be automatically placed on entries? If it isn't, there's some bot-based fixes I'd like to make:

  • Merging all /l/ to /ɫ/ (since they are allophones)
  • Merging all /d̪, t̪/ to /d, t/ (since it's just another way to represent them, and Bulgarian doesn't seem to contrast dental vs. alveolar)
  • Fixing cases where ц is written /ts/ not /t͡s/ (this may be a little harder since I have to figure out alignment)

These have been verified by Bulgarian native speakers. I figured I should ask if the module is ready for automatic deployment, since that would avoid these bot fixes. —AryamanA (मुझसे बात करेंयोगदान) 23:28, 3 August 2020 (UTC)Reply

@AryamanA, Atitarev I think the module is probably accurate enough. It needs a bit of help to properly determine how to pronounce certain final vowels, but that can be automated by bot. If there are problems with the module, we should fix it and not resort to manual pronunciations. I was planning at some point to do a bot run to convert manual pronunciations to automatic ones. This can't just be done by erasing the manual ones. Instead, you have to compare the manual to the automatic, making allowances for certain issues like the ones you mention above, and only convert manual to automatic if they match. The remaining cases have to be verified by hand. (I did this for Russian with Anatoli's help a few years ago.) Benwing2 (talk) 00:13, 4 August 2020 (UTC)Reply
@AryamanA BTW I see the module has some problems with adding vowel reductions in secondarily-stressed syllables. I'll see if I can fix that tonight. Benwing2 (talk) 00:15, 4 August 2020 (UTC)Reply
@AryamanA: /ɫ/ should only be used where appropriate. For example, учи́лище (učílište) is [uˈt͡ʃilʲiʃtɛ]. I agree with User:Benwing2's approach. We need to analyse and understand the manual transcriptions; they can be replaced with the automated ones if the two match, allowing for allophones. --Anatoli T. (обсудить/вклад) 01:03, 4 August 2020 (UTC)
@Atitarev, Benwing2: Ha, thank you for pinging Anatoli, slipped my mind. Yeah, I see now the situation is trickier for phonetic transcription; it seems only the last two bullet points are automatable then--and maybe it would be better to wait for the module w/ manual intervention as a better overall solution. Let me know if I can help in any other way. —AryamanA (मुझसे बात करेंयोगदान) 02:16, 4 August 2020 (UTC)Reply
@AryamanA: I support the full automation of Bulgarian pronunciations, though there are a few corner cases, like reductions of "а" in both stressed and unstressed positions. We need more involvement of native speakers who should know better when the automation fails or a phonetic respelling is required. BTW, Hindi could use modularisations for inflections and something else, like nuqtaless forms. Inflections are straightforward and can be fully automated, IMO. There's also interest in the Hindi Wiktionary but I am clueless about how to import all important modules and templates to make them work there. --Anatoli T. (обсудить/вклад) 02:24, 4 August 2020 (UTC)
@Atitarev: Of course, been working on Module:hi-conj, will get around to it all eventually. —AryamanA (मुझसे बात करेंयोगदान) 04:51, 4 August 2020 (UTC)Reply
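
The conversion procedure described in this thread (compare each manual transcription with the automatic one, allow for known notational differences, and convert only on a match) could be sketched roughly as below. This is an illustrative Python sketch, not the bot that was actually used; the function names and the set of ignored marks are assumptions. It treats the dental diacritic and the affricate tie bar as notation only, and deliberately does not merge [l] and [ɫ], since the thread concludes that this distinction needs human review.

```python
import unicodedata

# Assumption: these combining marks are purely notational for this comparison.
# U+032A = dental diacritic (d̪, t̪); U+0361 = affricate tie bar (t͡s).
IGNORED_MARKS = {"\u032a", "\u0361"}


def normalize(ipa):
    """Strip delimiters and ignorable marks so equivalent notations compare equal."""
    ipa = unicodedata.normalize("NFD", ipa.strip("[]/ "))
    return "".join(ch for ch in ipa if ch not in IGNORED_MARKS)


def can_convert(manual, automatic):
    """True if the manual transcription may be replaced by the automatic one."""
    return normalize(manual) == normalize(automatic)


# Notation-only differences match; an l/ɫ difference is left for hand checking.
print(can_convert("/d\u032aɔm/", "[dɔm]"))
print(can_convert("[uˈt\u0361ʃiliʃtɛ]", "[uˈt\u0361ʃiɫiʃtɛ]"))
```

Anything for which can_convert returns False would go onto the list to be verified by hand, as described above.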

Stress placement and morphemic boundaries

@Bezimenen, Benwing2 The module places the stress mark without regard for morphemic boundaries, as in избия (izbija), where it appears before the /z/ even though the prefix is "из-". Since it appears between the prefix and the root in manually added pronunciations, e.g. at избира (izbira) and избождам (izboždam), I am assuming that the transcription for избия (izbija) is wrong. Could Bulgarian users confirm that this is the case? If so, since the module cannot know where morpheme boundaries lie, manual respellings would be required for words like избия (izbija), e.g. {{bg-IPA|изˈбия}} > IPA(key): [izˈbijɐ]. Martin123xyz (talk) 08:20, 25 August 2021 (UTC)

@Martin123xyz: I believe that is because the pronunciation module accounts for how words are split during pronunciation, not morphologically. Slavic languages in general have a tendency towards rising sonority, hence, when possible, speakers try to open up syllables (i.e. to end them in a vowel or in a syllabic resonant). In cases where a morpheme ends in a fricative (such as /z/), the way to do that is by "shifting" the fricative from the coda of the morpheme to the onset of the next syllable. Since fricatives are near the bottom of the sonority scale, this normally adjusts the sonority in a rising trajectory. If you will, think how one would pronounce гозба. Morphologically, it should be го́-сть-ба but phonetically it is split as го́·зба. Безименен (talk) 14:47, 26 August 2021 (UTC)
@Bezimenen: Thank you for the answer. I understand the principle of onset maximization and of rising sonority in syllable rhymes but I find it surprising that speakers would do this even where the morphology is relatively transparent. For me, the most natural way to syllabify гозба is гоз-ба, but perhaps I have been studying theoretical linguistics for long enough for my awareness of words' etymology to override my native speaker intuitions. However, I imagine that different speakers would have different tendencies regardless of linguistic education. In any case, it remains that standard Macedonian recommends splitting words with an internal consonant cluster in such a way as to have a consonant in each syllable, e.g. гоз-ба. Based on what I have gleaned so far, the same seems to apply to Bulgarian, as demonstrated by material for little children:
* Сричките в думи - evidently, "мишка" and "въртележка" are meant to be syllabified as "миш-ка" and "вър-те-леж-ка" without transferring the fricative to the following syllable. Arguably this may be necessary to simplify things for first-graders, but if that is what they are taught at such a young age, I was wondering whether they don't end up perceiving syllable structure that way as adults.
There is also Словоред which splits syllables at morphemic boundaries, and the same is recommended for transferring syllables to the next line typographically (пренасяне на думите на нов ред), notably in rule 3. Admittedly, this typographic practice is only partly related to phonological syllable structure.
On the other hand, this considerably more serious source syllabifies words as you suggest, shifting as much as possible to the onset of the next syllable, as in "какво" on p. 109. It cites "Граматика на съвременния български книжовен език" at the start of the section, but I am yet to find this work. Perhaps you could share what it says about syllable boundaries and add some more sources? What do @Ted Masters and @Atitarev think?

Martin123xyz (talk) 07:40, 27 August 2021 (UTC)

@Martin123xyz: In principle, nothing of what you have written is wrong, but I think you are confusing the rules for hyphenation of words in writing with those in pronunciation. Note that Literary Bulgarian tends to have morphological orthography. Typically, though, people don't follow the morphological principle in speech (unless they want to sound hyperpunctual). For example, the mentioned въртележка is normally pronounced вър·те·ле·шка. The suggested syllabifications that you have found are what one should apply in writing specifically.
In regard to relevant literature, Граматика на съвременния български книжовен език, Том 1 deals with phonology. The hyphenation of words is addressed in section 128 (pp. 168-170). I can't copy all the text, but here is a relevant portion of what the authors say:

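Основната причина за трудността на сричкоделенето се състои в това, че членението на думата на срички не е свързано със смисловото и граматичното значение на думата, както е при членението на морфеми. Между сричката и морфемата няма лингвистично съответствие. Членението на двете единици се подчинява на различни принципи. Затова напр. думата родители се члени на морфеми род-и-тел-и, а на срички - ро-ди-те-ли. [The main reason syllabification is difficult is that the division of a word into syllables is not tied to the word's semantic and grammatical meaning, as the division into morphemes is. There is no linguistic correspondence between the syllable and the morpheme. The division of the two units follows different principles. That is why, for example, the word родители is divided into morphemes as род-и-тел-и, but into syllables as ро-ди-те-ли.]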

Later, there are rules that should be followed and it is also mentioned that:

Представките и надставките не запазват своята морфологична самостоятелност [...] [Prefixes and suffixes do not retain their morphological independence [...]]

It is also said that the splitting of consonant clusters permits alternatives. The example гозба that I gave earlier is probably one of them.
NB Bulgarian сричка (srička), дума (duma), членение (členenie) correspond to Macedonian слог (slog), збор (zbor), поделба (podelba). Безименен (talk) 12:49, 27 August 2021 (UTC)Reply
@Bezimenen Thank you for the detailed answer. I think that what you quote from Граматика на съвременния български книжовен език is pretty conclusive and justifies the transcription of избия (izbija) as it currently stands. In any case, I think that this discussion will be instructive for other users who may also wonder about the relationship between syllables and morphemes, especially Bulgarian editors who might want to add the template to new entries and override the automatic transcriptions, thus producing inconsistencies like the one that currently exists between избия (izbija) and избира (izbira). Martin123xyz (talk) 13:04, 27 August 2021 (UTC)Reply
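
To make the contrast between morphemic and phonetic splitting concrete, here is a rough Python sketch of onset-maximizing syllabification on Bulgarian spelling. It is only an illustration of the principle discussed above, not the module's algorithm and not the rule set from the grammar: the sonority classes, the choice to treat all obstruents as one class, and the function name are all assumptions, and ь, stress marks and dz/дж digraphs are not handled.

```python
# Hypothetical sonority classes (higher = more sonorous); obstruents share one class.
VOWELS = set("аеиоуъюя")
SONORITY = {}
SONORITY.update({c: 1 for c in "бвгджзкпстфхцчшщ"})
SONORITY.update({c: 2 for c in "мн"})
SONORITY.update({c: 3 for c in "лр"})
SONORITY["й"] = 4


def syllabify(word):
    """Split a word so each onset is as long as non-falling sonority allows."""
    letters = word.lower()
    nuclei = [i for i, ch in enumerate(letters) if ch in VOWELS]
    cuts = []
    for left, right in zip(nuclei, nuclei[1:]):
        start = right
        # Extend the onset leftwards while sonority does not fall towards the vowel.
        while start - 1 > left:
            prev, cur = letters[start - 1], letters[start]
            if start != right and SONORITY.get(prev, 1) > SONORITY.get(cur, 1):
                break
            start -= 1
        cuts.append(start)
    pieces, begin = [], 0
    for cut in cuts:
        pieces.append(word[begin:cut])
        begin = cut
    pieces.append(word[begin:])
    return pieces


print("-".join(syllabify("родители")))
print("-".join(syllabify("избия")))
```

Under these assumptions избия comes out with the з in the onset of the second syllable, which is the placement the automatic transcription uses, whereas a morphemic split would keep it with the prefix.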

Velarized "l", орле

@Bezimenen, Benwing2 As far as I know, the Bulgarian lateral is velarized (and contemporarily vocalized into /w/) in all positions except before front vowels. Since in орле, it occurs before an "е", I believe that the pronunciation should be [orˈlɛ] or even [orˈlʲɛ]. An academic source which acknowledges that the lateral is not velarized before front vowels would be the following:

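Another article by the same authors states: "не включваме също така веларния алофон [ł], типичен за западните говори (разложко-бански говор – мЛеко, Лек и т.н." (in English: "we also do not include the velar allophone [ł], typical of the western dialects (the Razlog-Bansko dialect: мЛеко, Лек, etc."). Here, they are suggesting that [ł] is dialectal before the front vowels, rather than in general.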

The allophones [ł] and [l] belong to the same phoneme, so they can be transcribed in the same way if words are being rendered phonemically, but based on the treatment of училище and the use of square brackets rather than slashes, they are being rendered phonetically instead. I therefore believe that орле, лек etc. need to be fixed. Martin123xyz (talk) 10:09, 27 August 2021 (UTC)Reply

Recent changes

I forgot to explain my recent changes to the module in the description, so I thought I'd do it here instead. Many transcriptions were straight up incorrect and belonged to Russian, not Bulgarian. Reduction, palatalization, vowel qualities were all messed up and heavily russianized for whatever reason. Second, the module lacked several features, such as distinguishing soft and palatalized l, elision of n preceding fricatives, etc. This update not only fixes these mistakes, but also presents a less conservative analysis of the Bulgarian language that suits it better as its own language, as opposed to just a branch of the Slavic family. ChirumiruChirumiruGenkiChirumiru (talk) 14:00, 24 June 2022 (UTC)Reply

Differentiation between [l] and [lʲ]

The current module doesn't seem to differentiate between [l] and [lʲ] and implies that the L-sound in front of "е" and "и" is pronounced the same way as in front of "я", "ю" and "ьо". While this is the case in Russian, and might be the case in certain Eastern Bulgarian dialects, the majority of Bulgarian speakers pronounce "ле" and "ли" [lɛ] and [li] and not [lʲɛ] and [lʲi]. Ntilev (talk) 12:32, 27 April 2023 (UTC)

I second that. Unless someone could provide links to papers demonstrating an analysis of [l] before front vowels as [lʲ] in the standard language, it produces unnatural IPA transcriptions for many Bulgarian words. Chernorizets (talk) 08:16, 13 July 2023 (UTC)

[lʲ] before "е" and "и" is incorrect in standard Bulgarian

The IPA symbols [lʲ] and [ʎ] denote the voiced palatal lateral approximant, which in Bulgarian occurs only before "я", "ю" and "ьо". Per the Bulgarian phonology Wikipedia article:

 /l/ varies: one of its allophones, involving a raising of the back of the tongue and a lowering of its middle part (thus similar or, according to some scholars, identical to a velarized lateral), occurs in all positions, except before the vowels /i/ and /ɛ/, where a more "clear" version with a slight raising of the middle part of the tongue occurs. The latter pre-front realization is traditionally called "soft l" (though it is not phonetically palatalized). In some Western Bulgarian dialects, this allophonic variation does not exist.
 

As far as I can tell, the following code line needs to change from:

term = rsub(term, "([kɡxl])([ieɛ])", "%1ʲ%2")

to:

term = rsub(term, "([kɡx])([ieɛ])", "%1ʲ%2")

Chernorizets (talk) 08:10, 20 July 2023 (UTC)
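
The effect of the proposed change can be tried outside the module with a small Python stand-in. This is only a sketch: it uses Python's re.sub in place of the module's rsub helper, which is assumed here to behave like a plain regex substitution, and the two function names are made up for the comparison.

```python
import re


def palatalize_current(term):
    # Mirrors the current pattern: k, ɡ, x and l are palatalized before i, e, ɛ.
    return re.sub("([kɡxl])([ieɛ])", "\\1ʲ\\2", term)


def palatalize_proposed(term):
    # Mirrors the proposed pattern: l is removed from the consonant class.
    return re.sub("([kɡx])([ieɛ])", "\\1ʲ\\2", term)


for term in ["lɛʃta", "lipa", "kit"]:
    print(term, palatalize_current(term), palatalize_proposed(term))
```

With the proposed pattern, "l" before a front vowel is left untouched while the velars are still affected; whether the velars should be palatalized at all is a separate question.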

Non-exhaustive list of problem areas with the module's rules

@Benwing2, per our discussion on your talk page, here are some issues I've identified with the Bulgarian pronunciation module. A lot of them are a straightforward application of the information in Bulgarian phonology and its reference sources. Let me know if I should provide additional sources. @Kiril kovachev, Benwing2 has indicated that you're a Bulgarian speaker with the requisite permissions to modify this module, so this discussion might be of interest to you as well.

Issues:

  • the vowel list - local vowels = "aæɐəɤeɛioɔuʊʉ" - does not conform to descriptions of Bulgarian phonetics I've seen before. The core phonemes are "aɤɔuɛi". Unstressed "a" and "ɤ" have the allophone "ɐ", and unstressed "ɔ" and "u" have the allophone "o", for a total of 8: "aɤɔuɛiɐo". The set "æeʊʉ" is ungrammatical in the standard language. This affects the test case яйце́ [jæjˈt͡sɛ], which ought to be [jɐjˈt͡sɛ].
  • related to the above, the following rules are incorrect:
    • term = rsub_repeatedly(term, "([ʲj])[aɐə](" .. non_vowels_c .. "-[ʲj])", "%1æ%2"), featuring æ.
    • term = rsub_repeatedly(term, "([ʲj])u(" .. non_vowels_c .. "-[ʲj])", "%1ʉ%2"), featuring ʉ.
    • term = rsub(term, "([ʃʒ])ɛ", "%1e"), featuring e.
    • the reduce_vowel function uses the rule: rsub(vowel, "[aɔɤu]", { ["a"] = "ə", ["ɔ"] = "o", ["ɤ"] = "ə", ["u"] = "ʊ" }), which is incorrect. The mapping should be { ["a"] = "ɐ", ["ɔ"] = "o", ["ɤ"] = "ɐ", ["u"] = "o" }. This affects several test cases.
    • fixing the vowel reduction rule also makes the following rule redundant: term = rsub(term, "ʊ(" .. non_vowels_c .. "*" .. accents_c .. ")", "u%1").
  • palatalization is over-applied in a way that is absent from the standard language. This entire rule is incorrect: term = rsub(term, "([kɡxl])([ieɛ])", "%1ʲ%2"). Consonants do not become palatalized before "i" and "ɛ" except in certain Eastern dialects. This affects several test cases.
  • this rule is incorrect: term = rsub(term, "ij(" .. non_vowels_c .. ")", "iː%1"). No such change occurs in the standard language. This affects some test cases.
  • some Expected values for test cases are incorrect (besides those mentioned above already):
    • в о́фис [ˈf‿ofʲɪs] - the "f" in "ofis" should not be palatalized, and the vowel is not "ɪ". As for the ligature indicating the way this is pronounced as a single word - it's a nice-to-have, but after the other issues have been fixed.
    • а̀бдики́ращ [abdiˈkʲirəʃ(t)] - I understand that it's testing the grave accent, but this is artificial. The word is pronounced [ɐbdiˈkirɐʃt]. The word-final "ʃt" is actually meaning-bearing, marking this word as the present active participle "abdicating", as opposed to [ɐbdiˈkirɐʃ] (no "-t"), which means "you abdicate" or "you're abdicating". In view of that, I'd rather we nix the rule: term = rsub(term, "([sʃ])t#", "%1(t)#"). Students of Bulgarian can pick up on the colloquial simplification of word-final "ʃt" when they've mastered the standard language.
    • във Фра́нция [vɤ‿ˈfrant͡sijə] - this is not normative pronunciation. Ligature aside, the actual test result vɤf ˈfrant͡sijə does correctly reflect the fact that the "f" is geminated, rather than single.

Hope that helps,

Chernorizets (talk) 04:40, 21 July 2023 (UTC)
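
For reference, the vowel reduction mapping proposed in the list above can be sketched in a few lines of Python. This is not the module's reduce_vowel function, only an illustration of the proposed mapping; the function name is hypothetical, the input is assumed to be a phonemic string with ˈ placed before the stressed syllable, and secondary stress is ignored.

```python
# Proposed mapping from the list above: a, ɤ -> ɐ and ɔ, u -> o when unstressed.
REDUCTION = {"a": "ɐ", "ɤ": "ɐ", "ɔ": "o", "u": "o"}
VOWELS = "aɤɔuɛi"


def reduce_unstressed(phonemic):
    """Reduce every vowel except the first one after the primary stress mark."""
    out = []
    stress_pending = False
    for ch in phonemic:
        if ch == "ˈ":
            stress_pending = True
            out.append(ch)
        elif ch in VOWELS:
            if stress_pending:
                out.append(ch)  # the stressed vowel keeps its quality
                stress_pending = False
            else:
                out.append(REDUCTION.get(ch, ch))  # ɛ and i are never reduced
        else:
            out.append(ch)
    return "".join(out)


print(reduce_unstressed("saˈɫata"))
print(reduce_unstressed("pɔˈsɔɫstvɔ"))
```

Because ɛ and i are absent from the mapping, they pass through unchanged in every position, which matches the claim that only the four back and central vowels reduce.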

@Chernorizets Thank you – I've been wondering for a small while, as I recently started to record audio for Bulgarian words, whether there might be an error with the pronunciations with palatalisation, e.g. on самоле́т: rendered as [səmoˈlʲɛt]; can you confirm this is the same as your description ("palatalization is over-applied in a way that is absent...")?
Although I unfortunately do not have the permission to edit the module, I can still give my input for a few points; unfortunately, as you have certainly figured, I am not as able as you in the IPA and so I didn't fully clock these discrepancies until you pointed them out; I just haven't been paying enough attention until now to see them. I mostly just read the IPA and agree that it seems mostly right, so I apologise for my long time of inattention.
  • The point about the vowel inventory I definitely also corroborate; also, you can see the discussion from 29 March 2020 in which this point on æ was discussed, but unfortunately I was not there, nor yet mature enough at that time, to give any input on it. But now, I do agree we should not be using that symbol at all.
  • I had not up until now paid attention to the [ʃ(t)] rule, given that we very rarely transcribe present participles, where this kind of difference in the pronunciation of the (t) is semantically significant; per your identification this should definitely be corrected since, as you say, this is a natural clipping by native speakers rather than a normative feature of the standard pronunciation.
  • The testcases using в and с: is it right to say that these almost clitic-type pronunciations, which effectively prepend to the subsequent word, are the only cases where this kind of tie should be applied? In this case, a substitution to replace these single words when preceding another word with [f‿] and [s‿] could be a simple implementation.
Separately, this 2004 study, mentioned in the discussion above, also goes into good depth on various other processes, and describes the differences between standard and regional dialects; e.g. word-final devoicing (do we want to represent this, and is it systematic/consistent among speakers? It appears to apply to my usage, so perhaps) such as /'pro.sto/ -> ['pro.sto̥]. Do we want to implement changes in accordance with this and other studies?
I would like to know if you think it's a good idea to generate both Eastern and Western pronunciations for words, given that some of the rules we are now removing actually do apply to Eastern dialects, e.g. palatalisation is the norm for some of those. These changes are not really with respect to any norm for pronunciation, as this type of speech is "sub-standard", but we could still make an effort to select a representative Eastern phonology to contrast against the standard.
Additionally, I'm wondering how feasible it would be to make an automatic hyphenator for Bulgarian? For some languages, like Spanish and Italian it seems, Ben has already written a good and simple program for this, and I'd like to do the same, but for Bulgarian, preferably also avoiding the complication of TeX hyphenation patterns that most, if not all, hyphenation programs for Bulgarian currently use. This would cut out the difficulty of such a big program as that TeX pattern parser appears to be, judging from other open-source versions of it. I believe that modern hyphenation rules have been relaxed to allow hyphenations that disregard morpheme boundaries, so this should be able to be done with minimal edge cases.
Thanks again for your input, Kiril kovachev (talk) 21:35, 21 July 2023 (UTC)Reply
Hi @Kiril kovachev, thanks for the thoughtful reply! Let me try to address your points in order.
The standard pronunciation of самолéт is [sɐmoˈlɛt], so yes - it's an example of the incorrect IPA pronunciation which features the palatalized "l" variant. My previous post on this talk page focused narrowly on the pronunciation of "l", until I realized there were further problems. The rule is that palatalized consonants like [lʲ] occur before iotated vowels like "ja" (я), "ju" (ю) and "jo" (ьо), but not before the front vowels "ɛ" and "i". As a speaker of an Eastern Bulgarian dialect, who was also a stage actor for a while, I had to take accent reduction classes in order to eliminate things like [lʲɛ] from my own speech, so I could avoid them on stage.
Regarding the combined pronunciation of the prepositions "в" and "с" which could be expressed via a ligature - sure, that's technically more correct, but considering that we have few multi-word Wiktionary entries where this rule would operate, I don't see it as high on the list of priorities as fixing non-existent vowels, non-existent palatalization, etc. I don't know off the top of my head the full rules for "clitic-type" joined pronunciations in Bulgarian, and I likely won't have the time to research them in the near future.
Word-final devoicing applies to consonants, not vowels, and is a standard feature of Bulgarian. Voiceless vowels are very rare in the world's languages, and not taught in any Bulgarian textbook I've seen. My overall desire is for us to stick to the presentation of relevant material in Bulgarian phonology, which is also what people get taught in school. I'd rather we defer dialectal and personal variants to a separate resource (e.g. a Wikipedia article), for people with a deeper interest in the full breadth of phonological processes in contemporary Bulgarian beyond just the normative ones.
I apply the same reasoning to whether we should include dialectal pronunciations. As a descriptivist, I firmly believe that no one dialect of a language is "better" or "more worthy" than another. However, I think our primary goal should be to achieve sufficient coverage of the standard language on Wiktionary, so that it does function as a reference dictionary, and so that it's useful to learners of Bulgarian. There are still a lot of basic words missing. Bulgaria anyway has a tradition of creating specialized dictionaries for dialectal forms, which accurately reflect the fact that e.g. "Eastern" is a non-homogeneous grouping of dialects with individual differences.
Finally, on the topic of hyphenation - I apologize, but I'm not sure whether you mean hyphenation in compound words (e.g. "морално-нравствен" vs. "природонаучен"), between syllables in Latin transliteration, or something else. If there are clear orthographic rules that would e.g. show up on an exam of Bulgarian, then they would be within scope for improvements to this module. I'd only suggest that we prioritize the list of potential improvements (maybe a subpage?) and do them in that priority order. I'd start by fixing the incorrect vowels and consonants :)
Thanks, and sorry for the lengthy reply,
Chernorizets (talk) 22:28, 21 July 2023 (UTC)
@Chernorizets It's quite alright, rather than anything else thank you for your long answer, it definitely helps to be thorough for this kind of thing.
Of course the ligature thing is just a side idea as you mention, not overly important but I figured I would at least pose a tentative idea there for how it could be implemented. I'll tell you that the way I think of в and с (but not във and със): as being 0-syllable words, which cannot possibly be pronounced on their own as they have no vowel, or syllable nucleus for that matter. On the other hand, every other Bulgarian word has at least one vowel, as far as I'm aware, so it should not be possible for others to have this behaviour that I described; I think. I'm confident that if we both can't think of anything, that's a good sign...
Also, with reference to the "voiceless vowels" I mentioned: those are really a matter of realisation, it seems, and are not really properly part of the standard speech (unlike languages where they are standard), but in this case the devoicing does apply to vowels, just not in the same system as consonants. The example with просто is one realisation, in which the speaker happens to say no obvious vowel sound at the end of the word, although their lips still move into the /o/ shape; as far as I understand. Anyway, probably this issue is not significantly noteworthy, so thanks for this input, I guess there is no need to deviate into obscure or theoretical linguistics and the Bulgarian phonology page is a better grounding as you outlined.
By hyphenation (confusingly-named), I really mean syllabification, in which we turn a word like просто into про-сто or прос-то or whatever. There are rules concerning where syllables can be broken, formulated by БАН as far as I know, and this topic is apparently taught to students in Bulgaria from as young as first year—see the discussion above where Martin123xyz describes this same thing. We would need to write a function to apply all of the syllabification rules to a word so that it generates a correct syllabification. We do this thing already for other languages, and some editors may well have already input such hyphenations manually on some Bulgarian entries, so I definitely think it'd be a worthy addition if we can make it happen.
Although it's gotten late again for me, and although I can't edit the module directly, I'll see tomorrow if I can copy the code locally and first of all implement all of the suggestions you listed out here. Then if you want we can move them to a subpage on here, tick off which ones are achieved, and proceed from there... Your advice has been stellar, perfectly clear and precise, so we should have no difficulties just fixing the relevant lines in accordance with those; we just need Ben or another privileged user to port them over if all is good. Sorry for my long delay, Kiril kovachev (talk) 19:11, 22 July 2023 (UTC)Reply
Hi @Kiril kovachev, thanks for offering to try some of the proposed changes out! I'd like to learn how to do that too - I'm still very new to how module changes are tested in sandboxes before they're made official. If there's a page that already describes that, I'd be happy to RTFM :) I'm a developer - not a Lua one, but that's fixable.
Here are some test cases that come to mind for the problem areas I've raised. Note that I don't use the vowel fronting IPA combining mark, so the resultant IPA might need some massage. (I don't think we should indicate vowel fronting, but that's a discussion for another day).
  • unstressed vowel reduction
    • сала́та → [sɐ'ɫatɐ]
    • шега́ → [ʃɛ'ga] - no reduction of "ɛ" after "ʃ"
    • жена́ → [ʒɛ'na] - no reduction of "ɛ" after "ʒ"
    • инти́мен → [in'timɛn] - no reduction of "i" in any environment
    • посо́лство → [po'sɔɫstvo]
    • ъ́гъл → ['ɤgɐɫ]
    • усу́квам → [o'sukvɐm]
  • no palatalization before "е", "и". For the "soft L" allophone that occurs before those two, I would just use IPA [l]
    • ле́ща → ['lɛʃtɐ]
    • липа́ → [li'pa]
    • океа́н → [okɛ'an]
    • меки́ца → [mɛ'kit͡sɐ]
    • ла́гер → ['ɫagɛr]
    • маги́я → [mɐ'gijɐ]
    • хем → [xɛm]
    • химн → [ximn]
  • no uses of any of [æeʊʉə] - I'm getting a bit lazy, but you can identify a list of words from the rules.
HTH,
Chernorizets (talk) 21:57, 22 July 2023 (UTC)
@Chernorizets I too, would like to know if we have an "official" procedure for testing things, but the current "workflow" I'm used to when editing modules is:
  1. copy+paste the current module code into a page Module:User:Kiril kovachev/(module name)
  2. copy+paste the template wrapper for the module, as Template:User:Kiril kovachev/(template name) changing the "invoke" line to point to the new module
  3. start editing, and you can visually test the module's behaviour by calling {{Template:User:Kiril kovachev/(template name)}} with the desired arguments.
I don't know how to set it up to have check/error marks, but still you can have the table of expected and actual results side-by-side in a separate sub-page somewhere. I'm also not a Lua user, so some of the changes I make are a little experimental, but fortunately it's not too complicated to learn, I think.
Now, I'm very sorry, because although I said I would make these changes as discussed yesterday, I've been monstrously sick and so sadly I couldn't get anything done. I apologise greatly, but hopefully if things are better tomorrow, I may be able to figure it out. If you overtake me and do these things yourself, please let me know. Kiril kovachev (talk) 16:54, 23 July 2023 (UTC)Reply
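The side-by-side table of expected and actual results mentioned above could be sketched roughly as follows. This is only an illustration in Python rather than the module's Lua; `run_cases`, `format_table` and the stand-in `fake_transcribe` are hypothetical names, not part of any Wiktionary module.

```python
from typing import Callable, List, Tuple

Row = Tuple[str, str, str, str]


def run_cases(transcribe: Callable[[str], str],
              cases: List[Tuple[str, str]]) -> List[Row]:
    """Run each (term, expected) pair and record a check or error mark."""
    rows = []
    for term, expected in cases:
        actual = transcribe(term)
        mark = "✓" if actual == expected else "✗"
        rows.append((term, expected, actual, mark))
    return rows


def format_table(rows: List[Row]) -> str:
    """Render the rows as a plain-text table, one test case per line."""
    lines = ["term | expected | actual | ok"]
    for row in rows:
        lines.append(" | ".join(row))
    return "\n".join(lines)


def fake_transcribe(term: str) -> str:
    """Stand-in for the real IPA function (assumption: a lookup table)."""
    return {"хем": "xɛm", "химн": "ximn"}.get(term, "?")


rows = run_cases(fake_transcribe,
                 [("хем", "xɛm"), ("химн", "ximn"), ("липа́", "liˈpa")])
print(format_table(rows))
```

The third case fails on purpose, since the stand-in knows nothing about липа́; with the real function plugged in, the marks would show at a glance which proposed changes are already achieved.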
@Kiril kovachev, by all means focus on getting better! There is no rush.
Cheers,
Chernorizets (talk) 08:48, 24 July 2023 (UTC)
@Chernorizets It's all good, I've gotten back to most of my full strength by now. My belt needs a few new holes stapled in, though :-)
I ported the module and did the proposed changes (if I'm not mistaken); you can see it at: Module:User:Kiril kovachev/bg-pronunciation, see the test cases at Module:User:Kiril kovachev/bg-pronunciation/testcases, and the template at Template:User:Kiril kovachev/bg-IPA. You can view the differences from the original using this diff.
I think there are still some problems with the current scheme, however:
  • щастли́в maps to ʃtɐˈstɫif instead of ʃtɐˈstlif — or should it be ʃtɐˈslʲif as before? (The current module removes the t)
  • учи́лище becomes oˈt͡ʃiɫiʃtɛ instead of uˈt͡ʃil(ʲ)iʃtɛ (again, same ɫ problem; and what about that initial vowel?)
  • уби́йца is now oˈbijt͡sɐ, again the o instead of u is troubling me
  • жа̀р-пти́ца now becomes ˌʒɐr-pˈtit͡sɐ, with this a apparently being of the wrong quality – shouldn't it be a?
  • в о́фис: why is the output f ˈɔfis not coming out with an o, but with a ɔ?
  • ня́ко̀лко again shows the problematic vowel quality when an a is stressed: ˈnʲɐˌkɔɫko is surely wrong here.
Do you have any suggestion as to how to resolve these? I note the following three problems:
  1. ɫ is being generated in places where l(ʲ?) should be
  2. а is being given the wrong vowel quality (ɐ) in places where it should be stressed, as a.
  3. o is being placed instead of u (q: is this actually correct?)
Definitely the first two are absolute problems, I think, so we should look to resolve them. Also, I couldn't think of any test cases which originally used e or ʉ: the others are covered variously by several test cases: ə by countless examples, ʊ by тулу́п, æ by яйце́. What are your thoughts? Kiril kovachev (talk) 12:00, 24 July 2023 (UTC)Reply
P.S.: Could you lend your input as to the correctness of the following testcases?
  • щастли́в → ʃtɐˈslif
  • учи́лище → oˈt͡ʃiliʃtɛ
  • уби́йца → oˈbijt͡sɐ
  • тулу́п → toˈɫup?
After some thought, I feel more inclined to believe the o is in fact a correct realisation of у, but I found it difficult to believe when I just saw it output instead of the /u/ I expected. If that's right, then it would seem the great majority of test cases do indeed pass now. Kiril kovachev (talk) 12:26, 24 July 2023 (UTC)Reply
@Kiril kovachev I'm about to finally go to bed (6am over here) so I hope this is coherent :)
I feel you on the [o] vs [u] in unstressed positions - it's just the IPA symbol for the close-mid back rounded vowel - if you listen to the audio, it's somewhat between what we think of as "о" and "у" (Cyrillic). Compare that to the open-mid back rounded vowel [ɔ], and that's much more clearly the vowel "o" in stressed positions. When we pronounce an unstressed "u", our mouths "slacken" a bit and it lowers to [o], which is also why schoolchildren frequently misspell words beginning with an unstressed "u" with an "o", e.g. "оспех". Now, whether that would be confusing to some readers, I can't be sure. I'd hope that people using the IPA transcriptions as a real-life aid would have more than a passing familiarity with it, but it did bother me at first as well so idk.

As to the four test cases in your P.S. - they all look good to me. IMHO жа̀р-пти́ца should just use the acute accent on both words, as opposed to the grave on жар. I guess the point was to use the grave accent to mark secondary stress, but 1) I don't know whether "secondary stress" applies to Bulgarian the way it applies to e.g. English, and 2) the end result is still just regular stress. The alternative would be to make the vowel reduction logic sensitive to the presence of a grave accent, in which case it should be a no-op.

For the ня́ко̀лко test case, try seeing what happens if you remove the grave accent on the "o". In any case it should start with [nʲa] since the stress is on the "я". I just checked that "пясък" has the expected IPA on Wiktionary, so if it's not the grave accent, then you might have made an unintentional change.

Thanks for all your work and time,
Chernorizets (talk) 13:52, 24 July 2023 (UTC)
@Chernorizets I see, hope you sleep well then! Thanks for your help with the phonetic symbols, I was thinking along a similar line but I wasn't confident that thinking was correct. So it was that у becomes [o] when unstressed... Now, if that's the actual phonetic realisation, as opposed to [u], then that means we're doing it correctly, and the module should not need to accommodate any misunderstandings; we just need to do it right. I ported over a key a long time ago, at Appendix:Bulgarian_pronunciation, which is linked alongside every IPA production, so if any confusion does arise, I'm hoping users would flick over to that page and clear it up. (Also, thankfully, that page was taken from w:Bulgarian phonology, so it doesn't suffer any of the errors from the previous version in this module.)
I also don't know the validity of the grave/secondary stress in Bulgarian; to me it seems manifest in some words, but sometimes not, and I can't tell if this is just my analysis stemming from spending too much time in English thought or whether it truly does exist. I was asked by @Atitarev at one point to provide a pronunciation for a word, which I now forget, and when I provided a secondary stress, he too questioned whether it was legitimate or not. Do you have any input from since then, Anatoli?
With regard to the generation of a vs. ɐ, then even жа́р пти́ца — using the current stable module — gives ˈʒɐr ˈptit͡sə, so at least that much doesn't appear to be the fault of my changes, however ня́ко̀лко could be fixed by removing the grave (so it's confusing why this problem comes about, anyway). It seems 1) there's a problem when there are independent, stressed words, and 2) the grave produces problems as well. Otherwise, things are looking very good, I think. Kiril kovachev (talk) 15:57, 24 July 2023 (UTC)Reply
@Kiril kovachev agreed, things are looking pretty good! I wouldn't worry about ня́ко̀лко - it's broken in the current module as well, and besides it's a highly unnatural example, since it's trying to force the [ɔ] in a place that is pronounced as [o]. I would remove it as a test case probably.

I can't tell how common the grave accent is in Bulgarian lemmas. The actual entry for жар-птица (žar-ptica) follows the convention of Речник на българския език where only птица has a stress mark, and the grave is only used from inside the bg-IPA template. I think we should just override the IPA manually in that case for now. Interestingly, IPA(key): [ˈʒa̟r] gives exactly the outcome you'd expect, so this looks like a bug limited to multi-word contexts.
Thanks,
Chernorizets (talk) 22:15, 24 July 2023 (UTC)
@Kiril kovachev: Sorry, I am a bit lost in this discussion and I don't know the rules for secondary accents in Bulgarian (neither when nor how). Bulgarian online dictionaries, at best, provide two accents or no accent when it is a secondary, and as you know, they use grave for any accents; we at Wiktionary use acutes for the main accents, and a grave accent symbol would be right for the secondary accents. Anatoli T. (обсудить/вклад) 23:34, 24 July 2023 (UTC)
@Atitarev Alright, sorry to trouble you then, I don't mean to waste your time. We're trying to investigate the usages of secondary stress right now, so I only meant to canvass your help in case you knew about this — if you happen to learn something or anything like that, please do let us know, Kiril kovachev (talk) 08:46, 25 July 2023 (UTC)Reply
@Chernorizets Also, good news!! I extended the module to tentatively support hyphenation, with the output presently looking like this:
  • IPA(key): [soˈdʒu̟k]
  • Hyphenation: су‧джук
which I am very happy with! The hyphenation guidance is derived from this source, i.e. ultimately Институт за български език, judging by the 2012 standards of hyphenation. Note that what was achieved is not syllabification but hyphenation, i.e. determining adequate boundaries at which to separate the text, e.g. for typesetting line-breaks. I don't know if syllabification is in fact what we want, but at any rate we have a starting point here.
I made a new test case table, partly populated with your recent manual hyphenations; the only really wrong test result was доило (your manual hyphenation: до-и-ло; produced: до-ило). According to rule #12, "One letter does not stay alone", I think the generated output is in principle correct from a "hyphenation" standpoint, but arguably not from a "syllabification" one. I can see two main resolutions to the problem here:
  1. Use a more sophisticated algorithm, which will actually generate the correct syllabification.
  2. Allow the user to pass their own syllabification as additional parameters.
Either will work, but the rules to abide by for syllabification will be harder to come up with, I think, because the authorities don't comment on strict syllabification per se but merely the hyphenation mentioned above. For this reason I wouldn't be averse to just letting us manually override the hyphenation when using this template.
Finally, since the template is now covering a broader scope... would it make sense to rename to bg-pron?
Kiril kovachev (talk) 22:16, 24 July 2023 (UTC)
Hey @Kiril kovachev, great news indeed :)

Regarding hyphenation, it was my mistake to treat it as syllabification. According to this Wiktionary guide, the main function of a Hyphenation section is to describe valid ways of doing line breaks, i.e. сричкопренасяне. The guide says, somewhat confusingly, that both hyphenation and syllabification are supported, but as you correctly point out, they don't always overlap in Bulgarian. We could in theory do things like:
  • {{hyph|bg|дои|ло}}, {{hyph|bg|до|и|ло|caption=syllables}}
  • Hyphenation: дои‧ло, syllables: до‧и‧ло
But that seems like too much work. Personally, I'm fine with just hyphenation.

I know this might break your heart, but I'm actually against modifying the bg-IPA template to automatically do hyphenation. I'd put that in a separate template, e.g. bg-hyph, akin to the generic hyphenation template. As far as I can tell from your code, the hyphenate function doesn't depend on anything in the toIPA function, so it could easily live in its own module as well. There are a few reasons for this:
  • single-responsibility principle - a template should do one thing, and do it well. IPA transcription is, at best, only weakly related to hyphenation, so a single template to do them both makes it easier to accidentally break one while modifying the other, esp. if they share a module.
  • the entry layout guide shows that there are other optional headers whose recommended order comes between IPA and Hyphenation, e.g. Audio, Rhymes and Homophones. I'd actually like to see us use them more - Rhymes, especially, becomes possible once we've fixed the IPA transcription. If we have a combined IPA/hyphenation template, there's no way to add these subsections in the correct order per the layout guide. Also, entries that have audio files today will end up having the audio after Hyphenation.
Thanks,
Chernorizets (talk) 03:59, 25 July 2023 (UTC)
@Chernorizets Hahaha! This very well almost did break my heart, but it actually seems we're for the same sort of thing, I just did poorly to express what I actually meant above: I also don't think we should force the current template to subsume all these separate functions at once. Creating a new one for hyphenation would be excellent, but just in the same manner as {{es-pr}} it would also be good to have a 'master template' which can output any and all of the pronunciation components, in the correct order, by itself, which saves us the effort and error-source of re-duplicating all the template arguments for each individual feature of pronunciation.
Also, it's good to see that hyphenation on Wiktionary is what I thought it was — this makes our job a lot, lot easier, I reckon. What remains is to make sure the hyphenation is just about infallible before deploying anything, and deal with the remaining flaws we have in IPA generation. Kiril kovachev (talk) 08:55, 25 July 2023 (UTC)Reply
As for secondary stress, or "второстепенно ударение", I was able to find this book, with a telling abstract (note the opposite use of grave and acute accents compared to Wiktionary):
Вторично ударение имат някои сложни думи, напр. áвтомонтьо`р, но можем да произнесем и автомонтьо`р. По същия начин можем да произнесем Ле`нингра´д и си´лното`ков, но и Ленингра`д и силното`ков. Смята се, че второстепенно ударение въвеждат в думата и някои предпоставки от чужд произход, напр. а´нтикомуни`зъм, про´фаши`стки и т.н. Частиците еди- и годе-, с които се образуват някои сравнително редки неопределителни местоимения, също носят второстепенно ударение, напр. е´ди-ко`лко си, що´-го`де.
So it could be something we address in the fullness of time. I see a need for it, but I wouldn't call it pressing.
Thanks,
Chernorizets (talk) 04:09, 25 July 2023 (UTC)
I see, this is good. It seems the scope of secondary stresses is much smaller, so it should be rare that we absolutely need it; but as you say I would also wish for us to be able to support it sometime. Kiril kovachev (talk) 08:56, 25 July 2023 (UTC)Reply
@Kiril kovachev, Chernorizets Hi, apologies for being AWOL on this thread. I will read it all and respond shortly, but as for {{bg-pron}}, I would maybe recommend using {{bg-pr}} as the name because we have a similar thing for Spanish under {{es-pr}} and Italian under {{it-pr}}, and templates named {{CODE-pron}} are typically used for pronouns. (There's also {{pl-p}} and {{fi-p}} which are conceptually similar but use separate template params in place of inline modifiers, but {{pl-p}} will shortly be rewritten to work like the Italian and Spanish templates and will be renamed to {{pl-pr}}.) Benwing2 (talk) 03:57, 25 July 2023 (UTC)Reply
@Chernorizets The way this is handled in other languages is to have two templates, e.g. for Italian {{it-IPA}} just outputs IPA pronunciation while {{it-pr}} outputs pronunciation, syllabification and rhymes (and optionally audio and homophones, if specified) on separate lines. We could create {{bg-hyph}} if so desired. The advantage of this is that you can output things separately if you want, but you can also output everything together and avoid having to type the respelling multiple times. Benwing2 (talk) 04:04, 25 July 2023 (UTC)Reply
@Benwing2 a separate template that combines all (optional) pronunciation subsections with the ability to omit some of them sounds good to me - it just sounded like that's not what was being proposed. That said, I'd like to decouple the work of fixing this module from other objectives. My preferred work breakdown would look something like this:
  • (near-term) finalize fixes to the IPA transcription logic, and "deploy them to production"
  • (near-to-medium term) put the BG hyphenation logic in its own module and give it a template
  • (medium term) create a bg-pr template along the lines of it-pr et al. I don't know/think that many editors are providing rhymes or homophones for Bulgarian entries right now, so I feel like a module/template refactoring to bring everything into one umbrella could happen later.
Chernorizets (talk) 04:22, 25 July 2023 (UTC)
@Chernorizets My main concern about having the hyphenation logic in its own module would be possible duplication of code; typically, the IPA logic needs to hyphenate as well, and some of the code could be shared. Other than that, this is fine with me. Benwing2 (talk) 04:59, 25 July 2023 (UTC)Reply
@Chernorizets As per Benwing2, I also think it better to keep the module intact, but just have separate functions and templates for each concern it manages. This way we simultaneously preserve the single-responsibility principle you mentioned (i.e. each function only does one thing, even if the module still does many), but all the common features between the IPA and hyphenation can be shared in the single module. To have multiple templates interface with a single module should be just as easy as doing the same with multiple modules, so from that perspective things should also be no different.
Further, what do you see as being the final completion of our IPA code? Passing all the tests on the current testcases page? I intend to check out the problem we still have of generating incorrectly-reduced <a>s, but I don't know what else we need to attain before we can deploy, as you say. I am fine with making a new hyphenation module right away, though, that much is effortless, and if we want to move that logic to a new module, that would also be no trouble at all.
Also, @Benwing2: do you propose that we pass the hyphenation in some form to the IPA template, like to allow it to place stresses? I saw in the IPA code that you had written some logic to adjust the position of the accent, and... Although I would advocate we somehow "synchronise" these two functions, which are currently altogether separate, I fear they may not be totally compatible, because the IPA one should operate on syllables (am I right in saying so?) and the hyphenation one, on breakable boundaries. In this way one would not be a suitable input to the other. At the very least we can have both utilise the common properties in the module (although admittedly there isn't actually that much overlap—I initially considered that keeping the module as one would be better, but seeing that virtually nothing, except the rsub/rsub_repeatedly functions are actually used in both, is there purpose in keeping them together?) Kiril kovachev (talk) 09:18, 25 July 2023 (UTC)Reply
@Kiril kovachev, having taken a look at the es-pronunc module, and in line with Benwing's comments, here's a summary of what I think:

  • I'm ok with keeping the hyphenation and IPA logic in the same module, under different exported functions, just as you have it right now. Having that separation gives us the option of extracting hyphenation out, if for some reason we decided to do so down the line.
  • to finalize the IPA code, I only recommend that you change the expected values of the 5 failing tests to the actual values. The actual values are correct per the current implementation. If we modify the module later to better handle secondary stress (via grave accents), or support joint pronunciation of "в" and "с" with the following word, then we'd expect those tests to "break in a good way", but keeping them in a broken state until then doesn't give us useful information IMO.
  • to finalize the hyphenation code, I recommend:
    • that you change the "доило" expected result to the actual: до-ило. That is correct,
    • that you add test cases for words whose first syllable is a single vowel, like укор, упорит, осем, оценка. Since you can't have a single letter on a line by itself, we'd expect to get: укор, упо-рит, осем, оцен-ка. This online tool should be helpful.
    • related to the above, we should have a few test cases for monosyllabic words, to prove that we don't try to hyphenate them.
  • I leave the choice of whether to embark on a big adventure like {{es-pr}}, or just creating a {{hyph}} derivative, up to you :) You've spent more time than me on this, so it's only fair that you decide how much more time you'd like to invest.
Cheers,
Chernorizets (talk) 10:00, 25 July 2023 (UTC)
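The expectation for words whose first syllable is a single vowel can be illustrated with a small sketch. It is written in Python rather than the module's Lua, the function name `hyphenation_parts` is hypothetical, and it assumes the word has already been split into syllables and carries no stress marks. It only covers single letters at the edges of a word; a single-letter syllable in the middle (as in доило) is deliberately left alone here.

```python
from typing import List


def hyphenation_parts(syllables: List[str]) -> List[str]:
    """Merge single-letter edge syllables so that no letter stays alone."""
    parts = list(syllables)
    # A single-letter first syllable may not be left alone before a break.
    if len(parts) > 1 and len(parts[0]) == 1:
        parts[0:2] = [parts[0] + parts[1]]
    # A single-letter last syllable may not be carried over alone.
    if len(parts) > 1 and len(parts[-1]) == 1:
        parts[-2:] = [parts[-2] + parts[-1]]
    return parts


for syllables in (["у", "кор"], ["у", "по", "рит"],
                  ["о", "сем"], ["о", "цен", "ка"], ["хем"]):
    print("-".join(hyphenation_parts(syllables)))
```

A monosyllabic word comes back as a single part, which is the "we don't try to hyphenate them" behaviour asked for in the test cases.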
Thanks @Chernorizets for your quick input,
  • That's fine, I intend to keep the functions separate I think. However, I looked at the code again and they are practically entirely disjoint: one can easily be factored out if need/want be: do you think we should do this in the end?
  • I changed almost all the IPA testcases, the only remaining problem being жар-птица. I don't hate manually inputting the IPA on that entry right now, but we need to have this working before we can confidently change the test case IMO. This is just definitely wrong:
ʒa̟r-ˈptit͡sɐ (expected; also, I removed the grave)
ʒɐr-pˈtit͡sɐ (actual)
I don't know about you, but I'm not dramatically fussed about the placement of the stress symbol, but the wrongly-reduced vowel is a problem, however rare.
I'm looking into the cause right now, but just letting you know why there's still that error on the test page.
  • Hyphenation-wise, thankfully after adding your proposed test words and a few more monosyllabic ones, all of them hyphenated according to the rules, so we're pretty much good to go there. There are two further concerns which I'm facing with the hyphenation right now:
    • Check whether it works properly for multi-word terms. I tested it for човек и природа (čovek i priroda), and it worked well there. (And principally I expect it to work in general because the spaces should serve as boundaries that delimit where the regular expressions are able to hyphenate, but I believe we need some more assurance.)
    • There is probably some gnarly bug when the term already has a hyphen in its spelling like жар-птица. I'm just splitting on hyphens when passing into the hyphenation module, so that's bound to be a problem...
Anyway, I made {{bg-hyph}} for now if you want to check it out. My {{Template:User:Kiril kovachev/bg-IPA}} only outputs IPA now, so we're back to the original intent of the template.
  • I'm more than happy to go the whole mile, as long as I can learn the proper way to handle all the other pronunciation features, like rhymes and so on. All in good time, though!
  • Also, we'll have a lot of work to do when converting all the pages that currently use {{bg-IPA}} if we want to migrate everything automatically to {{bg-pr}} or whatever—and not because moving the template would be hard at all—but because the existing audio, hyphenation, etc. would need to be accounted for and duplicates deleted. Additionally, for words with multiple pronunciations, such as those that differ dialectally, we'll need a way to specify both together, in the correct order... overall that's a late-game problem for us to deal with later x)
At any rate: I'm incredibly grateful for your continued support & advice, and I hope you'll stay around as we get round to the other bits. Kiril kovachev (talk) 11:55, 25 July 2023 (UTC)Reply
P.S. Am I wrong, and ʒɐr-pˈtit͡sɐ is actually valid? I'm starting to doubt which is meant to be the right one. Kiril kovachev (talk) 12:04, 25 July 2023 (UTC)Reply
@Chernorizets Alright, I believe I found the cause of the bug, although I would be grateful for your thoughts on this find:
There was a line in the code that looked like:
-- /a/ directly before the stress is [ɐ].
 term = rsub(term, "a(" .. non_vowels_c .. "*" .. accents_c .. ")", "ɐ%1") 
If I'm not mistaken, this process is now redundant, because all unstressed /a/s reduce to ɐ under the new rules, and there's a line directly under that which handles the generic reduction of vowels. Does removing this seem like it might affect other words, or is it truly redundant like I think it is? (I ask theoretically, because the testcases were entirely unaffected. Words like ами́ and ако́ also don't change.)
We can actually add the grave back in, which produces this pronunciation: IPA(key): [ˌʒa̟r-pˈtit͡sɐ]. If this is what's intended, then we've got it down, using a grave accent. However, we may want a way to signify a non-reduced vowel without giving it an IPA accent mark. I don't know. Kiril kovachev (talk) 12:21, 25 July 2023 (UTC)Reply
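To make the effect of the quoted rule concrete, here is a rough Python equivalent of that one substitution. The real module is Lua, and the definitions of `NON_VOWELS_C` and `ACCENTS_C` below are assumptions made for illustration (a negated vowel class and the two IPA stress marks); the module's actual character classes may differ.

```python
import re

VOWELS = "aɤɔuɛiɐo"
NON_VOWELS_C = "[^" + VOWELS + "]"  # assumption: anything that is not a vowel
ACCENTS_C = "[ˈˌ]"                  # assumption: primary and secondary stress marks


def reduce_a_before_stress(term: str) -> str:
    """/a/ directly before the stress becomes [ɐ], as in the quoted rule."""
    return re.sub("a(" + NON_VOWELS_C + "*" + ACCENTS_C + ")", r"ɐ\1", term)


print(reduce_a_before_stress("ʒar-pˈtit͡sa"))
print(reduce_a_before_stress("ˈʒar"))
```

Under these assumptions the hyphen counts as a non-vowel, so the rule reaches across the word boundary in жар-птица and reduces the /a/ of жар, while a stand-alone жар with its own stress mark is untouched. That would be consistent with the bug showing up only in multi-word contexts.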
@Chernorizets Hello again both. Sorry to spam so much. But I am announcing the latest development, which is an effort to create a rhyme generation function as well. I don't know if I'm thinking far too simplistically about this, but it seems like the rhymes can just be generated from the IPA. I made {{Template:User:Kiril kovachev/bg-pr}} which displays this and looks like this:
  • IPA(key): [ˈprimɛr]
  • Rhymes: -imɛr
  • Hyphenation: при‧мер
@Benwing2, as I know you've handled rhymes before, could you please advise whether there is any flaw in how I'm currently generating the rhymes?
  1. Get the IPA as input;
  2. Find the position of the stressed syllable, either by checking directly for the primary stress, or assuming it's at the start for monosyllabic terms. (For polysyllabic where it isn't indicated, I don't know if I should return nothing or an error)
  3. Scan along until I find the vowel (i.e. ignore the whole onset of the syllable)
  4. Return the entire rest of the word after that.
The idea is that this yields the stressed vowel plus everything after it to the end of the word, so that any other word with the same stressed vowel and the same remainder shares this rhyme. There may be a fundamental problem with this, but I'm not able to see it without some more wisdom and experience.
Anyway, I set up a number of test cases for this also, which are fortunately looking good. I would appreciate any ideas for what more to test here: multi-word terms? Should those just be ignored altogether?
Tomorrow, I'll look towards incorporating optional audio and homophones as parameters to the would-be {{bg-pr}}, if that sounds good to you both. Kiril kovachev (talk) 20:50, 25 July 2023 (UTC)Reply
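The four rhyme steps listed above can be sketched as follows. This is a Python illustration, not the module's Lua code; the function name `rhyme` is hypothetical, the vowel set is the reduced inventory from this discussion, and returning nothing for an unstressed polysyllabic input is just one of the two options raised in step 2.

```python
from typing import Optional

VOWELS = "aɤɔuɛiɐo"


def rhyme(ipa: str) -> Optional[str]:
    """Return the stressed vowel plus the rest of the word, or None."""
    stress = ipa.find("ˈ")
    if stress == -1:
        # No stress mark: only accept monosyllables, starting from the beginning.
        if sum(ch in VOWELS for ch in ipa) != 1:
            return None
        start = 0
    else:
        start = stress + 1
    # Skip the onset of the stressed syllable until the first vowel.
    for i in range(start, len(ipa)):
        if ipa[i] in VOWELS:
            return ipa[i:]
    return None


print(rhyme("ˈprimɛr"))
print(rhyme("xɛm"))
```

Multi-word input is not handled; a space would simply be carried along into the result, so such terms would need to be rejected or split beforehand.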
Hi @Kiril kovachev, let's split further discussions on hyphenation, rhymes and the uber-template off of this thread, and into their own threads on this talk page. Sorry for not suggesting this sooner. I'd like to keep this conversation focused on fixing bugs vs. new features. It's becoming hard to follow the discussion, esp. for people besides you and me.

I'm happy to continue reviewing your new feature work, but I'll abstain until your fixes to IPA have been incorporated into the mainspace module. IPA is broken for most Bulgarian entries, today. I don't want the fixes delayed any further.

Let's not change the rule that would fix жар-птица in this iteration, and keep that as a failing case. I can't easily estimate the impact of removing that rule, so it's safer to attack it in a subsequent change.
Thanks,
Chernorizets (talk) 23:14, 25 July 2023 (UTC)
@Benwing2 In reference to supporting an "all-star" pronunciation template that features multiple calls to the different functions: is there any performance or other downside to calling require for the language object, script object, parameters module, etc. at the top level, instead of dynamically loading them when the functions are called? I would like to try to factor out some of the copied code I created earlier, specifically to include the hyphenation and IPA both in a show_all entry point for a hypothetical future {{bg-pr}}. They currently separately load these modules as they need them, but I'd rather move them into one place and either 1) use them globally or 2) pass them around once after they're loaded. Kiril kovachev (talk) 12:55, 25 July 2023 (UTC)
P.S. I now did a sort of "refactor" of the main entry points of the module, and I can't tell whether it is better or worse: diff. The idea was to avoid copying the same argument-processing code and to re-use the parts of the code which format the pronunciation information from our internal representation to their textual form to display. These take the form of the format_ipa, format_hyphenation, and format_rhymes functions, eliminating some of the duplication. I also took the lines:
  • local lang = require("Module:languages").getByCode("bg")
  • local script = require("Module:scripts").getByCode("Cyrl")
and put them at the top so that each function could use them. If you advise otherwise for performance or other concerns, it would be easy to revert this, but it seemed to me a good idea to avoid re-writing this commonality several times. Kiril kovachev (talk) 20:58, 25 July 2023 (UTC)Reply
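The shape of that refactor can be sketched as below. This is Python rather than Lua, and only the function names format_ipa, format_hyphenation, format_rhymes and show_all come from the discussion; their signatures and the plain-text bullet output are assumptions for illustration.

```python
from typing import List, Optional


def format_ipa(ipa: str) -> str:
    return "IPA(key): [" + ipa + "]"


def format_rhymes(rhyme: str) -> str:
    return "Rhymes: -" + rhyme


def format_hyphenation(parts: List[str]) -> str:
    # U+2027 HYPHENATION POINT separates the parts.
    return "Hyphenation: " + "\u2027".join(parts)


def show_all(ipa: str,
             parts: Optional[List[str]] = None,
             rhyme: Optional[str] = None) -> str:
    """Combined entry point: reuse the formatters, skip missing components."""
    lines = [format_ipa(ipa)]
    if rhyme:
        lines.append(format_rhymes(rhyme))
    if parts:
        lines.append(format_hyphenation(parts))
    return "\n".join("• " + line for line in lines)


print(show_all("ˈprimɛr", ["при", "мер"], "imɛr"))
```

Each single-purpose entry point can call just one formatter, while the combined one fixes the order of the lines in a single place.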
Summary of the above discussion
@Benwing2 @Chernorizets This is a summary of all the IPA-related changes that've been proposed and carried out in the above discussion, and which are now located at Module:User:Kiril kovachev/bg-pronunciation.
Effectively, Chernorizets had noted a few discrepancies between the output of the IPA function used up until now and the actual pronunciation used in the standard language, as noted in the original comment. What has been implemented IPA-wise in the code (seen in this diff) is:
  • The vowel inventory is reduced from [aæɐəɤeɛioɔuʊʉ] to [aɤɔuɛiɐo].
  • [a] and [ɤ] reduce to [ɐ] instead of [ə]; [u] reduces to [o] instead of [ʊ].
  • All vowel assimilation changes, besides fronting ([ʃʒʲj] + vowel), have been removed.
  • Palatalization as a sound change rule has been removed. (Removed term = rsub(term, "([kɡxl])([ieɛ])", "%1ʲ%2"))
  • Because palatalization is no longer applied, /l/ now remains /l/ (instead of becoming ɫ) before /ɛ/ and /i/, as well as before /ʲ/ as before. E.g. леща does not become /lʲɛʃta/ (as before) or /ɫɛʃta/, but /lɛʃta/.
  • Word-final /st/ and /ʃt/ do not become /(s)t/ and /(ʃ)t/.
  • /ij/ does not become /iː/.
All of these changes bring the pronunciation generated more in line with the pronunciation of the standard language. This also makes the module match the description found at w:Bulgarian phonology, as well as this IPA document, which also provides a passage of the Sun and the North Wind story, and with which the newly-changed module now complies.
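Two of the rules in the list above, the new vowel reduction and the treatment of /l/, can be illustrated with a short sketch. It is Python rather than the module's Lua, the function names are hypothetical, and it assumes a simplified input in which the stress mark sits immediately before the stressed vowel (not at the syllable onset). The reduction of unstressed [ɔ] to [o] is inferred from the посо́лство test case earlier in this thread; fronting is left out entirely.

```python
import re

# Unstressed reductions: [a], [ɤ] -> [ɐ]; [u], [ɔ] -> [o]. [ɛ] and [i] never reduce.
REDUCTION = {"a": "ɐ", "ɤ": "ɐ", "u": "o", "ɔ": "o"}


def reduce_unstressed(ipa: str) -> str:
    """Reduce every vowel that is not directly preceded by a stress mark."""
    return re.sub(r"(?<![ˈˌ])[aɤuɔ]", lambda m: REDUCTION[m.group(0)], ipa)


def velarize_l(ipa: str) -> str:
    """/l/ stays [l] before ɛ, i and ʲ; everywhere else it becomes [ɫ]."""
    return re.sub(r"l(?!ˈ?[ɛiʲ])", "ɫ", ipa)


for word in ("salˈata", "ˈɤgɤl", "usˈukvam", "ˈlɛʃta", "lipˈa", "ˈlagɛr"):
    print(velarize_l(reduce_unstressed(word)))
```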
Further, at Module:User:Kiril kovachev/bg-pronunciation/testcases, you can find that:
  • The failing test in the stable for ня́колко (njákolko) has been corrected.
  • The tests в Япо́ния (v Japónija), в о́фис (v ófis), and във Фра́нция (vǎv Fráncija) have been re-written, avoiding the tie between words, and retaining the geminated /f/ in the final case.
  • The test жар-птица (žar-ptica) is still broken, but a fix has been identified: we don't yet know whether it will affect other entries negatively, however.
Again, these changes are a positive shift from their previous state, and we have also added more test cases to thoroughly measure the phonological changes that have been implemented.
Dear Benwing: if these changes are of satisfactory quality, please would you consider porting the changes over to the production module? As Chernorizets noted above: "IPA is broken for most Bulgarian entries, today." It would be best that we can fix these problems as soon as possible, once you are up to speed with the changes made.
P.S. Separately, for further pronunciation-related developments as we have discussed here above (hyphenation, rhymes, etc.), new threads will be started so that we don't further pollute this thread with extraneous details. Ultimately, the IPA-related changes are of highest importance. Kiril kovachev (talk) 10:44, 27 July 2023 (UTC)Reply
Excellent summary, thanks @Kiril kovachev! @Benwing2, what are the next steps to get the IPA fixes into the mainspace module? As Kiril pointed out, some of the other topics that were raised during the discussion - on hyphenation, rhymes, etc - will be handled separately, so the scope of changes is just what's needed to fix IPA issues, as per my original request.
Thanks,
Chernorizets (talk) 21:03, 29 July 2023 (UTC)Reply

──────────────────────────────────────────────────────────────────────────────────────────────────── @Kiril kovachev Apologies, still haven't had a chance to read through this whole thread. As for calling require at the top level vs. when needed, it shouldn't make much difference. Within a single invocation of Lua (i.e. a single #invoke or frame:expandTemplate or frame:preprocess), modules are loaded only once, so calling require multiple times on a given module will not keep reloading the module. Since it's certain that you'll need to e.g. call require("Module:languages").getByCode("bg") at some point in the code, it shouldn't matter if you call it at top level or when needed. Benwing2 (talk) 06:02, 26 July 2023 (UTC)Reply

@Benwing2 Alright, no worries, please take it at your own leisure. That's good to hear; I was wondering whether there was any wastage if the modules were required multiple times, but if not then that's good. Kiril kovachev (talk) 07:46, 26 July 2023 (UTC)Reply
@Kiril kovachev, Chernorizets Thanks for the detailed summary above. All looks good to me. As for how to update the module, probably the best way is if one of you two is able to do it. User:Fenakhay recently changed the protection status of the module from only template editors + admins to autopatrollers, which I think is a lowering of the protection. Neither of you are autopatrollers but I suspect it wouldn't be an issue to grant this to you two; an alternative is to make one or both of you template editors, which would make sense if you're planning on updating other modules as well. Fenakhay or User:Chuck Entz can you remind me what "autopatroller" status means exactly and what privileges it confers? In the meantime I can make the changes if you want; what exactly needs to be done? Benwing2 (talk) 19:51, 1 August 2023 (UTC)Reply
Hi @Benwing2, if you don't mind making the changes, that would be great. There are essentially two changes to be made:
Thanks,
Chernorizets (talk) 21:46, 1 August 2023 (UTC)Reply
@Chernorizets@Kiril kovachev Done. Benwing2 (talk) 21:53, 1 August 2023 (UTC)Reply
@Benwing2 Amazing!! It's so much better now! Thank you so much! I'm so glad this got fixed!
Since the module now also supports hyphenation (to be included in a template sometime soon), would you mind also porting over the hyphenation test cases? Or did you have a specific reason not to port them over?
Thanks a million,
Chernorizets (talk) 22:41, 1 August 2023 (UTC)Reply
@Benwing2 Massive thanks for putting this through! Kiril kovachev (talkcontribs) 00:34, 2 August 2023 (UTC)Reply

Fixing жар-птица and other edge cases[edit]

@Chernorizets @Benwing2 Hello, in reference to the previously-identified rule:

 -- /a/ directly before the stress is [ɐ].
term = rsub(term, "a(" .. non_vowels_c .. "*" .. accents_c .. ")", "ɐ%1")

I want to make a case for why this is now redundant given the new IPA rules. This rule was presumably originally written so that /a/ before a stress would be converted into /ɐ/ instead of /ə/, but because our new rules reduce /a/ to /ɐ/ in all non-stressed positions in the first place, specially doing so before the stress is now unnecessary. Furthermore, as I pointed out in the above discussion, removing this rule is also a fix to the erroneous reduction of жар-птица, which refuses to take an accent and keep its /a/ prominent even if we give it one. (Expected: /ˌʒa̟r ˈptit͡sɐ/ or similar. Actual: /ˌʒɐr ˈptit͡sɐ/) I believe the use of "[^" .. vowels .. "]" as the pattern for non_vowels_c may be a problem, because the word boundary is ignored, meaning the reduction will still apply even if the stress falls on a word much later in the phrase. We can either:

  • Remove this rule; unless you can note any other exception, which I can't, it does nothing besides what we already now have.
  • Change non_vowels_c to "[^" .. vowels .. "#]" (with # being the word-boundary marker used internally).

... or both. What do you think? Kiril kovachev (talkcontribs) 11:33, 3 August 2023 (UTC)Reply
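The word-boundary problem described above can be reproduced with a small Python sketch. This is not the module's Lua code: the vowel set, the stress marks and the use of # as the word-boundary marker are simplified stand-ins, and the function name is hypothetical.

```python
import re

VOWELS = "aɤɔuɛiɐo"
ACCENTS = "ˈˌ"  # assumed stress marks, written at the start of the stressed syllable


def reduce_pretonic_a(term: str, stop_at_boundary: bool) -> str:
    """Apply 'a before the stress becomes ɐ' with either definition of non_vowels_c."""
    excluded = VOWELS + ("#" if stop_at_boundary else "")
    non_vowels_c = "[^" + excluded + "]"
    return re.sub("a(" + non_vowels_c + "*[" + ACCENTS + "])", "ɐ\\1", term)


term = "ˌʒar#ˈptitsa"  # '#' stands in for the internal word-boundary marker
print(reduce_pretonic_a(term, stop_at_boundary=False))  # rule reaches across the boundary
print(reduce_pretonic_a(term, stop_at_boundary=True))   # rule stops at the boundary
```

With the boundary marker excluded from the class, the /a/ of the first word is no longer matched by a stress that belongs to the following word, while a pretonic /a/ inside a single word is still reduced.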

Also, I implemented this in Module:User:Kiril kovachev/bg-pronunciation, and it worked as expected for the given word now. Kiril kovachev (talkcontribs) 11:36, 3 August 2023 (UTC)Reply
@Kiril kovachev Sounds good. It seems to me that you should remove the rule, and also remove non_vowels_c entirely since it's no longer used (and if it's needed in the future, whether it should work across word boundaries will depend on the use case). Benwing2 (talk) 00:30, 4 August 2023 (UTC)Reply
@Benwing2 Done. I also take it the fixme from that comment no longer applies, because the stress movement should now not be tied to reductions if I'm not mistaken? Kiril kovachev (talkcontribs) 00:36, 4 August 2023 (UTC)Reply
@Kiril kovachev Yes, that sounds right. I pushed the code live. Benwing2 (talk) 00:45, 4 August 2023 (UTC)Reply
Thanks! Kiril kovachev (talkcontribs) 12:03, 4 August 2023 (UTC)Reply

Syllabification from hyphenation[edit]

Hello, so we were earlier considering the difference between syllabification and hyphenation, and how we were content with keeping just hyphenation. I've no issue with this, but I think it would in principle be quite possible to derive syllables from the hyphenation, and if I've figured it out correctly, quite easily at that.

What we can guarantee, assuming "by induction" that the hyphenation is already correct, is:

  • There must be at least one vowel in every segment, based on rule 11: "There must be at least one vowel before and after the hyphen."

As syllables are counted by their nuclei, i.e. vowels, every segment of the hyphenation counts for at least one syllable. But under what circumstance can a segment count for more than one? Only if it contains more than one vowel. Under these two rules:

  • "In a sequence of two or more vowels, the first vowel stays before the hyphen. For example пре-одолея /pre-odoleya/ and прео-долея /preo-doleya/."
  • "In a sequence of three or more vowels, the last vowel stays after the hyphen. For example мао-изъм /mao-izam/ but not маои-зъм /maoi-zam/."

We can end up with hyphenations such as:

  • Hyphenation(key): пре‧одо‧лея
  • Hyphenation(key): мао‧изъм

where the conversion to syllables involves just ensuring that individually-pronounced vowels are separated, as well: Syllabification: пре‧о‧до‧ле‧я; Syllabification: ма‧о‧и‧зъм.

I propose the following conversions (of the already-hyphenated string) to achieve this, then:

  • Separate all vowel clusters into individual syllables with a hyphenation point in between, e.g. мао‧изъм → ма‧о‧изъм.
  • Then, separate all segments which start with a vowel from the subsequent consonant, in this case yielding ма‧о‧и‧зъм.

For преодолея (preodoleja), пре‧одо‧лея → пре‧одо‧ле‧я → пре‧о‧до‧ле‧я. If this is a sufficient minimal ruleset, then we'd have here a very nice conversion to syllables from the hyphenation. The main reason we'd want it is displaying syllable boundaries to the readers, but we could also use it as an input to the IPA function as a way of factoring in the syllable boundaries into the IPA symbols. Further, we could use it when categorising hypothetical rhymes, as those have a syllable parameter that puts the rhyme into a corresponding category depending on how long it is.
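The two conversions can be sketched in Python as follows. This is an illustration only, not the code referred to in the P.S. below: the function name is hypothetical, й is left out of the consonant class to mirror the айс-берг exception mentioned there, and the second step additionally requires a vowel after the consonant (an assumption added here so that a segment such as ек- is not split into е-к).

```python
import re

VOWELS = "аъоуеиюя"
CONSONANTS = "бвгджзклмнпрстфхцчшщ"  # й deliberately left out (see the lead-in)
SEP = "‧"


def syllables_from_hyphenation(hyphenated: str) -> str:
    """Turn an already-hyphenated word into syllables using the two steps above."""
    # Step 1: split every vowel cluster into single vowels.
    word = re.sub(f"([{VOWELS}])(?=[{VOWELS}])", "\\1" + SEP, hyphenated)
    # Step 2: detach a segment-initial vowel from a following consonant + vowel.
    word = re.sub(
        f"(^|{SEP})([{VOWELS}])(?=[{CONSONANTS}][{VOWELS}])",
        "\\1\\2" + SEP,
        word,
    )
    return word


for example in ("пре‧одо‧лея", "мао‧изъм", "айс‧берг"):
    print(example, "->", syllables_from_hyphenation(example))
```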

The question is whether you think this is a worthwhile addition: should we display both the Institute-for-Bulgarian-Language–prescribed hyphenation and the syllabification (assuming this process is adequate), or keep just one? Or, use it only internally for categorising rhymes, or for denoting syllable boundaries in IPA, or both... or neither? @Chernorizets @Benwing2

P.S. I coded it locally, and I got the following results:

  • ви-со-чи-на
  • сес-тра
  • плен-ник
  • пре-о-до-ле-я
  • ма-о-и-зъм
  • май-ка
  • айс-берг
  • ма-йор
  • фри-зьор
  • су-джук
  • над-жи-ве-я
  • по-ту-ри
  • звез-да
  • сприн-цов-ка
  • ше-ви-ца
  • про-фа-шис-тки

I had to make an exception for boundary + vowel + consonant → boundary + vowel + boundary + consonant when that 'consonant' happened to be й, because somehow а-йс-берг didn't seem to make much sense. Without more test cases, can't be certain, but so far it's looking somewhat promising, Kiril kovachev (talkcontribs) 21:17, 4 August 2023 (UTC)Reply

@Kiril kovachev I have implemented syllabification for several languages; for most languages other than English, the syllabification algorithm is similar. In most languages, doing this is a necessary step in the process of generating pronunciation, since at the very least the stress mark has to go at the beginning of the syllable. Essentially my algorithm was to place a syllable boundary before each CV cluster as well as between each VV cluster that doesn't form a diphthong, and then move the boundary leftward across consonants where the resulting CC or CCC to the right of the boundary can form a possible onset (with language-specific exceptions). This typically results, for example, in Cl and Cr clusters being grouped together. For the Slavic languages it's more complicated due to prefixes like на- and над-; it looks like you're already handling this by the examples су-джук vs. над-жи-ве-я, and it also means you sometimes need a mechanism to override the default syllabification (which I do by allowing the user to insert a . in the correct place in the respelling, which is respected by the syllabification algorithm). I didn't typically start with implementing hyphenation as a base because hyphenation is usually more complex than syllabification, but it sounds like it will work for Bulgarian. The issue of syllabification vs. hyphenation has come up before in the context of Spanish, for example; we didn't consider displaying both but if you want to do this, it might make sense to display only one line when the hyphenation and syllabification are the same. Benwing2 (talk) 22:18, 4 August 2023 (UTC)Reply
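A minimal Python sketch of the onset-based procedure described above, for illustration only: the onset list is hypothetical and far too short for real use, diphthongs and prefixes are ignored, and the user-supplied . override is not implemented.

```python
VOWELS = set("аъоуеиюя")
# Hypothetical, deliberately tiny list of permitted onsets.
ONSETS = {"стр", "ст", "тр", "пр", "бр", "кр", "гр", "др", "пл", "бл", "кл", "гл"}


def syllabify(word: str) -> str:
    """Place boundaries before CV / between VV, then extend onsets leftward."""
    vowel_positions = [i for i, ch in enumerate(word) if ch in VOWELS]
    boundaries = []
    for left, right in zip(vowel_positions, vowel_positions[1:]):
        # Start with the boundary before the CV (or between the two vowels).
        start = right - 1 if right - left > 1 else right
        # Move it leftward while the consonants to its right form a listed onset.
        while start - 1 > left and word[start - 1:right] in ONSETS:
            start -= 1
        boundaries.append(start)
    pieces, prev = [], 0
    for b in boundaries:
        pieces.append(word[prev:b])
        prev = b
    pieces.append(word[prev:])
    return ".".join(pieces)


for w in ("сестра", "пленник", "маоизъм"):
    print(w, "->", syllabify(w))
```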
Fortunately I haven't seen any examples where the prefixes are incorrectly handled, or at least not yet, but in hyphenation I know it's possible for them to get split apart since it doesn't respect morphology (the new rules in Bulgarian allow morphological hyphenations, but they're not required), so that may end up being a problem, too. Without having checked out syllabification, it strangely sounds like hyphenation is easier in Bulgarian than syllabification, and principally that's the reason I tried doing it first. Doing both is probably a little sore to look at, so I don't know if we'd even want to do that, but... anyway, at least as you say if they're the same it should be no issue to display them as one. Kiril kovachev (talkcontribs) 23:03, 4 August 2023 (UTC)Reply
@Kiril kovachev unfortunately it's not that simple. The main issue is that, in consonant clusters with more than 2 consonants, there are several valid hyphenations, but not necessarily several valid syllabifications. For example, the syllabification is про‧фа‧шист‧ки, whereas the hyphenation про‧фа‧шис‧тки yields the syllable "тки", which is invalid because you can't have two stops at the beginning of a Bulgarian syllable. I've seen other such examples - I can go collect them for you if that would help.

The textbook Съвременен български език explains the structure of Bulgarian syllables and the rules for dividing a word into syllables in Chapter 7 - "Сричка".
HTH,
Chernorizets (talk) 22:46, 4 August 2023 (UTC)Reply
@Chernorizets That's fine, I'll make sure to read that first and get informed, because I honestly didn't know there were rules to syllables in Bulgarian as well. I just thought we can roll with what we have from the hyphenation and just make some minor tweaks... and whilst that may still be true, if it is, it'd require a bit more processing than what I put just above. If not, though, I guess it would warrant an entire new function...
Don't worry about gathering examples yet if you don't want to, though - I'll try reading the text myself, so that you can spare yourself the bother. Kiril kovachev (talkcontribs) 23:08, 4 August 2023 (UTC)Reply
@Kiril kovachev sounds good. Btw, a low-priority caveat for both hyphenation and syllabification - in a small handful of Bulgarian words, the combination "дз" actually stands for the affricate /d͡z/, as opposed to the much more typical sequence of /d/ followed by /z/ (as in надзор (nadzor)). It is the reverse situation of "дж" which is almost always the affricate /d͡ʒ/, and we have to use the full-stop workaround to break it apart. We may want another workaround (maybe an underscore? some other character?) to indicate cases where "дз" is a single sound. You can find examples of such words at User:Chernorizets/bg-words-with-dz-affricate. Chernorizets (talk) 00:18, 5 August 2023 (UTC)Reply
@Chernorizets Does the sequence of /d/ + /z/ always occur at a syllable boundary? If so we could assume any tautosyllabic sequence of дз is an affricate, then you could write 'на.дзор' to indicate that дз should be an affricate. Benwing2 (talk) 00:23, 5 August 2023 (UTC)Reply
@Benwing2 considering that there are probably fewer than a dozen words in the standard language where dz is an affricate, I think it might be easier to just hand-tune their IPA transcriptions and hyphenations rather than encumber the pronunciation module with additional handling. And sorry for being confusing - надзор (nadzor, oversight) is an example of the overwhelmingly common case of /d/ + /z/. An example of the affricate is скръндза (skrǎndza, cheapskate). Chernorizets (talk) 04:23, 5 August 2023 (UTC)Reply
@Chernorizets Thanks for the example. I do think it would be better to support this in the module than simply specify that it's unsupported and require people to manually insert IPA. That adds extra work but also is likely to lead to badly formatted IPA. I can add this support if needed. Benwing2 (talk) 04:28, 5 August 2023 (UTC)Reply
@Benwing2 by "add the support", do you mean adding support for treating the sequence "дз" as the affricate /d͡z/ in words that have been appropriately marked? I'd be in favor of fixing the IPA bit so that {{bg-IPA}} can be invoked in a well-defined way to handle those situations. Please let me know what changes you have in mind, so I can evaluate the impact.

As to your question of whether "[...] the sequence of /d/ + /z/ always occur at a syllable boundary" - in the handful of cases we're talking about where it denotes a single affricate sound, it starts a syllable. It can be a syllable coda in certain foreign personal and place names, e.g. Polish Łódź which in Bulgarian is conveyed as Лодз (Lodz).

Outside of the minority of cases where "дз" represents an affricate, the /d/ and the /z/ belong to different syllables, i.e. the syllable boundary runs in-between them.
Thanks,
Chernorizets (talk) 03:47, 6 August 2023 (UTC)Reply
@Chernorizets So what I'm thinking is this: (1) the Bulgarian module has to (or should) do a full syllabification as part of the IPA generation process. At least, all other modules I've worked on work in this fashion. (2) We support the ability to manually specify the location of a syllable division by using . in the appropriate place; this gets respected by the syllabification algorithm. (3) Once the syllabification is done, all occurrences where дз does *NOT* represent an affricate will have a syllable boundary marker between them, so we just convert any occurrences of unseparated дз to an affricate. That way, for example, Лодз automatically has an affricate; скръндза might depending on how the default syllabification algorithm works, and if by default it gets syllabified as скрънд.за, you can write {{bg-IPA|скрън.дза}} to force a syllable division in the correct location, which will make дз an affricate; while надзор will get a default syllabification над.зор and not have an affricate. Benwing2 (talk) 08:27, 6 August 2023 (UTC)Reply
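Step (3) of this proposal amounts to a single substitution once syllable boundaries are in place. A minimal Python sketch, assuming the syllabified string uses . as the boundary marker; the function name and the output symbol are chosen for the example only:

```python
AFFRICATE = "d͡z"


def mark_dz_affricates(syllabified: str) -> str:
    """Replace every дз that is not split by a syllable break with the affricate."""
    # A boundary between д and з ("д.з") no longer matches the substring "дз",
    # so only unseparated sequences are converted.
    return syllabified.replace("дз", AFFRICATE)


for s in ("над.зор", "скрън.дза", "лодз"):
    print(s, "->", mark_dz_affricates(s))
```

Whether скръндза needs the manual . depends on the default syllabification, as noted above; the sketch only shows the final conversion.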
@Benwing2 I'd be inclined to get syllabification implemented for Bulgarian, if for no other reason than to fix the (apparently broken) situation where we don't actually correctly represent the sequence "дж" as the affricate /d͡ʒ/, but rather as /dʒ/ - see джудже (džudže). I suppose the same mechanism that would allow us to indicate a syllable boundary would work for the minority of cases where "дж" is, actually, /dʒ/ because the "d" and the "ʒ" belong to different syllables.

The syllabification-from-hyphenation idea could work, but it would require two passes:
  • breaking up hyphenated components with multiple vowels: e.g. упо-рит → у-по-рит
  • moving consonants between the resulting components so that phonotactic constraints of Bulgarian syllables are respected: e.g. про-фа-шис-тки → про-фа-шист-ки. I'm not sure whether to do this back-to-front or front-to-back.
Alternatively, we distill the rules from the textbook I quoted into a proof-of-concept module first, to run extensive tests and gain some ideas about edge cases. Either I or Kiril can help translate the relevant section. I actually think we might be able to do hyphenation from syllabification, since hyphenation effectively joins some syllables together.
Cheers,
Chernorizets (talk) 09:28, 6 August 2023 (UTC)Reply
@Chernorizets: Sounds good to me. Note that my implementations of syllabification for other languages generally move the consonant boundaries from right to left but maybe for Bulgarian the other way works better. Benwing2 (talk) 01:27, 7 August 2023 (UTC)Reply
@Benwing2 would you mind pointing me to a couple of such implementations? A Slavic, Germanic or Romance example would be great (or anything really with a Greek-derived alphabet because I can't read the rest lol) Chernorizets (talk) 01:38, 7 August 2023 (UTC)Reply
@Chernorizets Some examples:
  1. For Italian, see Module:it-pronunciation#L-578. Note that for many of these modules, there are two syllabification procedures, one that operates directly on the spelling (for use in the "hyphenation"/syllabification display) and one that operates on a partly processed phonemic output (for use in generating the phonemic output). In this module, the version that operates directly on spelling is entirely separate from the version (cited above) that operates on partly processed phonemic output; the version operating on spelling is here: Module:it-pronunciation#L-1030. In some later modules, I fixed this so there's one routine with some conditionals.
  2. For Spanish, see Module:es-pronunc#L-215. This handles syllabification either based on spelling or partly-processed phonemic output.
  3. For Portuguese, see Module:pt-pronunc#L-583. (This link might not work; for some reason, for me this module shows up with raw output rather than nicely syntax-highlighted output.)
  4. For German, see Module:User:Benwing2/de-pron#L-2006. This is a beast of a module, not done yet but mostly working.
  5. For Russian, see Module:ru-pron#L-1332. This is an older module and it mostly implements syllabification from right-to-left, but not completely.
Benwing2 (talk) 02:18, 7 August 2023 (UTC)Reply
@Kiril kovachev @Benwing2 I'm doing a prototype implementation (in Java) based on the textbook I linked to, and chiefly the rising sonority principle, as defined for Bulgarian: fricatives < stops/affricates < sonorants < vowels. I'm gonna show you some preliminary results, and I'll upload the code on GitHub sometime soon.
Things that need work:
  • judicious support for prefixes like без-. Not all words that start with the letters corresponding to a prefix have that prefix.
  • "в" (v) is special - it has higher affinity for vowels than some other fricatives
@Benwing2 thanks for the code samples - I haven't gotten around to them, but will do in the near future. I wanted to give this a go from first principles.
в --> в
с --> с
у --> у
о --> о
ѝ --> ѝ
аз --> аз
ти --> ти
той --> той
тя --> тя
във --> във
със --> със
принц --> принц
спринт --> спринт
глист --> глист
ами --> а-ми
ала --> а-ла
ако --> а-ко
уви --> у-ви
или --> и-ли
саламура --> са-ла-му-ра
барабан --> ба-ра-бан
сполука --> спо-лу-ка
щавя --> ща-вя
стрина --> стри-на
старицата --> ста-ри-ца-та
получените --> по-лу-че-ни-те
подобаващите --> по-до-ба-ва-щи-те
безименен --> бе-зи-ме-нен
изопачавам --> и-зо-па-ча-вам
койот --> ко-йот
майонеза --> ма-йо-не-за
пейоративен --> пе-йо-ра-ти-вен
майор --> ма-йор
воал --> во-ал
маоизъм --> ма-о-и-зъм
феерия --> фе-е-ри-я
воайор --> во-а-йор
миокард --> ми-о-кард
нащрек --> на-щрек
поощрявам --> по-о-щря-вам
защриховам --> за-щри-хо-вам
поощрителен --> по-о-щри-те-лен
джудже --> джу-дже
суджук --> су-джук
манджа --> ман-джа
калайджия --> ка-лай-джи-я
авджия --> а-вджи-я
бульон --> бу-льон
фризьор --> фри-зьор
кьопоолу --> кьо-по-о-лу
шедьовър --> ше-дьо-вър
гьозум --> гьо-зум
ликьор --> ли-кьор
сестра --> се-стра
пленник --> плен-ник
майка --> май-ка
преодолея --> пре-о-до-ле-я
звезда --> зве-зда
спринцовка --> сприн-цо-вка
царство --> цар-ство
профашистки --> про-фа-шист-ки
бързо --> бър-зо
малко --> мал-ко
партия --> пар-ти-я
гледка --> глед-ка
крачка --> крач-ка
цедка --> цед-ка
гланцов --> глан-цов
бездомен --> бе-здо-мен
откачвам --> от-кач-вам
нравствен --> нрав-ствен
мандраджия --> ман-дра-джи-я
мизансцен --> ми-зан-сцен
пепелник --> пе-пел-ник
пилци --> пил-ци
Chernorizets (talk) 09:41, 7 August 2023 (UTC)Reply
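For readers following along, the rising sonority principle can be sketched in a few lines of Python. This is not the Java prototype mentioned above, only a simplified illustration: every letter is treated as one sound (so дж, щ and ь are not handled properly), в gets no special treatment, prefixes are ignored, and all names are hypothetical.

```python
VOWELS = set("аъоуеиюя")

# Sonority scale from the comment above: fricatives < stops/affricates < sonorants.
SONORITY = {}
SONORITY.update({ch: 1 for ch in "фвсзшжх"})
SONORITY.update({ch: 2 for ch in "пбтдкгцч"})
SONORITY.update({ch: 3 for ch in "мнлрй"})


def sonority(ch: str) -> int:
    # Letters not listed (e.g. щ, ь) fall back to the stop class; a simplification.
    return SONORITY.get(ch, 2)


def syllabify(word: str) -> str:
    """Give each vowel the longest onset whose sonority rises strictly."""
    vowel_positions = [i for i, ch in enumerate(word) if ch in VOWELS]
    boundaries = []
    for left, right in zip(vowel_positions, vowel_positions[1:]):
        start = right
        # Extend the onset leftward while sonority keeps rising toward the vowel.
        while start - 1 > left and (
            start == right or sonority(word[start - 1]) < sonority(word[start])
        ):
            start -= 1
        boundaries.append(start)
    pieces, prev = [], 0
    for b in boundaries:
        pieces.append(word[prev:b])
        prev = b
    pieces.append(word[prev:])
    return "-".join(pieces)


for w in ("сестра", "пленник", "майка", "бързо"):
    print(w, "->", syllabify(w))
```

A word with a single vowel comes back unchanged, since there is no pair of vowels to place a boundary between.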
Here is the GitHub link. Note: this is still pretty crappy - I'm only sharing it as a conversation aid. Chernorizets (talk) 10:32, 7 August 2023 (UTC)Reply
@Chernorizets Sounds good. I'll take a look at your code. The issue with без- and similar prefixes can't be completely solved automatically; the best you can do is have reasonable, not overly complex defaults and require that anything not matching those defaults requires manual insertion of a . or similar to signal the syllable boundary. Benwing2 (talk) 11:22, 7 August 2023 (UTC)Reply
Are we sure about ми.зан.сцен, бе.здо.мен, а.вджи.я, сприн.цо.вка? Benwing2 (talk) 11:29, 7 August 2023 (UTC)Reply
@Benwing2 these are the examples I had in mind when I wrote that the code needs work. ми.зан.сцен is fine, бе.здо.мен satisfies rising sonority but should be без-до-мен w/ correct prefix handling. The other two show that "v" needs special love, which the textbook does point out. Sorry I didn't annotate which examples were broken. Chernorizets (talk) 12:01, 7 August 2023 (UTC)Reply
@Benwing2 I pushed a change that performs some fixups after the sonority analysis has been done, to either force certain consonant clusters in rising sonority apart, or force certain clusters that don't obey rising sonority together. I'm not entirely happy with having to list individual exceptions like that, but Bulgarian vowels don't have unlimited onset combinations, so I hope that some moderate-length lists can handle the majority of cases. For everything else, we could use the /./ signifier (not implemented yet in my code).
Some previously broken cases that work now:
авджия --> ав-джи-я
звезда --> звез-да
спринцовка --> сприн-цов-ка
бездомен --> без-до-мен
всъщност --> всъщ-ност
посвикна --> по-свик-на
бездна --> безд-на
I'll be adding more test cases to check for outliers. Eventually I'll run this on some wordlist (e.g. the one Kiril used for the anagram bot task) and spot-check examples.
TODO:
  • support for /./ to indicate a syllable boundary (+ test cases)
  • support for multi-part words separated by hyphens or spaces
  • prefix support (where not already handled by fixups)
Cheers,
Chernorizets (talk) 23:00, 7 August 2023 (UTC)Reply
@Chernorizets Sounds good, thanks for the update. Listing individual exceptions sounds OK to me; the Russian syllabification (which predates my work on the module) is handled entirely through a list of possible onsets, and your approach sounds cleaner. Benwing2 (talk) 23:21, 7 August 2023 (UTC)Reply
Thanks for all this, this is looking excellent. I think I shall try to translate the syllabification rules and put them somewhere visible on here, so that people can reference them in English if required. As I was recording some words just now I also found an interesting test case in взвод, what would that produce under your program? (I assume just one syllable, but since there are several вs I don't know what the specialness of в that you mentioned above will do to it.)
@Chernorizets @Benwing2
Separately, how easy would it be to go from either syllabification to hyphenation or vice-versa, compared to just generating each separately via their own algorithms? Given that the two rulesets are evidently a little bit divergent, maybe it'll just be easier to generate each individually? This would sidestep bugs in which one breaks because of a follow-through fault in the other, as well. Kiril kovachev (talkcontribs) 12:30, 8 August 2023 (UTC)Reply
@Kiril kovachev my code has an optimization to treat a word with a single vowel as one big syllable (which it is), so "взвод" would syllabify as itself.
My thinking of running hyphenation on top of syllabification:
  • first, check adjacent syllables and break up consonant clusters if needed. E.g. по-стро-я → пос-тро-я
  • then, make sure no vowel is on its own at the beginning or end of the word. E.g. пос-тро-я → пос-троя
  • handle 2 or more consecutive vowels word-medially. E.g. ма-о-и-зъм → ма-о-изъм
The following hyphenation rules (from the source you link to) apply to syllabification too:
  • A consonant between two vowels links with the second vowel. For example ви-со-чи-на /vi-so-chi-na/.
  • Two equal consonants are separated. For example плен-ник /plen-nik/.
  • The letter й /y/ between two vowels links with the second vowel. For example ма-йор /ma-yor/.
    it's just a consequence of it being a consonant.
  • When the letters дж /dzh/ denote a single consonant, then they are not separated. For example су-джук /su-dzhuk/ (not суд-жук /sud-zhuk/) but над-живея /nad-zhiveya/.
  • There must be at least one vowel before and after the hyphen.
    which is simply to say, don't do things like св-рака, which syllabification wouldn't do
  • No hyphenation before or after ь.
The only rule I'm not sure my syllabification impl already follows, and hence needs no extra work, is:
  • When a sequence of two or more consonants follows й /y/ then at least one consonant links with й /y/. For example айс-берг /ays-berg/ (not ай-сберг /ay-sberg/).
    I'll add some test cases for that.
Chernorizets (talk) 12:57, 8 August 2023 (UTC)Reply
@Kiril kovachev all that said, there are enough differences between syllabification and hyphenation for Bulgarian that I wouldn't push back too strongly against keeping them separate, so that we make it easier to vary one without breaking the other. I would assume, though, that there's a point in time when we'll stop tweaking both of these algorithms, which could be the point when we decide on "combining" them vs. not. I don't know Lua yet (but working on it), so idk how hard or easy this stuff would be in Lua-world - I was just curious about the possibility of deriving hyphenation from syllables.
Chernorizets (talk) 13:03, 8 August 2023 (UTC)Reply
No, it's all good, your suggestion is good; as long as we know there won't be any problems and the differences can all be accounted for, I guess it's fine to generate one from the other. The main argument for keeping them separate is that, if there's little performance difference anyway (which I assume to be the case), it would simply be easier to port your code once it's finished, without having to integrate hyphenation generation into syllabification for no noticeable change in output. However, your syllabification seems a lot more robust in its approach than the basic pattern-matching hyphenation script of mine, so I wonder if it may be more stable after all to base hyphenation off of your syllabification just for that reason. Minimise the weakest link and so on. I'll see if I can write a converter that implements the steps you laid out, and we can test that against the current hyphenation output too. Kiril kovachev (talkcontribs) 13:19, 8 August 2023 (UTC)Reply
@Chernorizets I wrote a function to convert from syllabification to hyphenation:
def hyphenation_from_syllabification(syllabification: str) -> str:
    word = sub_repeatedly("-дж", "-#", syllabification)
    word = sub_repeatedly("дж$|^дж", "#", word)
    word = sub_repeatedly(f"({vowels_c})-({cons_c})({cons_c.replace('ь', '')}+)", "\\1\\2-\\3", word)
    word = sub_repeatedly(f"({cons_c})({cons_c}+)-", "\\1-\\2", word)
    word = sub_repeatedly(f"^({vowels_c})-", "\\1", word)
    word = sub_repeatedly(f"-({vowels_c})$", "\\1", word)
    word = sub_repeatedly(f"({vowels_c})((-{vowels_c})*)(-{vowels_c})-", lambda m: m.group(1) + m.group(2).replace("-", "") + m.group(4), word)
    word = sub_repeatedly("#", "дж", word)
    return word
...which does basically what you suggested, except that it first converts дж in syllable-initial, word-initial and word-final positions into a single unit, as I do in the current Lua module. Otherwise the algorithm would attempt to separate the д and ж, as per your first step of breaking up consonant clusters. In all other cases, it moves a single consonant to the left of the break and keeps any remaining ones on the right.
Anyway, for every word in the above test case list, this script yielded the correct hyphenation as generated by the current hyphenation function, so that's a promising sign, I think. Kiril kovachev (talkcontribs) 19:35, 8 August 2023 (UTC)Reply
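The function above relies on sub_repeatedly, vowels_c and cons_c, which are not shown in the comment. The following self-contained version fills them in with assumed definitions (a plain "substitute until nothing changes" helper and simple letter classes) so that it can be run; the function body itself is copied unchanged, and the helper definitions are guesses rather than the original ones.

```python
import re

# Assumed definitions; the comment above does not show them.
vowels_c = "[аъоуеиюя]"
cons_c = "[бвгджзйклмнпрстфхцчшщь]"


def sub_repeatedly(pattern, repl, text):
    """Apply re.sub until the text stops changing (assumed behaviour)."""
    while True:
        new_text = re.sub(pattern, repl, text)
        if new_text == text:
            return text
        text = new_text


def hyphenation_from_syllabification(syllabification: str) -> str:
    # Protect дж so that it is never split.
    word = sub_repeatedly("-дж", "-#", syllabification)
    word = sub_repeatedly("дж$|^дж", "#", word)
    # Move one consonant of a cluster to the left of the break.
    word = sub_repeatedly(f"({vowels_c})-({cons_c})({cons_c.replace('ь', '')}+)", "\\1\\2-\\3", word)
    word = sub_repeatedly(f"({cons_c})({cons_c}+)-", "\\1-\\2", word)
    # No lone vowel at the start or end of the word.
    word = sub_repeatedly(f"^({vowels_c})-", "\\1", word)
    word = sub_repeatedly(f"-({vowels_c})$", "\\1", word)
    # Regroup runs of single-vowel syllables in the middle of the word.
    word = sub_repeatedly(f"({vowels_c})((-{vowels_c})*)(-{vowels_c})-", lambda m: m.group(1) + m.group(2).replace("-", "") + m.group(4), word)
    word = sub_repeatedly("#", "дж", word)
    return word


for s in ("по-стро-я", "ма-о-и-зъм", "су-джук"):
    print(s, "->", hyphenation_from_syllabification(s))
```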
@Chernorizets@Kiril kovachev Looks good. In cases where I have to replace a cluster with a single consonant I tend to use Unicode chars in the range U+FFF0 through U+FFFD (which are undefined) rather than chars like # that conceivably might be used for other purposes (e.g. signaling a word boundary). Also keep in mind there is no alternation in Lua patterns so you'd need to break up the second line into two lines in Lua. Other than that it all looks fine. Benwing2 (talk) 20:43, 8 August 2023 (UTC)Reply
@Benwing2 Good point, I'll convert it to one of those characters shortly when I get the time; should I also do this for the hyphenation point character used while generating the hyphenation? I suppose one can imagine if a user entered a hyphenation point into the input of the function it would cause some problems... Thanks for the advice, Kiril kovachev (talkcontribs) 20:47, 8 August 2023 (UTC)Reply
@Kiril kovachev Usually in the code I've written, I end up using . to indicate the hyphenation or syllabification point, but since this character is also used by the user to force a syllable break in a specific location, I first replace user-specified . with a character SYLDIV which is defined to be one of the Unicode chars in the range I give above. Maybe it's cleaner just to use such chars for the hyphenation point, but it makes the regexes a bit harder to read. Benwing2 (talk) 20:55, 8 August 2023 (UTC)Reply
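A small Python sketch of this placeholder idea, for illustration: the name SYLDIV follows the comment above, while the specific code point (U+FFF0) and the helper names are assumptions made for the example.

```python
SYLDIV = "\uFFF0"  # stands for a break the user typed explicitly


def protect_user_breaks(respelling: str) -> str:
    """Swap user-typed '.' for SYLDIV so that '.' is free for internal use."""
    return respelling.replace(".", SYLDIV)


def restore_breaks(text: str) -> str:
    """Turn the placeholder back into '.' once processing is finished."""
    return text.replace(SYLDIV, ".")


protected = protect_user_breaks("на.дробен")
# An algorithm working on `protected` can tell forced breaks (SYLDIV)
# apart from the '.' boundaries it inserts itself.
print(restore_breaks(protected))
```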
@Benwing2 @Kiril kovachev a little status update on my scrappy experiment with Bulgarian syllabification. You can find the code here, and there's a file called syllabification_test_output.txt which shows the test cases I'm using.
Since last time, I've added support for forced syllable breaks using ., as well as an initial version of support for morphological prefixes. A current sore spot for prefix handling is that certain prefixes that need special handling for syllabification - like над- - contain another valid prefix that needs no such handling, in this case на-. You can see the damage in the following test case (copy-pasted from the file I mentioned above):
26. Morphological prefix handling: над- + higher sonority
  • надраствам --> над-ра-ствам ## unrelated problem, ств typically shouldn't be broken apart
  • надмощие --> над-мо-щи-е
  • ненадминат --> не-над-ми-нат
  • безнадзорен --> без-над-зо-рен
  • надница --> над-ни-ца
  • надменност --> над-мен-ност
  • надлъж --> над-лъж ## eek! from here on down
  • надробен --> над-ро-бен
  • надрънкам --> над-рън-кам
  • надраскам --> над-ра-скам
  • надрусам --> над-ру-сам
  • надран --> над-ран
I'll think about some ways to handle that in my current framework, and I welcome ideas. I'm planning an international trip, so I might not be very much around in the next few days, but I'll get to your comments as soon as I can.
Cheers,
Chernorizets (talk) 11:22, 10 August 2023 (UTC)Reply
@Chernorizets Hello, thanks for the update. I don't see how it's even possible to tell whether a word is morphologically prefixed with на- or над- (or just happens to start with either), so surely your . is already as good as it can get? I ran the code with those latter test cases changed to "на.длъж", "на.дробен", "на.дрънкам", "на.драскам", "на.друсам", "на.дран", and it produced:
  • на.длъж --> на-длъж
  • на.дробен --> на-дро-бен
  • на.дрънкам --> на-дрън-кам
  • на.драскам --> на-дра-скам
  • на.друсам --> на-дру-сам
  • на.дран --> на-дран
Which seem fine to me. Are there perhaps circumstances where you can definitely tell whether на- or над- is used depending on sonority in some way? But at any rate consider e.g. надробен: this word is made up of на- (na-) +‎ дроб (drob) +‎ -ен (-en), but what if there were an identical word над- (nad-, over-, out-) +‎ роб (rob, slave) +‎ -ен (-en, past participle ending)? Between these there is no way to tell which one it should be, both could be correct, and it surely depends on the editor to give the right one sometimes, because there's otherwise no telling. Or am I missing something in this picture? Nice work, though, thanks for keeping us updated, Kiril kovachev (talkcontribs) 11:51, 10 August 2023 (UTC)Reply
@Kiril kovachev@Chernorizets Yes, in cases like this the best you can do is have a reasonable default and require an explicit syllable break in places that don't fit the default. In most Slavic languages, на- is more common than над- so it maybe makes sense to prefer a division with на- if possible. Benwing2 (talk) 18:48, 10 August 2023 (UTC)Reply
@Benwing2 exactly, над- seems to occur mostly before "м" and "н", and I already have a feature that would break apart "дм" and "дн" between two vowels. I'll try to see whether the situation improves, and the number of cases which require an explicit syllable break drops, if I simply kick "над" out of my prefix list. The same would be true for "под-", but not "пред-". I'll add tests to confirm. Chernorizets (talk) 21:27, 10 August 2023 (UTC)Reply
@Benwing2 @Kiril kovachev here's my (probably) last update on the prototype syllabifier:
  • I got rid of "над-" and "под-" from the morphological prefix list and, in general, ended up with way fewer cases where an explicit syllable boundary needs to be provided. This is what we surmised/expected.
  • I've added a lot more content to the README.md:
    • A translation of the syllable breaking rules from the Bulgarian textbook
    • A (hopefully) thorough explanation of how I've implemented that
My hope is that at this point, there's enough information about my sources and methods that you could port this to Lua after trying more test cases and tweaking some of the algorithm's cranks and levers. If there are specific test cases you'd like me to try out, or questions you have about the code, I'm happy to help with both.
Thanks,
Chernorizets (talk) 06:22, 12 August 2023 (UTC)Reply
@Chernorizets@Kiril kovachev Awesome, thanks! Kiril, can you port this to Lua? Benwing2 (talk) 07:13, 12 August 2023 (UTC)Reply
@Chernorizets Excellent work again, thanks! I apologise for my tardiness in providing the syllabification rules translation, it looks like you got the jump on me. @Benwing2 Of course, I would love to; however, I ought to announce I won't be free for all of today nor tomorrow morning: sorry for this unnecessary delay; nevertheless I'm excited to get this working...! I'll let you know when I become free again and I've gotten started. Kiril kovachev (talkcontribs) 09:48, 12 August 2023 (UTC)Reply
@Kiril kovachev OK no problem, take your time with your real-life issues :) ... Benwing2 (talk) 23:36, 12 August 2023 (UTC)Reply
That's okay, it wasn't a real issue per se :) I was just celebrating my friend's 18th birthday, but I've now returned and am about to start porting. Thanks for your patience, though, Kiril kovachev (talkcontribs) 14:10, 13 August 2023 (UTC)Reply
@Benwing2 @Chernorizets Please check out the latest Module:User:Kiril kovachev/bg-pronunciation. I ported all the code, and it largely works, barring a few exceptions that, for whatever as-yet-undiagnosed reason, are still not generating as in the original. I believe I'm being vexed by some nasty off-by-one errors due to Lua indexing from 1 instead of 0, meaning some of the literal translations from Java were not thoughtful enough on my part and are problematic. I don't know, I made a few changes that fixed many of the incorrect instances, but some still remain. I think I'll try again tomorrow to fix them, but I'm glad we got this much done so far at least. I also copied the testcases as well, which currently tests for parity with the Java program's output – are there any tests that the original program generated strange syllabifications for, or is all in order now? Kiril kovachev (talkcontribs) 18:39, 13 August 2023 (UTC)Reply
@Kiril kovachev I took a cursory look, and I'll look again in the next day or so. Here are a couple of thoughts.
If you already know this, sorry for belaboring the point. Java strings are indexed starting with 0. I'll use the string "котка" as an example:
  • its length() is 5.
  • valid indices for the charAt method go from 0 to length() - 1, or in this case from 0 to 4
  • per the contract of the substring(startIdx, endIdx) method, the returned substring starts at character startIdx, and ends at character endIdx - 1 of the original string. There are several consequences to this:
    • the length of the returned substring is endIdx - startIdx
    • substring(startIdx, startIdx) returns an empty string
    • the largest possible endIdx is the length of the original string
I abide by those conventions in my code as well. The meat of the index-based logic occurs when handling the consonants in-between two vowels - let's call their 0-based indices firstVowel and secondVowel. Then:
  • the consonant cluster in-between them starts at index firstVowel + 1, and ends at index secondVowel - 1. In other words, substring(firstVowel + 1, secondVowel), with length secondVowel - firstVowel - 1.
  • the case of no consonants between two vowels is equivalent to secondVowel == firstVowel + 1, which is to say they are adjacent characters in the string. You can also see that by observing that in this case, secondVowel - firstVowel - 1 is precisely 0.
  • in the case of "котка", firstVowel == 1, secondVowel == 4, and substring(2, 4) == "тк".
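The index conventions above can be checked with a small Python sketch, since Python slices are half-open in the same way as Java's substring. The helper names and the vowel set are assumptions for illustration; the second helper emulates a 1-based, inclusive convention (as in Lua's string.sub) to show where the +1/-1 adjustments land.

```python
VOWELS = set("аеиоуъюя")  # assumed vowel inventory for this sketch


def vowel_indices(word):
    """0-based positions of the vowels in a word."""
    return [i for i, ch in enumerate(word) if ch in VOWELS]


def cluster_between(word, first_vowel, second_vowel):
    """0-based, half-open: same contract as Java's substring(start, end)."""
    return word[first_vowel + 1:second_vowel]


def cluster_between_1based(word, first_vowel, second_vowel):
    """Same cluster, but with 1-based vowel positions and an inclusive end."""
    start, end = first_vowel + 1, second_vowel - 1  # 1-based, inclusive
    return word[start - 1:end]                      # back to a Python slice


word = "котка"
first_vowel, second_vowel = vowel_indices(word)          # 1 and 4
print(cluster_between(word, first_vowel, second_vowel))  # тк
print(cluster_between_1based(word, first_vowel + 1, second_vowel + 1))  # тк
```

Both calls give the same cluster; only the bookkeeping around the vowel positions differs.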
It may make your life significantly easier to rewrite portions of this code to use regular expressions - as is apparently common practice here - instead of character index math. I wrote it that way because I was rusty on the JDK classes and methods for regex handling.
As for test cases that don't syllabify as nicely as they could/should (?):
  • безразборен --> без-ра-збо-рен
    case of two (or more) prefixes. "разборен" would have given "раз-бо-рен". This situation is lacking from my test cases.
  • безчестен --> без-че-стен
    not wrong per se. Per the textbook, the "ст" causes the syllable break to be ambiguous. Another option (better IMO) would have been без-чес-тен, but I don't think we can (or should) do much about it. There are plenty of other words where the "ст" belongs with the next vowel, as in по-сти-гам.
  • надве --> над-ве
    this is rare enough that you can fix it with an explicit break: на.две
  • над.раствам --> над-ра-ствам
    this follows the note from the textbook that "ств" should be kept together; however, this specific case shows that we don't handle suffixes like "-вам". This should have been над-раст-вам. Suffix support would be a new feature.
  • надраскам --> на-дра-скам
    similar to the previous one. Here you can claim, per the textbook, that the syllable boundary is ambiguous. You'd get на-драс-кам if we handled the suffix, which is added to other onomatopoeic interjections - крякам, пискам, мъркам, etc. I feel like there aren't hundreds of verbs like that, though, so we could explicitly put the syllable break.
  • подвижен --> под-ви-жен
    should be по-дви-жен. A relatively small number of Bulgarian words start with "дв-", but then you could have prefixes in front. "v" is just different from the other fricatives, as the textbook does say.
  • подклаждам --> под-кла-ждам
    should be под-клаж-дам. "жд" might need to be added to the list of clusters to break.
Not in my test cases, but we should make the module in general aware of /w/ in loanwords from English. Otherwise "уеб" will syllabify as "у-еб", and "даунлоудвам" will syllabify as "да-ун-ло-уд-вам". Since it's a marginal phoneme, perhaps we need a counterpart to . that means "don't break here". That, or we need some sort of loanword database, because I don't think it would be easy to write a code rule. E.g. "даун" is one syllable, but "паун" is two - there doesn't seem to be a clear pattern.
HTH,
Chernorizets (talk) 02:52, 14 August 2023 (UTC)Reply
@Chernorizets@Kiril kovachev Thanks for all the details. For cases like уеб and даун, if these are truly pronounced as a single syllable, I think the best course of action is to use a special symbol in respelling to indicate this; the obvious one is ў from Belarusian. As for надраствам, I wonder if adding special handling for suffixes will make things get too complicated. There are bound to be exceptions in both directions and it will get hard to keep track in one's head of how the syllabification algorithm works (so as to know when to insert manual syllable breaks). Conversely, I do think handling two prefixes would be reasonable, and I do this in the Russian code. Benwing2 (talk) 03:03, 14 August 2023 (UTC)Reply
@Benwing2 older speakers tend to pronounce the "у" in "уеб" as /u/, but younger speakers - especially those who have exposure to English, or those who have l-vocalization in their speech - pronounce it as /w/. I like the idea of using a special character for this, and indeed the Belarusian alphabet has just the one.
I also agree that suffix handling would complicate things further - I guess this is something we can revisit when we observe how often we need to put in explicit syllable breaks.
RE: multiple prefixes - if there's a non-hacky way of tackling that, I'm in.
Chernorizets (talk) 03:10, 14 August 2023 (UTC)Reply
@Chernorizets Well, it seems it should be possible to segment off the prefixes one by one, although what I did in the Russian module Module:ru-pron#L-355 probably qualifies as hacky; there's a list of prefixes including some double prefixes (безраз- is one of them), and in addition the module checks for any prefix preceded by не-, по- or непо-. This was done specifically for deciding when to treat written double consonants as a geminate, because Russian has a lot of written double consonants that aren't pronounced double, but it should be applicable here as well. Benwing2 (talk) 03:21, 14 August 2023 (UTC)Reply
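Segmenting prefixes off one by one could be sketched as follows in Python. The prefix list is a tiny illustrative subset and the "remainder must still contain a vowel" guard is an assumption of this sketch; neither is taken from the Russian module or from the Java prototype.

```python
VOWELS = set("аеиоуъюя")
PREFIXES = ("без", "раз", "дис", "не")  # illustrative subset, not the real list


def split_prefixes(word, prefixes=PREFIXES):
    """Peel known prefixes off the front of a word, longest candidate first."""
    parts = []
    rest = word
    while True:
        for prefix in sorted(prefixes, key=len, reverse=True):
            remainder = rest[len(prefix):]
            # Guard (assumed): only strip if what remains can form a syllable.
            if rest.startswith(prefix) and any(ch in VOWELS for ch in remainder):
                parts.append(prefix)
                rest = remainder
                break
        else:
            break  # no prefix matched on this pass
    return parts + [rest]


print(split_prefixes("безразборен"))
print(split_prefixes("дисбаланс"))
```

A naive matcher like this will also strip "не" from words that merely begin with those letters, which is exactly the kind of case that still needs an explicit syllable break.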
@Chernorizets Thanks for your detailed summary, it was helpful when trying to figure out if there are still any index-related bugs in the code. In hindsight, it may have been better to just emulate the Java-style indexing using custom functions in Lua, but it is now what it is... @Benwing2 Anyway, as an update, I fixed the majority, 11 of 12, of the remaining failing cases, but this last one, надраствам, is a real bother; I'm debugging with the help of your original code as well, but I think again I'll leave the rest of it till tomorrow. The actual cause of the previous bugs, btw, was just a typo in the matches function. I begin to deeply understand why they use compiled languages in enterprise after this encounter... Fortunately we are nearing 100% with this new fix, at least! Kiril kovachev (talkcontribs) 20:37, 14 August 2023 (UTC)Reply
@Kiril kovachev Thanks for the bug fixes. I'm not a real fan of having shims to emulate one language in another; they help in the short run but make maintenance more difficult. Can you tell me what is going wrong with надраствам? Maybe I can take a look. Benwing2 (talk) 21:07, 14 August 2023 (UTC)Reply
Thanks for offering to help, and I would love to point out what the exact, code-related problem may be, but I can't tell with certainty; all I can say is the original Java program generates над.ра.ствам, whereas the current module generates на.дра.ствам. If I had to wager a possible connection (I did a small amount of debugging on this case, but I gave up for the time being), then the Java code, after the line that calculates nextOnset / next_onset (Lua line 690), within syllabifyPoly, has the values nextOnset = 3, prevVowel = 1, i = 4, whereas the module has next_onset = 3, prev_vowel = 2, i = 5, in the first iteration (if you put a mw.log(next_onset, prev_vowel, i) directly after line 690, this'll print the values as I see them): if I'm not mistaken, this is supposed to be 4, 2, 5, because the indices are meant to be one above their values in Java.
In the case of this particular word, execution of code falls all the way down to the last line of fixupSyllableOnset, returning the passed-in sonorityBreak calculated on line 660, passed on 664; changing this final line of fixup_syllable_onset to return sonority_break + 1 fixes this particular word in the Lua version, but ruins loads of other now-working testcases. Therefore I feel there is some kind of problem that occurs, starting with either where the sonority_break is calculated, or with fixup_syllable_onset (which depends on the former), as the index generated appears to be off by 1.
This is what happens when you call the original Java, with the same input and the same debug line to print out that triplet of values:
  • 3 1 4
  • 5 4 8
But the module gives these values:
  • 3 2 5
  • 6 5 9
We want this 3 to be a 4, and it'll work.
Btw, to assure you that this is what we want (i.e. all these three need to be 1 higher than their Java counterpart), check out the syllabification of чо.век, which both codes get right:
  • 2 1 3 (Java)
  • 3 2 4 (Lua)
Also, антиконстиционалност (= ан.ти.кон.сти.ци.о.нал.ност):
Java:
  1. 2 0 3
  2. 4 3 5
  3. 7 5 9
  4. 10 9 11
  5. 12 11 12
  6. 13 12 14
  7. 16 14 17
Lua:
  1. 3 1 4
  2. 5 4 6
  3. 8 6 10
  4. 11 10 12
  5. 13 12 13
  6. 14 13 15
  7. 17 15 18
As you can see, the desired result, if I'm thinking correctly about this, is that the Lua should have Java values + 1. For whatever reason, some part of the process where next_onset (the first value of this triplet) is calculated is wrong, specifically for (cases like) надраствам, and is falling short of the desired index. By this point I hope that a) what I'm saying is correct and isn't totally wrong and may misguide us both, and b) that it will be fairly easy to fix by just checking everywhere where it could go wrong starting from here.
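The "Lua value should be Java value + 1" check can be done mechanically on the logged triples. The Python sketch below uses only the numbers quoted above; the function and field names are made up for this illustration.

```python
FIELDS = ("next_onset", "prev_vowel", "i")


def off_by_one_mismatches(java_rows, lua_rows):
    """List (iteration, field, java_value, lua_value) wherever lua != java + 1."""
    mismatches = []
    for row, (java_vals, lua_vals) in enumerate(zip(java_rows, lua_rows), start=1):
        for field, j, l in zip(FIELDS, java_vals, lua_vals):
            if l != j + 1:
                mismatches.append((row, field, j, l))
    return mismatches


# Triples as quoted in this thread for надраствам and човек.
nadrastvam_java = [(3, 1, 4), (5, 4, 8)]
nadrastvam_lua = [(3, 2, 5), (6, 5, 9)]
chovek_java = [(2, 1, 3)]
chovek_lua = [(3, 2, 4)]

print(off_by_one_mismatches(nadrastvam_java, nadrastvam_lua))
print(off_by_one_mismatches(chovek_java, chovek_lua))
```

Run on these numbers, the only deviation reported is next_onset in the first iteration for надраствам, which matches the observation that this 3 should be a 4.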
If this is too much to do all at once, I'll do my best again tomorrow, but sadly I've got to sleep for now... I hope this is of help. Thanks again for your offer of help! Kiril kovachev (talkcontribs) 21:51, 14 August 2023 (UTC)Reply
As for compiled languages, Lua is especially bad with its lax parameter checking and its allowance for undeclared variables silently having a value of nil. I think there is a "strict mode" to catch instances of the latter but the lax parameter checking can't be helped because it's a "feature" of the language and it's how you implement optional parameters in functions. Benwing2 (talk) 21:11, 14 August 2023 (UTC)Reply
Yeah, this is understandable on the one hand, but I simply despise the bugs that come because of it. Specifically, mistyped variables and missed-out parameters are effectively at the core of most Lua bugs I've written up until now—the actual problematic-logic-related ones have been super sparse in comparison. If only the system of parameters were a bit more fleshed out, this kind of bug would be much less common, but I hear Lua is specifically lightweight like this in order to be small and portable, so I guess there are compromises to be made, too, that can't just be fixed without costing something else. Kiril kovachev (talkcontribs) 21:54, 14 August 2023 (UTC)Reply
@Kiril kovachev Please take a careful look at the test cases in my test file. I changed "надраствам" to "над.раствам" to force a syllable break there (line 213 in the test file). A few of your other test cases which use explicit syllable breaks don't need them, since I removed "над" from the prefix list several changes ago. This might explain the discrepancy you're seeing. Chernorizets (talk) 10:00, 15 August 2023 (UTC)Reply
@Chernorizets I see, I may have forgotten to pull the latest changes on my local build. Sorry if that's the case... Kiril kovachev (talkcontribs) 12:29, 15 August 2023 (UTC)Reply
@Benwing2 @Kiril kovachev FYI I've now implemented multiple prefix support - turned out to be a straightforward modification of the existing code in PrefixSeparator. You can see the updated test output file for examples. I also added "жд" to the consonant clusters to break, which fixed a couple of cases.
At this point, feature-wise, the TODOs are:
  • handle compound words with spaces, or connected with hyphens. Should be as simple as tokenizing the string and invoking the syllabification routine on each piece.
  • (low priority) handle /w/, both as a regular consonant in the IPA transcription, and for the needs of syllabification/hyphenation, likely by introducing Ў ў. I can code that up if there's interest.
  • (low priority) handle mixed Latin/Cyrillic, as in SIM карта (SIM karta). IMHO, this case is rare enough that we can "handle" it by manually writing out the syllables.
Chernorizets (talk) 09:32, 15 August 2023 (UTC)Reply
Went ahead and just did /w/ :-) You can see examples with or without using ў in the test file. Chernorizets (talk) 10:48, 15 August 2023 (UTC)Reply
@Chernorizets Thanks, I ported the ў-related changes, as well as the support for multiple prefixes. Things are looking good. The only divergence my code made from yours was the treatment of ў as a sonorant: according to w:Sonorant, semivowels such as /w/ are included. I found it would be simpler to simply put it in the correct phonetic set (those which we had already), rather than to make special exceptions in the isConsonant function and other places; but this produces a discrepancy in the word даунлоуд: we get дау‧нлоуд instead of даун‧лоуд. Is this legit, or is this problematic? Why should /w/ get a higher sonority than other sonorants?
By the way, I also made a normalization in which у + combining breve accent is converted into ў as a single letter; I don't know how you've been typing it, but I don't have a Belarusian keyboard, so I imagined using this in the same way as we do accents in the IPA template, i.e. adding a combining diacritic instead of typing a distinct, composed character. Should we also strip off acute/grave accents in the output syllabification, or is it better to keep them? Thanks, Kiril kovachev (talkcontribs) 16:52, 15 August 2023 (UTC)Reply
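The normalization described here might be sketched like this in Python. The function names are hypothetical, and removing stress marks by deleting only the combining acute and grave (rather than decomposing the whole string) is an assumption chosen so that precomposed letters such as ѝ are left alone.

```python
BREVE = "\u0306"        # combining breve
ACUTE = "\u0301"        # combining acute accent
GRAVE = "\u0300"        # combining grave accent
SHORT_U = "\u045e"      # ў as a single precomposed character
SHORT_U_CAP = "\u040e"  # Ў


def compose_short_u(text):
    """Turn у + combining breve into the single letter ў (both cases)."""
    return text.replace("у" + BREVE, SHORT_U).replace("У" + BREVE, SHORT_U_CAP)


def strip_stress_marks(text):
    """Drop combining acute/grave marks, e.g. for display of a syllabification."""
    return text.replace(ACUTE, "").replace(GRAVE, "")


typed = "у" + BREVE + "еб"     # what someone without a Belarusian layout types
print(len(typed), len(compose_short_u(typed)))
print(strip_stress_marks("ко" + ACUTE + "тка"))
```

Keeping the two steps separate makes it possible to feed the accented form to other code first and strip the marks only when the syllabification is displayed.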
@Kiril kovachev I would go to the Belarusian alphabet article on Wikipedia and copy-paste it from there :-) It's a single character in Unicode, so idk how correct it is to treat it as у + combining breve - maybe that's a Benwing question.
I thought about including it in the sonorant category for the very reason you pointed out, but as an essentially borrowed phoneme, I wasn't sure about it. It might simplify some things though. "даунлоуд" is expected to break (which influenced my coding choices) because you end up with three sonorants in a row, which doesn't happen in Bulgarian. The algorithm will, predictably, split the first and the second one, marking the boundary as у-нл. To fix that, you could slightly abuse the list of clusters to break, and add this one by specifying that the break should be on the "л". I say "abuse" because, except for this one example, the list does only contain clusters in rising sonority. What it buys you, though, is that you could remove the "ул", "ур" entries from that same list. If /w/ is treated as a sonorant, they would be broken up by default.
RE: stripping off acute/grave accents in the output - probably? I don't usually see stress marked on a syllabified word. Nice observation! Chernorizets (talk) 19:18, 15 August 2023 (UTC)Reply
@Chernorizets If we do this (add ун-л as a broken cluster), will that be safe and not negatively affect the other normal cases? I can't think of many words with this cluster, in honesty, so it would seem fine to me, but just to be sure.
Also, thinking some more, it occurs to me that if we want to feed the syllabification as an input to the IPA generation, then we mustn't strip the accents until after this's happened, i.e. only when it comes time to display the syllabification itself. Kiril kovachev (talkcontribs) 18:01, 16 August 2023 (UTC)Reply
@Chernorizets@Kiril kovachev I think it might be cleaner to treat ў with the same sonority as й. Benwing2 (talk) 19:47, 16 August 2023 (UTC)Reply
@Benwing2 I agree. If there are discrepancies with the scheme that the Java program uses — they can be corrected with syllable breaks. By the looks of it, these are very rare anyway, though, especially given the rarity of ў in the first place. Kiril kovachev (talkcontribs) 22:03, 16 August 2023 (UTC)Reply
@Kiril kovachev OTOH I think it would be safe, and it would allow us to remove the two other clusters with ў that I added manually. It's a small-enough change that I'd say - just give it a shot :-) Chernorizets (talk) 22:04, 16 August 2023 (UTC)Reply
@Chernorizets Sorry for misunderstanding, but do you mean to treat ў separately + with sonority 3 like you wrote it initially, or with the same sonority as й and just in the sonorant character set? Kiril kovachev (talkcontribs) 22:09, 16 August 2023 (UTC)Reply
@Kiril kovachev I think it would be safe to treat it as a sonorant as you have (and as linguistics dictates lol) I just made the necessary changes to the Java code and no test cases broke - you can take a look if you'd like. Chernorizets (talk) 22:13, 16 August 2023 (UTC)Reply
Haha, okay, I shall take a look tomorrow when I get the chance — anyway sounds good :) Kiril kovachev (talkcontribs) 22:17, 16 August 2023 (UTC)Reply
@Kiril kovachev on second thought, for даунлоуд specifically, I think we should just provide an explicit syllable break rather than adding the cluster to the exception list. Otherwise, we'd also need to add the cluster in даунгрейд etc. Our goal isn't to syllabify English - the special character that represents /w/ handles common words like уиски and боулинг, and IMO that's good enough. For cases that don't boil down to simple sonority decisions, IMO we should just use the special character in conjunction with explicit syllable breaks. Chernorizets (talk) 23:01, 16 August 2023 (UTC)Reply
@Chernorizets I agree we shouldn't add the cluster to the exception list, but if we treat ў as a sonorant, will it not automatically handle cases like даунлоуд and даунгрейд? Benwing2 (talk) 01:58, 17 August 2023 (UTC)Reply
@Benwing2 it won't. "даўнлоўд" contains a cluster of three successive sonorants - ўнл - which is pretty rare in Bulgarian. Looking in the Bulgarian corpus, the only lemma examples that I find are оформление (рмл) and детайлност (йлн), and some non-lemma forms of коктейл and портфейл. The code currently determines a syllable break when sonority stops going up, which means that it will deduce a syllable break between the first and second of three consecutive sonorants. So you'd get даў-нлоўд, о-фор-мле-ни-е and де-тай-лност. Of these three, only оформление is fine, because мл- is a valid word-initial cluster and doesn't sound odd. The first and third would need an explicit syllable break.

In "даўнгрейд", the cluster is "ўнгр" - again, sonority stops rising after the first sonorant, in this case ў. So you get даў-нгрейд.

Tentatively, I can see a rule that says "a cluster starting with 2 sonorants breaks on the following consonant", but idk that it's worth adding for such a small class of words.
Chernorizets (talk) 02:32, 17 August 2023 (UTC)Reply
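A stripped-down version of the "break where sonority stops rising" rule reproduces the splits described above. This is a Python sketch only: the three-level ranking and the consonant sets are assumptions, and it leaves out the prefix handling, the list of clusters to break, and the onset fix-ups that the real code has.

```python
VOICELESS = set("пткфсшхцчщ")
VOICED = set("бвгджз")
SONORANTS = set("лмнрйў")  # ў counted as a sonorant, like й


def sonority(ch):
    """Assumed ranking: voiceless obstruent < voiced obstruent < sonorant."""
    if ch in SONORANTS:
        return 3
    if ch in VOICED:
        return 2
    if ch in VOICELESS:
        return 1
    raise ValueError(f"not a consonant handled by this sketch: {ch!r}")


def split_cluster(cluster):
    """Split an intervocalic cluster into (coda, onset).

    The break goes before the first consonant whose sonority does not rise
    relative to the one before it; if sonority rises throughout, the whole
    cluster becomes the onset of the next syllable.
    """
    for k in range(1, len(cluster)):
        if sonority(cluster[k]) <= sonority(cluster[k - 1]):
            return cluster[:k], cluster[k:]
    return "", cluster


for cluster in ("ўнл", "ўнгр", "рмл", "йлн", "сб"):
    print(cluster, split_cluster(cluster))
```

With all sonorants sharing one rank, any run of two or more sonorants breaks after the first one, which is why даўнлоўд and детайлност come out wrong while оформление happens to look fine.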
@Chernorizets I see. I would suggest, if possible, to not treat all sonorants equal, but give them a hierarchy of sonorities like this: ў/й > р > л > м,н. (I don't know if this means that детайлност gets broken as детайлн.ост; I hope not, as there should be a rule requiring at least one consonant at the beginning of a syllable.) Benwing2 (talk) 02:38, 17 August 2023 (UTC)Reply
@Benwing2 if someone wants to figure out, implement and regression-test this, more power to them :-) IMHO, given the applicability, it's a needless complication. Chernorizets (talk) 04:03, 17 August 2023 (UTC)Reply
@Benwing2 I implemented the hyphenation-from-syllabification scheme we discussed above, and it appears to be mostly fine, just with some hiccups resulting from the current handling of spaces & hyphens in syllabification; I take it those will resolve when we do the tokenization thing and the hyphenation transformation will then receive correct input.
@Chernorizets Also, I want to alert you to a potential issue, which is the output of айсберг: your current program appears to generate ай-сберг, but shouldn't it be айс-берг? This may also be correct, but as a hyphenation it wouldn't be. It may be possible to correct for this in the hyphenation transformation itself, as long as this is a fine syllabification. Thanks, Kiril kovachev (talkcontribs) 18:55, 15 August 2023 (UTC)Reply
@Kiril kovachev "с" has a lower sonority than "б", so the algorithm wouldn't try to break up "сб". Here are the top ~15 words by frequency in the Bulgarian National Corpus that have "сб" between two vowels:
страсбург 4262
стихосбирка 2694
дисбаланс 1626
пасбище 822
лесбийка 767
водосборен 740
Експресбанк 668
Хебросбанк 513
йоханесбург 487
несбъднат 357
Висбаден 283
Агробизнесбанк 282
Дуисбург 247
Кросби 241
лесбийски 214
So "стихосбирка", "водосборен", and "несбъднат" are examples where the "сб" should be together. I'd handle "айсберг" with an explicit syllable break - "айс.берг". I'm also reminded that we need to include "дис-" in the list of prefixes, so that we get дис-ба-ланс instead of ди-сба-ланс. Chernorizets (talk) 19:30, 15 August 2023 (UTC)Reply
@Chernorizets Ah, I see, that makes sense, I was mistaken to think that it would generate the same syllable boundaries as the hyphenation patterns. Also I'll add the дис- prefix in the module right now. Kiril kovachev (talkcontribs) 22:29, 15 August 2023 (UTC)Reply
Also, FWIW, a Bulgarian researcher tried to throw ML at it: https://ideas.repec.org/a/vra/journl/v7y2018i3p133-139.html. The paper is a bit flimsy on examples, though. Chernorizets (talk) 03:02, 14 August 2023 (UTC)Reply
Would be interested in seeing what they say but I can't read Bulgarian and Google Translate just makes a mess of it. Not completely sure how you can do this unsupervised; it would seem you need at least some training examples to help it along. As you mention there are hardly any examples or formulas, which makes me suspicious of the quality of the paper. Benwing2 (talk) 03:08, 14 August 2023 (UTC)Reply
@Benwing2 Here's the paper's main idea:
  • compile a wordlist of the language based on a representative corpus
  • for each word, consider the different ways it can be broken into syllables, using some universal syllable rules from the quoted literature. The example given is the word "червен", with possible segmentations: че-рвен, чер-вен and черв-ен.
  • applying this breakdown to each word in the wordlist, count the occurrences of all putative syllables, grouping by whether they're word-initial, medial or final.
  • going back to the червен example, evaluate each possible segmentation by adding up the frequencies of the syllables, paying mind to whether they're initial or final. The segmentation with the highest "score" wins.
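The counting and scoring steps in this summary can be sketched in a few lines of Python. Everything below is an assumption for illustration: the paper's actual candidate generation and scoring details are not given here, and the candidate lists for the words other than червен are invented toy input, not corpus data.

```python
from collections import Counter


def position(index, total):
    """Classify a syllable slot; a lone syllable is treated as initial."""
    if index == 0:
        return "initial"
    if index == total - 1:
        return "final"
    return "medial"


def count_syllables(candidates_by_word):
    """Count every putative syllable over all candidates, keyed by position."""
    counts = Counter()
    for candidates in candidates_by_word.values():
        for segmentation in candidates:
            for idx, syllable in enumerate(segmentation):
                counts[(syllable, position(idx, len(segmentation)))] += 1
    return counts


def best_segmentation(candidates, counts):
    """Pick the candidate whose position-aware syllable counts sum highest."""
    def score(segmentation):
        return sum(counts[(syl, position(i, len(segmentation)))]
                   for i, syl in enumerate(segmentation))
    return max(candidates, key=score)


# Toy input: only the червен candidates come from the summary above.
candidates_by_word = {
    "червен": [("че", "рвен"), ("чер", "вен"), ("черв", "ен")],
    "черга": [("че", "рга"), ("чер", "га")],
    "славен": [("сла", "вен"), ("слав", "ен")],
}
counts = count_syllables(candidates_by_word)
print(best_segmentation(candidates_by_word["червен"], counts))
```

On a wordlist this small the winner depends entirely on which toy words are included, which illustrates why the approach needs a large, representative corpus to say anything useful.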
Since there's no real data on how well this works, IMO the main useful part of the paper is the bibliography, which summarizes the state of the art in either universal rules for syllabification, or semi-supervised/unsupervised approaches.
Chernorizets (talk) 03:23, 14 August 2023 (UTC)Reply
@Chernorizets Thanks. Yes, this approach makes a certain amount of sense but (a) the lack of data means this would never get accepted in your typical ML conference; (b) it would need to be way more sophisticated to be usable in real-world situations (and hence it would need a semi-supervised approach that can leverage training data); (c) I've thought for a while about how Wiktionary can leverage ML approaches but it's hard: unless you have close to 100% precision you can't realistically deploy it on its own, meaning in practice a sophisticated non-ML algorithm may be better. At least with non-ML algorithms it's possible to anticipate where the algorithm will go wrong and add exceptions (like manual syllable breaks) accordingly, but with an ML algorithm you just have to run it and see. That slows down your human-in-the-loop approach a lot, and if you tweak the algorithm (causing some things to get fixed but others to break in unpredictable ways), everything goes to hell. Benwing2 (talk) 03:41, 14 August 2023 (UTC)Reply
@Benwing2 you've summed up my conundrum with ML today - we know that it works, just not (always) why it works, or what the model means. In a way, I think of it as gravity in Newton's time - he could describe it and write down the model (=formula) for it, but it wasn't until Einstein that we understood what the model really meant. On a more practical note, idk how you'd even deploy a ML model here without erecting a ton of infra first. Chernorizets (talk) 03:55, 14 August 2023 (UTC)Reply
@Chernorizets Yes, you'd need support from the MediaWiki developers, unless it was a really simple model (like logistic regression or something) where you could conceivably write the inference part in Lua and have a module containing the training data (maybe?). Benwing2 (talk) 03:59, 14 August 2023 (UTC)Reply
@Chernorizets, @Benwing2: Hello again, I just now finished two main finalizations to the syllabification code:
  1. Made it tokenize words, and then just syllabify each one individually, putting them back together at the end. This basically fixed most of the previous bugs, where that was going wrong.
  2. Made it so that it doesn't casefold its output anymore; now, it just does this internally within its various functions, so that the final user-facing output keeps the same case as it started out with, e.g. Вайерщрас -> Ва.йер.щрас.
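The two changes listed above (tokenize first, casefold only internally) can be illustrated with a Python sketch. The function names are hypothetical, and `toy_breaks` is a deliberately crude stand-in for the real syllabification algorithm, so its break positions should not be read as the module's output.

```python
import re

VOWELS = set("аеиоуъюя")
SEPARATORS = r"[ \-]+"  # spaces and hyphens between words


def toy_breaks(lower_word):
    """Stand-in rule: break before a consonant that is followed by a vowel,
    provided some vowel already precedes it."""
    return [i for i in range(1, len(lower_word) - 1)
            if lower_word[i] not in VOWELS
            and lower_word[i + 1] in VOWELS
            and any(ch in VOWELS for ch in lower_word[:i])]


def syllabify_word(word):
    """Compute breaks on a lowercased copy, then cut the original string."""
    breaks = toy_breaks(word.lower())  # assumes lower() keeps the length
    pieces, prev = [], 0
    for b in breaks:
        pieces.append(word[prev:b])
        prev = b
    pieces.append(word[prev:])
    return ".".join(pieces)


def syllabify_text(text):
    """Syllabify each token separately and keep the separators untouched."""
    tokens = re.split(f"({SEPARATORS})", text)
    return "".join(tok if re.fullmatch(SEPARATORS, tok) else syllabify_word(tok)
                   for tok in tokens)


print(syllabify_text("Нова Загора"))
print(syllabify_text("светло-син"))
```

Because the break positions are computed on the lowercased copy but applied to the original string, capital letters survive in the output, and separators are passed through unchanged so the pieces can be glued back together.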
With this, is there anything more left to do about syllabification? I feel we're in a very good place now, the whole groundwork has been laid, and I'd say it's fairly polished too. The above edge cases have now been fixed and all. What are the next steps from here? Kiril kovachev (talkcontribs) 01:03, 20 August 2023 (UTC)Reply
Also I forgot, I still haven't ported these, they're still at Module:User:Kiril kovachev/bg-pronunciation. Kiril kovachev (talkcontribs) 01:05, 20 August 2023 (UTC)Reply
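For reference, the tokenize-then-syllabify flow and the case-preserving output described above can be sketched roughly as follows. This is a Python illustration, not the Lua in the module; `toy_break_points` is a hypothetical placeholder rule (break before the consonant directly preceding each non-initial vowel) and does not reproduce the real algorithm's output. It also assumes input without combining stress marks.

```python
import re

VOWELS = set("аеиоуъюя")

def toy_break_points(word_lower):
    """Placeholder syllabifier (NOT the module's rules): returns break indices."""
    breaks = []
    seen_vowel = False
    for i, ch in enumerate(word_lower):
        if ch not in VOWELS:
            continue
        if seen_vowel:
            # Break before the consonant right before this vowel,
            # or directly before the vowel if it follows another vowel.
            breaks.append(i - 1 if word_lower[i - 1] not in VOWELS else i)
        seen_vowel = True
    return breaks

def syllabify_text(text, sep="."):
    """Split on spaces/hyphens, syllabify each word, and reassemble."""
    tokens = re.split(r"([\s\-]+)", text)  # capture group keeps separators
    out = []
    for tok in tokens:
        if not tok or re.fullmatch(r"[\s\-]+", tok):
            out.append(tok)
            continue
        # Casefold only internally; breaks are applied to the original token.
        breaks = set(toy_break_points(tok.lower()))
        out.append("".join((sep if i in breaks else "") + ch
                           for i, ch in enumerate(tok)))
    return "".join(out)

print(syllabify_text("Добър ден"))
```

The point of the sketch is only the structure: break positions are computed on a lowercased copy, then inserted into the untouched original, so capitals survive.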
@Kiril kovachev Do you have something that's production ready? If so we can put it in place after some testing. Benwing2 (talk) 01:23, 20 August 2023 (UTC)Reply
@Benwing2 Yes, it should be; what's currently in that module is, at least in my eyes, suitable for deployment. As in, it has (pending feedback) all the syllabification features I think we need, all the tests under the testcases page pass, and I think there aren't any areas left unfinished. That reminds me of one exception, which is that I still need to update the hyphenation-from-syllabification function so that it works with uppercase letters as well, but that's a small fix.
That said, it just occurred to me that it may now be possible to use the syllabification output as input to the IPA, so perhaps that could be something to finish before deploying anything. However, as its own unit, I guess that syllabification is basically done. Kiril kovachev (talkcontribs) 01:43, 20 August 2023 (UTC)Reply
@Kiril kovachev code looks good, please add some test cases to the syllabification routine for compound words with hyphens and/or spaces. Other than that, it looks ready.
Is there a goal to eventually have syllabification alongside hyphenation for Bulgarian terms? For now, the only customer of the syllabification logic is hyphenation.
Chernorizets (talk) 05:13, 20 August 2023 (UTC)Reply
@Chernorizets Code-wise: very well, I'll a) fix the hyphenation problem from above and b) get some more tests in.
Syllabification-wise: I'm not really sure, I guess we need to decide which to keep; I think the main obstacle to including syllabifications until now has been the lack of code for it, but it's completely feasible to make a template and have it displayed in parallel with the hyphenation we've been showing already. It's just, I don't know how much utility there is in having both—although I would personally appreciate having both, the difference may be lost on people, because it's not immediately obvious what the difference is meant to be, or what rules there are for either.
Also, what label should be displayed if the two are the same? Right now it's easy to make something like this:
  • IPA(key): [t͡ʃo̟ˈvɛk]
  • Syllabification: чо‧век
  • Hyphenation: чо‧век
  • Rhymes: -ɛk
...of course this is a little redundant. Maybe just "Syllabification and hyphenation"? Kiril kovachev (talkcontribs) 19:07, 20 August 2023 (UTC)Reply
@Kiril kovachev take a look at assimiloitua - there's a "key" next to the Syllabification header which takes one to an appendix about Finnish syllabification and hyphenation. I guess they've taken the approach of showing the syllabification, and then explaining in the appendix where hyphenation might differ. What I think we can take away from this is the idea of having an appendix people can go to if they see something that surprises them.

As for template support, I wonder if we could make {{bg-hyph}} return both, with perhaps an optimization to only show one header if syllabification == hyphenation. Just an idea. Chernorizets (talk) 22:50, 20 August 2023 (UTC)Reply
@Chernorizets @Kiril kovachev Yes, I would be in favor of this approach of showing only one if they are the same. Benwing2 (talk) 22:55, 20 August 2023 (UTC)Reply
@Benwing2 @Chernorizets Both of these sound like great ideas to me too. I was also entertaining the idea of a key, and it would basically explain what you need to know to form the hyphenation from the syllabification, but wouldn't it be nicer if we could display the hyphenation anyway, so that people wouldn't need to remember the rules and form it themselves? I like how it works in {{es-pr}}: the extra details are collapsible, so they don't take up space unless you want them to, which I think is a good approach; we could have it show e.g. syllabification, and if you hit "show more", then drop down hyphenation as well. Of course, having a key would still be valuable to explain both: we could transcribe the rules for hyphenation and syllabification there for reference.
What should the header be if syllabification and hyphenation are the same? Kiril kovachev (talkcontribs) 00:10, 21 August 2023 (UTC)Reply
@Kiril kovachev I dunno? "Hyphenation/syllabification"? Benwing2 (talk) 00:14, 21 August 2023 (UTC)Reply
@Benwing2 Sure, that's good. I suppose there's no need to overthink it, let's just put something suitable and go with it. It can always be changed anyway if we need it to be. Kiril kovachev (talkcontribs) 00:17, 21 August 2023 (UTC)Reply
How's this? I updated my local version of bg-hyph to work like this:
First example, joined
  • Hyphenation(key): при‧мер
Second example, separate
  • Syllabification(key): на‧здра‧ве
  • Hyphenation(key): наз‧дра‧ве
@Benwing2, Chernorizets Sorry if the formatting looks off, it's behaving a little funny in the reply preview... but more or less is this what we're going for? Kiril kovachev (talkcontribs) 00:40, 21 August 2023 (UTC)Reply
@Kiril kovachev (can't use the Reply button anymore lol) Let's just keep it Hyphenation when they're both the same, rather than inventing a new "mashup" header. It seems like hyphenation is more common than syllabification on Wiktionary, so it doesn't hurt to stay consistent with that. We can use the appendix we create to clarify that where syllabification and hyphenation agree, we only show hyphenation. Chernorizets (talk) 00:58, 21 August 2023 (UTC)Reply
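The display rule agreed on here (one "Hyphenation" line when both results match, two lines otherwise) could look something like the sketch below. It is Python rather than template/Lua code, and `hyphenation_lines` is a hypothetical name, not something in {{bg-hyph}}.

```python
def hyphenation_lines(syllabification, hyphenation):
    """Return the label lines to display for a term."""
    if syllabification == hyphenation:
        # Identical results: show a single line under the usual header.
        return [f"Hyphenation(key): {hyphenation}"]
    return [
        f"Syllabification(key): {syllabification}",
        f"Hyphenation(key): {hyphenation}",
    ]

for line in hyphenation_lines("на‧здра‧ве", "наз‧дра‧ве"):
    print(line)
```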
@Chernorizets (lol, sorry about that. still works for me... ) Fair, I'll change it to that then. This is also pending the addition of a superscript to link to that key. I can't remember, did we ever transcribe/translate the syllabification rules from that Bulgarian textbook? Those anyway would definitely make a good addition there. Kiril kovachev (talkcontribs) 01:09, 21 August 2023 (UTC)Reply
@Kiril kovachev there's a translation in the README.md file in my syllabification prototype, in the Overview section - take a look, and feel free to change it as you see fit. The text before the rules introduces some of the finer points of syllable breaking, but I haven't translated all that. Chernorizets (talk) 01:20, 21 August 2023 (UTC)Reply
@Benwing2 @Kiril kovachev FWIW, I reached out to both the Institute for Bulgarian Language at the Bulgarian Academy of Sciences, and a couple of phonology experts at Sofia University (SU) to check if there are any newer or more detailed theories of syllabification for Bulgarian which might e.g. make finer distinctions between different sonorants, or handle the weirdness of "v" (which often behaves as if it were a sonorant). I got a reply from one of the professors at SU. TL; DR - what we have is probably as good as it gets, and there isn't much new research happening on syllabification specifically.
A few ideas that might be valuable for a future iteration of the algorithm:
  • favor syllabifications that maximize the number of open syllables, i.e. syllables ending in a vowel. We don't always do this due to our choices of which consonant clusters to explicitly break apart, even if sonority analysis would've kept them together.
  • for our list of consonant clusters to break - even when they're ordered in rising sonority - favor clusters that are rare word-initially. I was already sort of doing that by using a frequency-ordered corpus-based Bulgarian wordlist when determining these clusters, but it can be made more explicit.
  • "v" was a sonorant in the past, and might still be closer to sonorants than to fricatives (since it really derives from Proto-Balto-Slavic /w/)
And yes, sonorants aren't all equally sonorous: m < n < r, l; no mention of j, w. However, I don't see a need to make implementation changes in view of this, since existing rules take care of native sonorant clusters well enough.
Cheers,
Chernorizets (talk) 06:32, 26 August 2023 (UTC)Reply
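One possible reading of the open-syllable idea, as a rough Python sketch: given several candidate syllabifications of the same word, prefer the one with the most vowel-final syllables. The helper names are hypothetical, and how candidates would be generated is left out entirely; this is not how the current module works.

```python
VOWELS = set("аеиоуъюя")

def count_open_syllables(syllabified, sep="."):
    """Count syllables that end in a vowel letter."""
    return sum(1 for syl in syllabified.lower().split(sep)
               if syl and syl[-1] in VOWELS)

def prefer_open(candidates):
    """Pick the candidate with the most open syllables (first one wins ties)."""
    return max(candidates, key=count_open_syllables)

print(prefer_open(["пос.ле", "по.сле"]))
```

A tie-breaker (for example, the existing cluster rules) would still be needed in practice, since many words have several candidates with the same count.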
@Chernorizets Thanks! Maybe add a FIXME comment in the code indicating what you think needs to be fixed long-term; this way, it's more likely to get done eventually than if it just hangs out in a discussion page. But otherwise I think what we have is fine. Benwing2 (talk) 06:36, 26 August 2023 (UTC)Reply
@Chernorizets Nice, thanks for reaching out. It's good to hear we're not missing out on any cutting-edge methods. I like the idea of favoring open syllables—how would this translate to code? Maybe when deciding where to split, identify if it's possible to form a valid syllable that ends in just the vowel? I feel like this might involve some lookahead, however, so, if I'm understanding correctly, that sounds like a small step up in difficulty.
Either way, I'm also happy with what we have now; I think it's stable and sophisticated enough that the occurrence of errors will be really rare.
BTW, I added some multi-word testcases to the WIP module, but there are a few mistaken ones on there—I'm guessing these stem from just my ignorance of the expected syllabifications, e.g. после becomes по.сле, instead of пос.ле as I had expected. There are 5 error cases—do you mind confirming whether these are actually meant to be passing, or whether they need manual overrides? Thanks again, Kiril kovachev (talkcontribs) 10:41, 26 August 2023 (UTC)Reply
@Chernorizets @Benwing2: shall we port the syllabification update to the main module now?
Also, I wrote an additional note at Wiktionary:About Bulgarian describing the difference between hyphenation and syllabification. Would we also want a sort of annotation included in {{bg-hyph}}'s output to link to that explanation, like the (key) annotation on the IPA template? Kiril kovachev (talkcontribs) 20:46, 11 December 2023 (UTC)Reply
@Kiril kovachev No objections from me. Benwing2 (talk) 22:20, 11 December 2023 (UTC)Reply
@Kiril kovachev which syllabification update are you referring to? Rather, is the behavior of {{bg-hyph}} going to change as per what you added to WT:ABG? As for the idea of having an annotation to take readers to a page where they can see the details of Bulgarian syllabification vs hyphenation - I'm on board. In fact, that's what we have for Finnish (see e.g. huomenta). Chernorizets (talk) 00:10, 12 December 2023 (UTC)Reply
@Chernorizets Yes, exactly that, that same syllabifier you coded up on GitHub first that we then ported to the module. The update would indeed bring it in line with the update I made to ABG a moment ago. Kiril kovachev (talkcontribs) 00:16, 12 December 2023 (UTC)Reply
@Kiril kovachev I honestly thought we had already ported that :-) Go ahead, and please monitor for a while to make sure we don't have module errors showing up on Bulgarian pages. Chernorizets (talk) 00:36, 12 December 2023 (UTC)Reply
@Chernorizets Whoops, I think we might've been discussing some new addition or something and forgot about it — I think we still wanted to figure out how to treat „дз“ as a single affricate, which we never did sort out: I'll experiment some these days with the underscore-as-a-merger idea perhaps.
Anyway, now copying it over... Kiril kovachev (talkcontribs) 00:39, 12 December 2023 (UTC)Reply
@Kiril kovachev for "dz", the rule is that it corresponds to non-affricate /dz/, unless it starts a syllable (including the case where it's word-initial), and in the rare case where it's word-final (chiefly foreign names). For syllabification, we indicate a dz-onset internal syllable by using the explicit syllable break character ".". Chernorizets (talk) 00:45, 12 December 2023 (UTC)Reply
Ah, that's right, I forgot what exactly that was about. What was it we wanted to add, again, do you remember...? Kiril kovachev (talkcontribs) 00:48, 12 December 2023 (UTC)Reply
@Kiril kovachev I'm not sure we ever made it super explicit, but my thinking is:
  • ^dz - affricate
  • dz$ - affricate
  • .dz - affricate (with the explicit syllable break before it)
  • elsewhere - /d/ + /z/
Chernorizets (talk) 02:06, 12 December 2023 (UTC)Reply
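The four cases listed above can be written down as a small classifier. This is a Python sketch under the assumption that the input is a single lowercase respelled word using "." as the explicit syllable break; `classify_dz` is a hypothetical helper, and the sample inputs are illustrative strings only.

```python
def classify_dz(respelling):
    """Label each "дз" as 'affricate' or 'cluster' per the rules above."""
    results = []
    start = respelling.find("дз")
    while start != -1:
        word_initial = start == 0
        word_final = start + 2 == len(respelling)
        after_break = start > 0 and respelling[start - 1] == "."
        kind = ("affricate"
                if word_initial or word_final or after_break
                else "cluster")  # elsewhere: /d/ + /z/
        results.append((start, kind))
        start = respelling.find("дз", start + 2)
    return results

print(classify_dz("брин.дза"))
```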
@Chernorizets For the record, is this for hyphenation, syllabification, or IPA? I feel like we can already get this behavior for syllabification/hyphenation if we just put in the break character like you say, and otherwise it always seems to default to having the дз together, so I guess you mean IPA? We could maybe do some regexing to fix that if that's the case? Kiril kovachev (talkcontribs) 12:42, 12 December 2023 (UTC)Reply
@Kiril kovachev Longer-term, what I'd like to propose is that we refactor Module:bg-pronunciation into submodules - e.g. Module:bg-pronunciation/IPA, Module:bg-pronunciation/syllables, etc. - with the top-level module being dedicated to what would one day be {{bg-pr}}, putting all the pieces together. This is not an immediate-term request - just something to consider. Chernorizets (talk) 00:40, 12 December 2023 (UTC)Reply
@Benwing2 I think you had input on this when we discussed it last, but what do you think? It sounds good to me, and ultimately separation of concerns seems like a good practice. The module is getting rather long at the moment, which some might conventionally call a code smell. @Chernorizets Kiril kovachev (talkcontribs) 00:46, 12 December 2023 (UTC)Reply
@Kiril kovachev @Chernorizets I tend not to mind long modules, as long as they are well structured, but I have no objections to splitting into submodules. Benwing2 (talk) 01:26, 12 December 2023 (UTC)Reply
@Chernorizets I made an Appendix page (Appendix:Bulgarian hyphenation) to explain the differences of hyphenation and syllabification — what do you think? In Template:User:Kiril kovachev/bg-hyph, it's now implemented, and displays a little "(key)" above the labels to point to the relevant section. Kiril kovachev (talkcontribs) 21:35, 23 December 2023 (UTC)Reply
@Kiril kovachev it looks great! Thanks for putting this together! Chernorizets (talk) 23:27, 24 December 2023 (UTC)Reply
@Chernorizets Great! It's been pushed, then. In the next few days I might try to make some more features, so stay tuned perhaps :) Kiril kovachev (talkcontribs) 00:38, 25 December 2023 (UTC)Reply

acute and grave accents[edit]

@Atitarev, Kiril kovachev, Chernorizets Several years ago, grave accents were used to indicate the primary stress. User:Atitarev and I decided to switch this to acute accents for consistency with other Slavic languages, and I did a bot run of this sort. However, I've been thinking recently that maybe this makes things more confusing for the end user as other dictionaries use grave accents for this purpose. What does everyone think? I am not at all opposed to keeping things the way they are, just trying to think about what makes the most sense for end users; I'm not sure.

There's the issue of secondary stress if grave accents are used for primary stress. Some people (some dictionaries?) use acute accent for this, but IMO that just makes things impossibly confusing. In Italian, where both acute and grave accents indicate primary stress (but mark different vowel qualities), my solution was as follows when generating IPA: (1) If two or more primary stresses occur in the respelling, all but the last one are converted to secondary stress; (2) If there's a need to indicate a secondary stress following the primary stress (which is rare in Italian), this is done using a different diacritic (in Italian, a line-under is used, e.g. a̱ e̱ i̱ o̱ u̱). Such a solution wouldn't work well if it's common to find secondary stresses both before and after the primary stress (as in English), but I don't know if this is common in Bulgarian.

Note also that a similar solution to Italian is used in Portuguese respelling. Portuguese uses acute and circumflex accents to indicate different vowel qualities (é ó = /ɛ/ /ɔ/, ê ô = /e/ /o/), potentially leaving grave accent open to indicate unstressed vowels. However, it decided for consistency with other dictionaries to use a grave accent to indicate an unstressed unreduced vowel in European Portuguese pronunciation, which normally reduces unstressed vowels, similarly to Russian, but has many words where unexpectedly you have an unstressed unreduced vowel; e.g. for pregar, there's a minimal pair with respelling pregar /pɾɨˈɡaɾ/ "to nail" vs. prègar /pɾɛˈɡaɾ/ "to preach". Secondary stress is handled as in Italian, if there are two primary stresses, the first is converted to secondary, and secondary stress after the primary stress (which is basically nonexistent) can be indicated using a line-under. Benwing2 (talk) 21:12, 8 August 2023 (UTC)Reply
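Rule (1) above, where every primary stress except the last is converted to secondary, amounts to a one-line string transformation on the generated IPA. A minimal Python sketch follows; the function name is made up for illustration, and the sample strings are abstract rather than real transcriptions.

```python
PRIMARY = "ˈ"
SECONDARY = "ˌ"

def demote_all_but_last_primary(ipa):
    """Turn every primary stress mark except the last into a secondary one."""
    last = ipa.rfind(PRIMARY)
    if last == -1:
        return ipa  # no stress marks at all
    return ipa[:last].replace(PRIMARY, SECONDARY) + ipa[last:]

print(demote_all_but_last_primary("ˈaˈbaˈka"))
```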

@Benwing2 personally, I'm in favor of keeping things as they are. I hadn't even paid attention that Bulgarian dictionaries used the grave accent to indicate stress until I started contributing to this project :-) I'd rather we not solve a problem that we don't know exists. The current way stress is indicated seems confusion-free to me; if someone does actually complain at a future point in time, then we could revisit this.
Note that, if at some point we support Banat Bulgarian (written in the Latin alphabet), it would be trickier because e.g. "a" represents /ɤ/ and "á" represents /a/ - two distinct phonemes, not just vowel qualities. This isn't a today problem, and I'm working on obtaining some Banat dictionaries to see what they do to indicate stress, but just FYI.
As for secondary stress, I'm yet to find a reference to such a notion in Bulgarian reference sources. The English notion of secondary stress doesn't really apply to Bulgarian, e.g. whereas an English speaker would say (modulo dialect) IPA(key): /ˌpɑləˈtɪʃən/, we say IPA(key): [poliˈtik] with no secondary stress. The vast majority of Bulgarian words have a single stressed syllable. In the handful of cases with more than one stressed syllable, we pronounce those syllables with equal prominence, so it's identical to using 2+ acute accents. As an example, one such word is коскоджамити (koskodžamiti), a Turkish loan, which the Dictionary of Bulgarian Language renders as "ко̀скоджа̀мити", i.e. with no distinction between the stress on the first and third syllables.
Cheers,
Chernorizets (talk) 22:13, 8 August 2023 (UTC)Reply
@Chernorizets OK sounds good, didn't realize that about secondary stress in Bulgarian. Benwing2 (talk) 22:46, 8 August 2023 (UTC)Reply
@Benwing2 I'll be honest with you—I'm with Chernorizets on this one in that I had no idea how Bulgarian dictionaries annotate stress until I started editing; and even then for a while it made no impression on me, since I only started after acutes had become the norm. Ideally, if this site were only Bulgarian, it should be graves in order to accord with the dictionaries, but I equally don't think it's possible for either choice to be totally unambiguous to readers, since, using graves, other Slavic readers may be confused as to why only Bulgarian terms have graves everywhere, and meanwhile using acutes, Bulgarian learners or even advanced users will notice it disaccording with the references' usage.
Also, it may not be so set in stone that native Bulgarian sources always use the grave: in the book I was reading just recently, the author quite regularly uses acutes to mark the stress of a word and disambiguate it from a homograph. Now, I don't know if this could be explained as just a single author's nonce usage, but the book was published by a mainstream publisher, and I dare say this is good enough motivation to stick with what we've got as opposed to making any sweeping changes, which I'm sure would be troublesome for you if you had to hunt down and transform all the acutes.
In truth, grave may be slightly better because the primary consumers of the Bulgarian parts of the dictionary are the main ones on whom the impression of grave vs. acute will be made, but I think it's just not worth worrying too much about it: if it ain't broke...
I was also going to say the same thing as Chernorizets re: secondary stress, and re: your comment above, because it's also been on my mind how we handle those kinds of words on here: I've been writing basically all possible pronunciation permutations in those cases where the dictionaries provide multiple grave accents, e.g. on автоплуг, I list 3 possibilities for pronunciation, assuming the initial stress to potentially be either primary or secondary or just absent. In reality, treating all these as primary stress is probably much more sensible and it would help to do away with this type of mess... However, it's still pertinent to consider the semantics of having multiple primary stresses in a term, regarding rhymes: does it become the case that the rhyme is based off of the last stressed syllable, then? E.g. would the rhymes for автоплуг be Rhymes: -uk?
Finally, how to write headwords using multiple primary stresses? As the dictionaries do in one go, e.g. а́втоплу́г? Kiril kovachev (talkcontribs) 23:07, 8 August 2023 (UTC)Reply
@Kiril kovachev All good. Yes, I'd simply write the term with multiple primary stresses. Now, the Ukrainian and Russian modules will complain if you do that in order to discourage people from using multiple acute accents to indicate cases where the stress can be in one of two places (or sometimes three places, cf. роже́ница (rožénica) or рожени́ца (roženíca) or (proscribed) ро́женица (róženica)), but if this is currently disallowed we can certainly allow it. Benwing2 (talk) 23:19, 8 August 2023 (UTC)Reply
As for rhymes, yes I'd guess that if there are multiple primary stresses, the rhyme is based off of the last one. This sometimes happens even in English with words that have a secondary stress after the primary stress, as in the famous Shakespearean sonnet:
Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate.
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date.
Here, the word temperate was pronounced in Shakespeare's time with secondary stress on -ate /eːt/, and he rhymed based on this. Benwing2 (talk) 23:23, 8 August 2023 (UTC)Reply
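Basing the rhyme on the last primary stress could be sketched as below: take everything from the vowel of the last stressed syllable onward. This is Python for illustration; the vowel inventory and the function name are assumptions, the input is expected without brackets, and the second sample transcription is a simplified placeholder rather than the module's actual output.

```python
IPA_VOWELS = set("aeiouɛɔɤɐ")

def rhyme_from_ipa(ipa):
    """Return the rhyme starting at the vowel after the last primary stress."""
    last = ipa.rfind("ˈ")
    # With no stress mark (e.g. a monosyllable), scan from the beginning.
    tail = ipa if last == -1 else ipa[last + 1:]
    for i, ch in enumerate(tail):
        if ch in IPA_VOWELS:
            return "-" + tail[i:]
    return None  # no vowel found

print(rhyme_from_ipa("t͡ʃo̟ˈvɛk"))
```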
Excellent, thanks for the great quote. With regard to rhymes I'll be posting again in the coming days, since I'm hoping to get moving working on a template similar to your {{es-pr}}, specifically getting it to integrate with all the features we've been discussing so far on here. Including rhymes. Kiril kovachev (talkcontribs) 00:01, 9 August 2023 (UTC)Reply
@Kiril kovachev Cool, looking forward to the new template. Benwing2 (talk) 01:19, 9 August 2023 (UTC)Reply
@Kiril kovachev, @Benwing2, @Chernorizets: (Kiril kovachev) I think I recall what I asked you about in reference to the secondary stress. It was probably about comparative and superlative forms. I remember vaguely that the answer was that the stress is equal, so Bulgarian probably misses the secondary stress. I am glad this is now resolved.
I am easy about changing back to using grave accent but you all seem to agree that it's OK to leave as is. BTW, some inconsistency exists with Macedonian, where stress is predictable, but when it's not, it's not always marked in headwords and elsewhere. Various symbols are used inside {{mk-IPA}} for the input, when the stress is irregular.
I pinged on автоплуг edit regarding multiple stresses. In translations for pencil#Translations, I used both variants like this: мо́лив (bg) m (móliv), моли́в (bg) m (molív), rather than мо́ли́в (bg) m (mólív) (with two stresses). Giving two translations may not be ideal but less confusing than two stresses, IMO. Anatoli T. (обсудить/вклад) 01:11, 10 August 2023 (UTC)Reply
@Atitarev @Kiril kovachev @Benwing2 the template {{bg-noun}} already has support for indicating multiple stresses, by specifying the |head2=, |head3= etc. parameters (see examples in the template documentation). We would most definitely not put two accents on the same word to indicate more than one possible placement of the stress. The only valid cases of using multiple accents ought to be words with multiple stressed syllables, like recently added ко́скоджа́мити (kóskodžámiti). Chernorizets (talk) 03:15, 10 August 2023 (UTC)Reply
@Chernorizets how do the dictionaries notate words with identical meaning but just two possible stresses? I would indicate a word like брава (brava) to check this against, but unfortunately I can't find that the usual sources even state the second possible stress. If we see a word, in e.g. RBE, with two graves in one word, can we safely assume to port the same spelling to here? Kiril kovachev (talkcontribs) 09:51, 10 August 2023 (UTC)Reply
@Kiril kovachev in cases of identical meaning, but different possible stress, we should probably limit ourselves to variants for which we could find sources. For example, I've only ever heard бра́ва (bráva) (initial stress), but if all the dictionaries claimed it should be брава́ (bravá), then that's what I'd put in the Wiktionary entry. I recently had that experience with сумрак, which I've always pronounced сумра́к, but which according to all 3 dictionaries we use is су́мрак. I wouldn't put my pronunciation in there in that case.

As for what dictionaries do when a word has more than one possible stress, it's every dictionary for itself. The case of молив (moliv) is interesting, because the Chitanka dictionary gives both possibilities separately:
мо̀лив и молѝв, мн. мо̀ливи и молѝви, (два) мо̀лива и молѝва
whereas the Bulgarian Academy of Sciences dictionary puts them together:
МО̀ЛЍВ м.
I didn't even see that before, because grave accents on all-caps text are hard to see, esp in whatever font they're using. It's only when I copy-pasted it here and put it in a code block that I could actually see it. There aren't enough words in any of the languages I speak for me to express how much I hate that, Kiril - visually, didactically, and from the POV of least user confusion. I pray/beg that we never do this here, and stick with our current model of just writing out different-stress options separately. Chernorizets (talk) 12:10, 10 August 2023 (UTC)Reply
@Chernorizets That is truly wretched, lol, that makes it exceptionally hard to tell in what form to write the headwords on Wiktionary, and if it weren't for Chitanka's disambiguation there would be good potential for that exact error; so in the end, the only way for us to tell if a word actually has two primary stresses, or just two possible stress locations, is to use a source other than RBE to confirm? I support your conclusion, though, I don't know what I would've understood from headwords with multiple stress marks before we got into this discussion—I guess the average user would just be very confused. Even knowing this notation, I still prefer the clarity and explicitness of our normal approach. Definitely let's stick to that. Kiril kovachev (talkcontribs) 12:29, 10 August 2023 (UTC)Reply
Also @Atitarev, maybe a possible solution in Translations would be to just write out the lemma with no stress, and let the article speak for how the stress can be placed. Ultimately this is how it's treated as far as organizing pages, and there isn't a true semantic difference regardless of which accent you use - so you could potentially just show neither. Kiril kovachev (talkcontribs) 09:55, 10 August 2023 (UTC)Reply
@Chernorizets@Kiril kovachev@Atitarev Personally I am fine with putting two translations in the Translations section to show the different stresses. Benwing2 (talk) 18:45, 10 August 2023 (UTC)Reply
I agree, whatever feels best should be fine. There's no real space limit after all, so we could very well have both/all the accented forms too. Kiril kovachev (talkcontribs) 18:50, 10 August 2023 (UTC)Reply

ў (/w/) in IPA as well[edit]

@Benwing2, @Chernorizets: as we've already discussed in the above syllabification thread, there's the occasional marginal segment of /w/, I guess mostly from English loanwords, that we're representing as ў in syllabification so that it doesn't get mistreated as a vowel. In that case, should we also have it work this way in the IPA generation, generating /w/ as a phoneme instead of /u/? I currently did this in my namespaced module, and it appears to be fine, i.e. generates e.g. [wɛp] for ўеб (web), and other examples—I'm about to add a few more test cases as examples, if you want to see them.

Further, about the implementation: is it unnecessary to add ў to the phonetic character map, as it needs to be handled in a composed manner, i.e. it cannot be processed (as all characters besides й are right now) after decomposing off the acute/grave accents from the text? I put the conversion for ў in line with the conversion for <й> --> [j], so I guess there's no use for keeping it in the phonetic_chars_map at the top? Also, is there equally a need to have й in there? Kiril kovachev (talkcontribs) 22:36, 15 August 2023 (UTC)Reply

@Kiril kovachev answering the IPA question specifically - yes, we should surface /w/ in the IPA transcription. For example, "уиски" should have the transcription IPA(key): /ˈwiski/. Chernorizets (talk) 22:41, 15 August 2023 (UTC)Reply
@Chernorizets Nice, we are indeed getting /ˈwiski/ with this :) something weird going on with Ўо́рўик though, which is somehow coming out with /ŏˈɔrwik/. (I feel like I didn't write this?! Evidently, some debugging necessary) ...however mostly things are looking fine, I added /w/ as a consonant so that the current IPA stress movement process works fine, however we ought to integrate it with the syllabification when we're fully ready with any further fixes to it. Kiril kovachev (talkcontribs) 23:00, 15 August 2023 (UTC)Reply
P.S. The above bug has now been fixed. Kiril kovachev (talkcontribs) 23:07, 15 August 2023 (UTC)Reply
@Kiril kovachev Awesome, good to hear. Yes, I agree that it should be integrated into the IPA. As for composing/decomposing, I think the correct thing when decomposing is to recompose й and ў back to single characters after decomposing. This is what I do, for example, in Module:ru-common. That way they can be handled like all other chars. Benwing2 (talk) 06:08, 16 August 2023 (UTC)Reply
@Benwing2 Ah yes, nice idea, I went and did that and it looks to be working fine. Do you recall why the IPA function was originally converting first to composed form, by the way? I converted it to express the term directly in decomposed form, then reconstruct й and ў, without going through the composed stage. It was originally (before adding ў):
	term = rsub(mw.ustring.toNFC(term), "й", "j")
	term = mw.ustring.toNFD(mw.ustring.lower(term))
Now I made it more like:
	term = mw.ustring.toNFD(mw.ustring.lower(term))
	term = rsub(term, "у" .. BREVE, "ў") -- recompose ў
	term = rsub(term, "и" .. BREVE, "й") -- recompose й
Is there any difference in outcome that you can think of with this change? To me it seems the same, but I'm not totally confident. Kiril kovachev (talkcontribs) 18:20, 16 August 2023 (UTC)Reply
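For anyone wanting to check the decompose-then-recompose step outside of Scribunto, here is a rough Python equivalent of the second snippet, using the standard `unicodedata` module in place of `mw.ustring`. The function name `normalize_term` is made up for this sketch.

```python
import unicodedata

BREVE = "\u0306"  # combining breve

def normalize_term(term):
    """Lowercase, decompose to NFD, then restore й and ў as single characters."""
    term = unicodedata.normalize("NFD", term.lower())
    term = term.replace("у" + BREVE, "ў")  # recompose ў
    term = term.replace("и" + BREVE, "й")  # recompose й
    return term

result = normalize_term("Ўо́рўик")
print([hex(ord(ch)) for ch in result])
```

After this step, stress marks remain separate combining characters while й and ў can be looked up in a character map like any other letter.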
@Kiril kovachev Text on MediaWiki is NFC by default. The initial call to NFC assumes the text was NFD at some earlier point. It looks like what you're doing is fine though. Benwing2 (talk) 19:50, 16 August 2023 (UTC)Reply
@Benwing2 Does this mean text passed into a module is always pre-converted into NFC by MediaWiki, or is this just how users typically input characters?
Anyway--if the word is needed later on in the function in NFD form, what would be the purpose of making it NFC at first, as the code was doing before? It composed the text, converted й (independently of the IPA char map) to /j/ and then decomposed the text again. Why not just decompose it from the get-go? Kiril kovachev (talkcontribs) 20:15, 16 August 2023 (UTC)Reply
@Kiril kovachev When you save a page, MediaWiki converts it to NFC, so the text will always start out NFC. In general if you need to work with decomposed chars, it makes the most sense to decompose at the very beginning as you suggest, and then (optionally) compose at the end (if you don't compose at the end it often doesn't matter as MediaWiki will compose it for you). For some languages, e.g. Spanish, it makes more sense to work the whole way through with composed characters, but Bulgarian isn't that way. Benwing2 (talk) 20:56, 16 August 2023 (UTC)Reply
@Benwing2 Oh, wow, I never realized that, I always thought that the accents were kept as-is and decomposed. Very interesting, thanks for sharing! Kiril kovachev (talkcontribs) 22:05, 16 August 2023 (UTC)Reply
P.S. I suppose this is because it doesn't compose Cyrillic letters + acute accent, as I'm assuming these don't have a composed Unicode codepoint like the Latin ones do, so they are always kept separate; whenever I'm editing an existing Bulgarian IPA parameter, it's properly kept as letter + combining accent. I never noticed that it was indeed working as you said on the ў, however... Kiril kovachev (talkcontribs) 22:20, 16 August 2023 (UTC)Reply
@Kiril kovachev Yes, there are no precomposed combinations of Cyrillic letters and acute accent, and for grave accent there are only ѐ and ѝ. Benwing2 (talk) 02:01, 17 August 2023 (UTC)Reply
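For anyone following along outside of Lua, the decompose-first approach discussed above can be sketched in Python. This is only an illustration under assumptions: the function name `decompose_keeping_breve_letters` is made up for this example, and the standard `unicodedata` module stands in for `mw.ustring.toNFD`.

```python
import unicodedata

BREVE = "\u0306"  # combining breve

def decompose_keeping_breve_letters(term: str) -> str:
    """Lowercase and decompose a term, then put й and ў back together.

    Hypothetical helper mirroring the Lua snippet in this thread:
    accents stay as separate combining marks, while й and ў remain
    single letters so later rules can treat them as consonants.
    """
    term = unicodedata.normalize("NFD", term.lower())
    term = term.replace("у" + BREVE, "ў")  # recompose ў
    term = term.replace("и" + BREVE, "й")  # recompose й
    return term
```

Because Cyrillic а plus a combining acute has no precomposed codepoint, NFC leaves that pair alone, which matches the observation that stressed vowels always show up as letter + combining accent.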

Bug (or feature request?): vowel reduction is only applied on the first word of multi-word text[edit]

Example: "умолявам всички да се обърнат към министъра с въпроси"

  • {{bg-IPA|умоля́вам вси́чки да се объ́рнат към мини́стъра с въпро́си}}
    gives: IPA(key): [omoˈlʲa̟vɐm ˈfsit͡ʃki dɐ sɛ oˈbɤrnɐt kɐm miˈnistɐrɐ s vɐˈprɔsi]
    should be: IPA(key): omoˈlʲa̟vɐm ˈfsit͡ʃki dɐ sɛ oˈbɤrnɐt k(ɤ, ɐ)m miˈnistɐrɐ s vɐˈprɔsi

This is relevant to phrases: e.g. the Bulgarian phrasebook, sayings, expressions, etc. The rule IMO simply ought to be that any vowel not explicitly marked with a stress gets its unstressed reduced variant, just as in individual words. I don't remember if the module handled this correctly before the recent batch of changes we've made to the IPA functions. This is not a high-priority request, as we have only a small handful of phrases on Wiktionary at present.

For attn: @Kiril kovachev. Chernorizets (talk) 22:05, 17 August 2023 (UTC)Reply

@Chernorizets Fortunately, I don't think this is a regression, as the following is the output I got from the IPA code prior to any changes: [ʊmoˈlʲa̟vɐm ˈfsit͡ʃkʲi da sɛ ɔˈbɤrnət kɤm miˈnistərə s vɤˈprɔsi]. I guess this just needs some change to the reduction logic to make it work for every word — I'll take a look. The exact problem looks quite weird, e.g. the first у in умоля́вам is reduced, but the о in объ́рнат isn't — isn't that kind of odd? Anyway, thanks for alerting me. I shall see what's up ^^^ Kiril kovachev (talkcontribs) 22:41, 17 August 2023 (UTC)Reply
@Chernorizets Is this good now? IPA(key): [omoˈlʲa̟vɐm ˈfsit͡ʃki dɐ sɛ oˈbɤrnɐt kɐm miˈnistɐrɐ s vɐˈprɔsi]
Exactly as you pointed out, only the first word was reduced. The reason was that we had the following pattern to reduce vowels before the stress (if there is any at all), (#[^#" .. accents .. "]*)(.), which previously checked if the second captured element (.) was a #, and if so deemed the word monosyllabic and didn't reduce it. I changed this to reduce it anyway, if the whole original term was polysyllabic. But we still had the problem that e.g. for this phrase, #umɔˈlʲavam# #ˈvsit͡ʃki# #da# #sɛ# #ɔˈbɤrnat# #kɤm# #miˈnistɤra# #s# #vɤˈprɔsi#, it would match the #umɔˈ, reducing it, but then all subsequent matches were just # #! This is because the pattern didn't consume the rest of the word up to its closing boundary, so it saw that next hash sign as the "start" of a new word, which of course is abruptly terminated by the true start of the next. Thus all that was being reduced were just sequences of "# #". I changed the pattern to (#[^#ˈ]*)(.-)#, which now swallows the final # of every word, hopefully. I found this quite a humorous revelation: when I checked all the matches in the console, I was very startled to see that :') Anyway, I hope I haven't broken anything with this fix, although I do note this has, possibly correctly, broken във Фра́нция: the output is now [vɐf ˈfrant͡sijɐ] instead of [vɤf ˈfrant͡sijɐ], showing that the vowel's now getting reduced. Is it safe to convert this test case to what we're now getting? Kiril kovachev (talkcontribs) 00:15, 18 August 2023 (UTC)Reply
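The matching problem described above can be reproduced outside the module. The following is a rough Python sketch, not the module's code: Lua's lazy `.-` is written as `.*?`, the accent class is reduced to the primary stress mark only, and the names `OLD_PATTERN`, `NEW_PATTERN` and `matched_spans` are invented for the illustration.

```python
import re

STRESS = "ˈ"  # only the primary stress mark; the module's accent set is larger

# Old shape: stops right after one character, leaving the closing # unconsumed.
OLD_PATTERN = re.compile(rf"(#[^#{STRESS}]*)(.)")
# New shape: lazily consumes the rest of the word, including its closing #.
NEW_PATTERN = re.compile(rf"(#[^#{STRESS}]*)(.*?)#")

def matched_spans(pattern, text):
    """Return the full text of every non-overlapping match, in order."""
    return [m.group(0) for m in pattern.finditer(text)]

phrase = "#umɔˈlʲavam# #da# #sɛ#"
old_matches = matched_spans(OLD_PATTERN, phrase)
new_matches = matched_spans(NEW_PATTERN, phrase)
print(old_matches)
print(new_matches)
```

With the old shape, every match after the first starts at the closing # of the previous word and ends at the opening # of the next one, so no later word is ever seen as a whole. With the new shape, each match covers exactly one word between its two hash signs.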
@Kiril kovachev if you have already made the change, I'd double-check all Bulgarian multi-word examples - it's common practice today to not put a stress on single-syllable words, so some examples may need stress added to them. Chernorizets (talk) 00:32, 18 August 2023 (UTC)Reply
@Chernorizets No, I didn't change the main module yet, just the working one - I suppose I'll check if there are any differences with those multiword ones, before pushing it out. Is there a category like that for multi-word terms? Kiril kovachev (talkcontribs) 00:40, 18 August 2023 (UTC)Reply
@Kiril kovachev yup - check under добър ден. Chernorizets (talk) 00:42, 18 August 2023 (UTC)Reply
@Chernorizets Nice, thanks. Kiril kovachev (talkcontribs) 00:43, 18 August 2023 (UTC)Reply
@Kiril kovachev the new IPA looks good for this particular test case, and yes - I'd change the във Франция test case to match the new output. Technically, the vowel in във is closer to a stressed ъ, but we're not surfacing that level of detail in the current transcription. Chernorizets (talk) 03:45, 18 August 2023 (UTC)Reply
@Chernorizets Okay, I changed the test case. What exactly is the reason for the vowel quality being more like stressed than not? I personally thought that too, but I excused it because it seems like in the flow of a sentence that unstressed is a plausible realization... But maybe we should just put a primary stress on the във if that's the case. Kiril kovachev (talkcontribs) 10:50, 18 August 2023 (UTC)Reply
@Kiril kovachev we should only use primary stress on words that are actually stressed within a phrase. As in other languages, small words like conjunctions, prepositions and pronouns between the main meaning-bearing words in a phrase are unstressed. If we put a stress, it would show up in the IPA transcription, and mislead people into thinking that the word should be pronounced with a stress. RE: the specific example - for now, it's the best we can do. The full rules for unstressed vowel reduction are somewhat more subtle than what we've chosen to implement, and that was intentional to keep the resulting IPA 1) still mostly correct, and also 2) more learner-friendly. Chernorizets (talk) 20:45, 18 August 2023 (UTC)Reply
@Chernorizets I see, that does seem fair enough — I have been lately entertaining the idea of a list of small words and so on which could perhaps be handled differently from general words, which could help with the reduction change that we just made — we are basically still operating under the assumption that single words should not be reduced, but they now will be without a manually-input stress, because of the change. It could be made so that all words are kept as-is, but these reducible small words are still reduced in whatever position they occur. Alternatively, we could have the grave accent signify unreducedness, seeing as you have identified that secondary stress is not really a feature of Bulgarian, and so the grave should well be freed up to do something else with it. Just some suggestions, though, I don't know the best solution to this problem with any degree of certainty... Kiril kovachev (talkcontribs) 21:15, 18 August 2023 (UTC)Reply
@Chernorizets I looked over multiword terms, and I added accents to a few terms which needed them to remain correct after the change. Hopefully I haven't missed any, but, anyway, the change has been pushed, which fixed a number of other strange errors with multiword terms. IPA(key): [omoˈlʲa̟vɐm ˈfsit͡ʃki dɐ sɛ oˈbɤrnɐt kɐm miˈnistɐrɐ s vɐˈprɔsi] Also, I'm noticing a need to address the nature of с/в joining onto their subsequent word, because [s] is not right in this pronunciation, is it? I think it should be [z]. Kiril kovachev (talkcontribs) 19:06, 18 August 2023 (UTC)Reply
This is the current effort to fix that с/в problem: IPA(key): [f dɛˈnʲa̟], IPA(key): [s vɐˈprɔsi]. I don't know about the notation, but if this voicing is what we want, then this should be a fix for that problem. Kiril kovachev (talkcontribs) 19:25, 18 August 2023 (UTC)Reply
@Kiril kovachev the notation is incorrect - the joiner over the letters (don't know its technical name) is used in IPA for affricates, like дж, ч, дз and ц. There is no affricate "vd" or "zv" in Bulgarian, and I'm not even sure it would be pronounceable :-) See my other comment about just writing those together with the next word specifically inside of {{bg-IPA}}. Chernorizets (talk) 20:53, 18 August 2023 (UTC)Reply
Whoops... I'll be getting rid of this then! Kiril kovachev (talkcontribs) 21:22, 18 August 2023 (UTC)Reply
What is more, I feel there may in general be a need to consider subsequent words when deciding whether to devoice final consonants of words: consider без време (bez vreme), which should be [bez vrɛmɛ], but is coming out as IPA(key): [bɛs ˈvrɛmɛ], devoiced nonetheless. I don't know if this kind of sandhi is actually normative, but I would seldom say „без“ devoiced like this in a normal setting, so I assume this is something that may need some work. @Chernorizets your thoughts on this? If this is indeed a concern then perhaps we can handle re-voicing consonants over word boundaries? Kiril kovachev (talkcontribs) 19:35, 18 August 2023 (UTC)Reply
@Kiril kovachev [bez vrɛmɛ] is definitely correct, because the words are pronounced together (= without a pause in-between), which triggers just the ordinary voicing/devoicing rules for individual words. I have to admit my knowledge is a bit patchy here - I'd check what the textbook we used for syllabification says about this. It has a large chapter with sub-chapters dedicated to Bulgarian phonology. Chernorizets (talk) 20:58, 18 August 2023 (UTC)Reply
@Chernorizets Right, I shall have to look at it as well, at some point... Good to know anyway that there's something more to work on! Kiril kovachev (talkcontribs) 21:21, 18 August 2023 (UTC)Reply
@Kiril kovachev great, thanks for pushing the change and making the fixes! As for "с" and "в", IMO the easiest thing to do is write them together with the following word within {{bg-IPA}}, since that will take care of voicing/devoicing, while also reflecting the fact that those words are pronounced together with the following word. Chernorizets (talk) 20:49, 18 August 2023 (UTC)Reply
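To illustrate why writing the clitic together with the following word sidesteps the problem, here is a deliberately simplified Python sketch. It models only word-final devoicing on the Cyrillic spelling; the real module works on IPA and also handles assimilation inside clusters, and the names `DEVOICE`, `devoice_final` and `devoice_phrase` are assumptions made for this example.

```python
# Voiced obstruents and their voiceless counterparts (simplified inventory).
DEVOICE = {"б": "п", "в": "ф", "г": "к", "д": "т", "ж": "ш", "з": "с"}

def devoice_final(word: str) -> str:
    """Devoice a word-final voiced obstruent; leave everything else alone."""
    if word and word[-1] in DEVOICE:
        return word[:-1] + DEVOICE[word[-1]]
    return word

def devoice_phrase(phrase: str) -> str:
    """Apply word-final devoicing to each space-separated word separately."""
    return " ".join(devoice_final(w) for w in phrase.split())
```

Under this toy rule, `devoice_phrase("без време")` devoices the з because без is treated as its own word, while `devoice_phrase("безвреме")` leaves it voiced because the consonant is no longer word-final.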
@Chernorizets OK, that's reasonable, and probably less janky in fact, because it avoids needing a whole other rule for just this case. However, it would also require editors to be conscious of another mechanic that they need to consider; I'm feeling guilty about all the changes lately, I wonder how editors who've been absent a long time will fare if they were to come back right now :') still, I guess we ought to do what is most proper. @Benwing2 What do you think about this, i.e. whether to have в/с treatment as part of the code itself vs. manually merging them within the main argument of {{bg-IPA}}? Kiril kovachev (talkcontribs) 21:20, 18 August 2023 (UTC)Reply
@Kiril kovachev I believe we shouldn't feel guilty for making improvements, and we should definitely invest in updating the documentation of {{bg-IPA}} to reflect recent changes, with suitable examples. I think we should also realize that, even with great updated documentation, there will be people who won't have read it, and who could therefore introduce errors. That's just life.
As for what happens in code vs. not - IMHO transcribing text is more complex than transcribing individual words, because of various factors such as the ones we've discussed here. I don't think it's worth it making the pronunciation module unnecessarily complex to handle all of those factors. Where there is a simple alternative that takes advantage of existing logic - such as the case with "с" and "в" - I'd much rather we do that, and add it to the documentation. Chernorizets (talk) 21:45, 18 August 2023 (UTC)Reply
@Chernorizets Quite right, I was mostly joking with that point of mine — but yes we really ought to document that all under the template documentation, I totally forgot about that side of things. For now, I've updated it with what I could think of, although actually there isn't all that much different — unless I've missed something out.
And I also agree with the aspect of keeping things simple, it's just a matter of where we draw the line: if we really insisted on transcribing whole sentences, I'm sure we could find a way that works without being ultra bloated, but still covering whatever edge cases we find. Kiril kovachev (talkcontribs) 22:23, 18 August 2023 (UTC)Reply
@Kiril kovachev as you've pointed out, not all monosyllabic words that are part of multi-word terms are subject to vowel reduction, even in the absence of explicit stress - one example being "хляб" in пълнозърнест хляб. So in theory we could optimize this case in code, but I think we shouldn't. IMO it's an easier and more consistent rule to remember to just put stresses in the phrase wherever speakers would naturally put those stresses.

Consider what the optimization might look like - presumably, we'd only want reduction for some pronouns (e.g. "го", "я") and some conjunctions (e.g. "за", "да", "на", "а", "ама", "ала" etc). But there are phrases where you'd want to put a stress on them, especially if they have other meanings besides pronoun or conjunction - "я́ ела́ ту́ка", "да́, ама не́", "а́ма ха́", "а́з съм за́". Chernorizets (talk) 10:56, 19 August 2023 (UTC)Reply
In other words, if I were to write the documentation for this, it would look like:
  • monosyllabic words require no stress mark. Polysyllabic words do.
  • multi-word terms (including phrases) require a stress mark on each word that is pronounced stressed
    • compound words (with hyphens or spaces), in particular, are multi-word terms with a stress on each word
Chernorizets (talk) 11:08, 19 August 2023 (UTC)Reply
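The first of the documentation rules above lends itself to an automatic check. A minimal Python sketch, assuming NFC input as MediaWiki provides it and a plain list of Bulgarian vowel letters; the function name `words_missing_stress` is hypothetical and nothing like it exists in the module as far as this thread shows.

```python
ACUTE = "\u0301"           # combining acute accent used as the stress mark
VOWELS = set("аеиоуъюя")   # vowel letters; й is a consonant and is not counted

def words_missing_stress(term: str) -> list:
    """Return the polysyllabic words in a term that carry no stress mark.

    Hyphens are treated like spaces, so each part of a compound is checked
    on its own. Monosyllabic words are never reported, since they need no mark.
    """
    missing = []
    for word in term.replace("-", " ").split():
        lowered = word.lower()
        syllables = sum(1 for ch in lowered if ch in VOWELS)
        if syllables > 1 and ACUTE not in lowered:
            missing.append(word)
    return missing
```

This only covers the polysyllabic rule. Whether a monosyllabic word inside a phrase should be stressed is a judgement about how the phrase is spoken, so that part still has to be decided by the editor.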
@Chernorizets Yeah, what you said is true, it would be possible to choose specifically those words to get reduced, and no other monosyllabic ones; specifically regarding the stress issue, since the default would be to reduce them, wouldn't it be possible to just require users to put the stress there in the first place? This is in line with what we're already doing by requiring it for all monosyllabic-terms-as-part-of-a-multiword-term anyway, but keeping it down to a few specific words would mean users don't have to worry about the vast majority of cases.
However, do we principally want a stress on words like хляб in пълнозърнест хляб? I guess it's stressed as part of the word, so we'd simply want it to be stressed. In this case the above point is mostly nullified. Your documentation is good, if we're happy with that then we may do well to port it over to {{bg-IPA}} as well. Kiril kovachev (talkcontribs) 11:45, 19 August 2023 (UTC)Reply

Can we archive some of the longer discussions?[edit]

It has become very cumbersome to scroll this page :-) Chernorizets (talk) 00:53, 21 August 2023 (UTC)Reply

@Chernorizets Normally we don't do this unless the page gets much longer, in which case we archive everything before a certain date. If you are having issues scrolling, there's a table of contents at the beginning; you can just jump to the beginning and select the appropriate thread. Benwing2 (talk) 02:25, 21 August 2023 (UTC)Reply
@Benwing2 sounds good; I thought I'd seen it on another talk page. Chernorizets (talk) 02:29, 21 August 2023 (UTC)Reply

Automatic rhyme generation[edit]

@Benwing2, @Chernorizets Hello, both, I am currently experimenting with a program to generate lots of Bulgarian rhymes using a wordlist (from rechnik.chitanka.info) and some automatic detection of the rhyme suffix from the IPA.

We discussed briefly this concept before, but I'd like to revisit it now, as if we decide it's a good idea, it would be possible to roll out massive amounts of rhymes and expand a lot of entries this way.

I am currently taking the following approach to generating the rhymes:

  1. Take a list of stressed words, in my case from Chitanka. If there's a bigger word list, with stresses included, that would be even better.
  2. For each word, generate its IPA.
  3. Calculate what the suffix should be by taking the last vowel onwards.
  4. Group all words with the same rhyme in a list, with the rhyme serving as a key in a dictionary.
  5. Remove any rhyme groups with less than 3 entries: we would otherwise have thousands of rhymes that only contain 1 word. We could have the threshold be 2, that way 2 words that only rhyme with each other could still be represented.
  6. (Hypothetically) Generate a rhyme page based on this: generate the headings, count syllables for each term, and group them under the correct syllable count. Potentially, append the {{rhymes}} template to the relevant entry.
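Steps 4 and 5 of the list above amount to a dictionary grouping followed by a size filter. A minimal Python sketch, assuming the rhyme of each word has already been computed elsewhere; `group_by_rhyme` and its `min_size` parameter are names chosen for this illustration, and the sample pairs in the note below are only examples.

```python
from collections import defaultdict

def group_by_rhyme(word_rhymes, min_size=3):
    """Group words by rhyme and drop groups smaller than min_size.

    word_rhymes is an iterable of (word, rhyme) pairs. The result maps
    each surviving rhyme to an alphabetically sorted list of its words.
    """
    groups = defaultdict(list)
    for word, rhyme in word_rhymes:
        groups[rhyme].append(word)
    return {
        rhyme: sorted(words)
        for rhyme, words in groups.items()
        if len(words) >= min_size
    }
```

For instance, `group_by_rhyme([("фосфат", "at"), ("признат", "at"), ("чат", "at"), ("човек", "ɛk")])` keeps only the "at" group, while passing `min_size=2` or lower changes which groups survive.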

The same principle of calculating the rhyme suffix should also be pertinent to {{bg-pr}}'s future function, so I'd like to consider any problems that might arise from doing this:

  1. According to Bulgarian Wikipedia, it is not enough for terms to have the same final stressed vowel to rhyme. It suggests that the preceding consonants also need to match, e.g. гори́ (gorí) and зори́ (zorí) do rhyme, but страна́ (straná) and зора́ (zorá) do not. I actually take issue with this definition, because by this definition да (da) and Ра (Ra) do not rhyme. It seems no different to saying that he and she don't rhyme, which I definitely think do. It's my belief we should just stick to taking the suffix only, following the vowel, so that the above two do rhyme, and страна́ (straná) and зора́ (zorá) also rhyme.
  2. Terms that have more than one word, in my opinion, could still be kept, just still considering their last stressed syllable onward. E.g., ча́т-па́т (čát-pát) should rhyme with призна́т (priznát). However, I think it may be practice to not do this: @Benwing2 can you comment on this with regard to {{es-pr}} and its treatment of it?
  3. Some words have advanced/fronted articulation, e.g. шат (šat)IPA(key): [ʃa̟t], обя́д (objád)IPA(key): [oˈbʲa̟t], but фосфа́т (fosfát)IPA(key): [fosˈfat]. I assume these should be kept as separate rhymes, so that фосфа́т (fosfát) does not rhyme with the other two, but I'm asking the experts because I'm a little uncertain here :')
  4. Finally, just a problem due to unpredictable pronunciations, terms that are meant to have |endschwa=1 are indistinguishable from those that aren't (in the word list context), so I guess those that could be affected (the ones with word-final stressed а́(т)/я́(т)) should just be ignored for the rhyme generation process. In practice, in {{bg-pr}}, the user would specify |endschwa=, so in that case we could still generate the rhyme.

Please let me know of any flaws in the conception of this idea, and any improvements to be made. Thanks very much, - Kiril

P.S. You can see the current generated rhymes at GitHub Kiril kovachev (talkcontribs) 20:13, 16 December 2023 (UTC)Reply

@Kiril kovachev For #1, can't you just use the existing words in CAT:Bulgarian lemmas, which should all have stress as well as a setting for |endschwa=? Other than that, I have heard that in Russian also, words ending in a stressed vowel need the preceding consonant to rhyme. Maybe User:Atitarev can comment on Russian rhymes, but in general each language has its own rhyme structure so it's something you either just have to know or can verify by looking at poetry and songs. I would also think this rule doesn't apply to single-syllable words, so да and Ра would still rhyme. In regard to your other questions:
  1. {{es-pr}} does not generate rhymes for multiword terms. Possibly you can make an exception for hyphenated terms like ча́т-па́т but in general I don't think the rhyme tables should contain such phrases as there's an indefinite number of them.
  2. For the fronted articulation, I would guess actually that this shouldn't be considered, i.e. that обя́д and фосфа́т should rhyme, but I'm not sure; again it depends on existing practice.
  3. As for not having rhyme groups with only one or maybe two entries, this is tricky (maybe impossible) to implement automatically within Lua, so {{es-pr}} doesn't try to enforce this. The result is we do have a lot of small rhyme groups, but I think that's OK. BTW a similar issue comes up in coinages. User:Metaknowledge (who may be inactive now) asked me in 2020 to write a bot script to remove coinage-by-person categories that have only one entry in them and add a param for these entries to ensure they don't re-create the category (see User talk:Benwing2/2020-2021#Category:Coinages by language). We could do something like this although it would depend on running that script regularly, meaning that properly speaking it should be set up on Toolforge or something like that where it can be cronned automatically to run periodically. So it might not be worth it to do this.
Benwing2 (talk) 21:37, 16 December 2023 (UTC)Reply
@Benwing2 You make a good point about reading in the |endschwa= param from entries, but this suffers from the problem of having to access the entries themselves. That's not an issue per se, just slower than using offline methods: I don't know how to navigate the dumps very easily, and the kakki.org information does not include the parameter, sadly... I'd consider just filtering the lists for -а(т) and -я(т) myself, to be honest.
Discussing multi-word terms, you're right there could be many terms with many words, but don't the criteria for inclusion make that problem go away? Only the multi-word terms that merit inclusion would be able to get a pronunciation (/rhyme), right?
Thanks for your feedback on the other two points: for fronted, I also don't really know if it's distinguished in actual practice, but all things equal it might be better to separate them, just because users can merge rhyme pages/categories themselves if they're made separate, but they can't as easily separate them if terms with different IPAs are grouped under the same rhyme. But, it's also not a big deal and it could even be more hurtful for discovery to have two nearly identical-looking rhyme pages for marginally different pronunciations. Also, the point of keeping rhyme groups interesting (>= 3 entries) I only meant for the generation process that I ran locally, and indeed I also don't think it would be a good idea for automatic templating to do that. Kiril kovachev (talkcontribs) 23:08, 16 December 2023 (UTC)Reply
@Kiril kovachev I guess I'm not sure why you're doing offline rhyme generation; is it just for exploration/experimentation purposes? BTW if you want code to parse MediaWiki dumps, see my bot code here: [1] This function uses the Python SAX library to parse a dump sequentially. The first param is the file handle (usually sys.stdin) and the second param is a callback function that's called with three arguments (an index of the page, the page title and the text of the page). The dumps are on dumps.wikimedia.org, e.g. the latest enwiktionary dump file is here: [2] Note that each "dump" actually comes in many versions; you generally want the boldfaced one, which is labeled "Articles, templates, media/file descriptions, and primary meta-pages". As for multiword terms, the problem is that there may be a large number of them esp. in mature languages with lots of lemmas (e.g. Spanish, Italian, Portuguese, French, Russian), and including all of them really clutters up the categories. Benwing2 (talk) 23:29, 16 December 2023 (UTC)Reply
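For readers who want to try the sequential approach described above without the linked bot code, here is a rough Python sketch using the standard `xml.sax` module. It is not Benwing2's function: `PageHandler` and `parse_dump` are names invented here, and only the callback shape (page index, title, text) follows the description in this thread.

```python
import io
import xml.sax

class PageHandler(xml.sax.ContentHandler):
    """Collect <title> and <text> for each <page> and hand them to a callback."""

    def __init__(self, callback):
        super().__init__()
        self.callback = callback
        self.index = 0
        self.current = None  # element currently being captured, if any
        self.title = []
        self.text = []

    def startElement(self, name, attrs):
        if name == "page":
            self.title, self.text = [], []
        elif name in ("title", "text"):
            self.current = name

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate.
        if self.current == "title":
            self.title.append(content)
        elif self.current == "text":
            self.text.append(content)

    def endElement(self, name):
        if name in ("title", "text"):
            self.current = None
        elif name == "page":
            self.callback(self.index, "".join(self.title), "".join(self.text))
            self.index += 1

def parse_dump(stream, callback):
    """Parse a dump from a file-like object, calling callback once per page."""
    xml.sax.parse(stream, PageHandler(callback))

# Tiny stand-in for a real dump, just to show the callback being invoked.
sample = b"<mediawiki><page><title>den</title><revision><text>abc</text></revision></page></mediawiki>"
pages = []
parse_dump(io.BytesIO(sample), lambda i, title, text: pages.append((i, title, text)))
```

Since the parser streams the input, the whole dump never has to be held in memory, and reading it from a pipe avoids keeping a decompressed copy on disk.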
@Benwing2 The reason was because I was not considering that the rhymes would already be provided by categories, which led me to believe it would be necessary to generate rhyme pages and manually link these to entries using the {{rhyme}} template. Even if we were to automate the rhyme calculation using a template, that rhyme link would still go to a rhyme page, which AFAIK would still need to be individually edited; so wouldn't we still need a way to populate that rhyme page, even once pages are already categorised? That was my idea, basically. And doing this via bot, i.e. mostly offline, seemed like the only way to edit all these rhyme pages. I'm not sure what we should do given these discussions, now, though. What do you think...?
And, thanks very much for your advice on processing the dump. I'm actually in a very sorry situation, in that my system partition is so full I'm not sure whether the temporary file that stores the dump as it's downloading can even fit into my temp directory — that's also an even bigger problem than setting up the parsing! But many thanks for explaining your process for this.
Also, that multiword problem makes sense now. It's not been an issue in Bulgarian yet because we still need more basic lemmas, I guess, so we've not thought about so many multiwords :) Kiril kovachev (talkcontribs) 23:42, 16 December 2023 (UTC)Reply
@Kiril kovachev Yeah my old laptop with its 500GB hard drive is almost completely full too. I have about 10GB of space and the dump file is about 1GB, so I can download and process it, but it's still way too close for comfort, esp. if the system ever starts swapping. As for the rhyme pages, there are currently two sets of pages: the rhyme category pages, which are completely automatic, and the Rhymes:... namespace pages like Rhymes:Bulgarian/na, which are still populated manually. Originally we only had the latter; User:Surjection introduced the rhyme category pages a couple of years ago. The original plan was to deprecate the Rhymes namespace pages, and Surjection laid out a process for doing that in the Beer Parlour, but it never ended up happening. So potentially we could use an offline process to help maintain those Rhymes namespace pages but I'd rather devote that work to getting rid of them, since having the same info duplicated in two places is non-ideal. Benwing2 (talk) 00:10, 17 December 2023 (UTC)Reply
@Benwing2 Ah, right, I didn't know there was such an initiative; I'm in agreement we should just use rhyme categories, then, since it seems like they can still provide the same information, and automatically at that. Thanks for informing me on this ^^ Kiril kovachev (talkcontribs) 17:05, 17 December 2023 (UTC)Reply
@Kiril kovachev here are my thoughts, based on having done some work on Bulgarian rhymes as well as looking at how other languages handle rhymes:
  • rhyme pages, such as the ones I've created, are IMO only really necessary if the language's pronunciation module doesn't generate the {{rhymes}} line automatically. {{rhymes}} creates a rhyme-specific category, with sub-categories based on the number of syllables, so that users can navigate to the category to see all words that rhyme. My personal desire would be for us to move towards automatic rhyme/syllable count detection, and we can use our syllabification and IPA facilities to do exactly that.
  • there are different types of rhyme. What we're currently using on Wiktionary for Bulgarian, and what the Bulgarian Wikipedia article also talks about, are the so-called masculine rhymes - a type of perfect rhyme. In short, such rhymes can't just contain a vowel and no consonants, except in a small handful of cases - see Rhymes:Bulgarian. Rhymes between e.g. тъма (tǎma) and война (vojna) would be considered "imperfect", which is why today we have Rhymes:Bulgarian/ma and Rhymes:Bulgarian/na. Personally, I'd like to stick with perfect rhymes. I don't see the short monosyllabic case of e.g. да/Ра as being a significant counterweight, since in actual poetry I wouldn't expect them to be the main carriers of the rhyme by themselves often. I'd expect something more like "... каза 'да' / ... без вода". If you feel strongly about this, though, feel free to propose the types of perfect and/or imperfect rhymes we should have for Bulgarian, and let's reflect that in Rhymes:Bulgarian.
  • advanced articulation doesn't produce a new vowel, so I've been treating those variants as part of the same rhyme - see Rhymes:Bulgarian/ar. We can change our mind on this - I don't have strong feelings about it - but I'm not sure if it's a valid distinction in Bulgarian poetry.
  • the |endschwa= issue is IMO one more reason to think about handling this as part of generating the pronunciation of words, rather than as a task to create rhyme pages. Ultimately the rhyme is decided by the shape of the final syllable, which includes other phonological processes such as word-final devoicing - e.g. мраз (mraz) rhymes with нас (nas).
  • in a world where rhymes are generated automatically by our pronunciation template, I'm not sure we need to worry about rhyme categories with too few members. Knowing that a word has no rhymes can be useful information, e.g. in poetry or song-writing (an infamous case in English being orange).
Chernorizets (talk) 21:43, 16 December 2023 (UTC)Reply
@Chernorizets I agree with everything you've written. Benwing2 (talk) 21:45, 16 December 2023 (UTC)Reply
A more tantalizing possibility for me is extending rhyme support to non-lemma forms. Bulgarian is significantly more inflected than English, and real Bulgarian poetry makes heavy use of non-lemma forms for rhyming.
Case in point - the opening stanza of a famous Dimcho Debelyanov poem:
Да се завъ́рнеш в ба́щината къ́ща
кога́то вечерта́ смире́но га́сне,
и ти́хи па́зви ти́ха нощ разгръ́ща
да приласка́е скръ́бни и неща́стни.
Only "къща" is in its lemma form. Chernorizets (talk) 22:01, 16 December 2023 (UTC)Reply
@Chernorizets Yes, we do that for Spanish and Italian when the pronunciation of non-lemma forms is available, and we could do it for Russian. I actually have written a script for Russian specifically that propagates the lemma pronunciation information to the non-lemma forms, including irregularities such as unexpected non-palatal consonants before written Cyrillic е and the proper pronunciation of geminates. For Bulgarian I'm not sure if this is necessary; maybe it's enough to just generate a pronunciation entry directly from the inflected form, given knowledge of the stress and what the inflection is (so that |endschwa= can be set correctly). Benwing2 (talk) 22:08, 16 December 2023 (UTC)Reply
@Benwing2 yeah for Bulgarian it's simpler. Outside of the lemma form, |endschwa= only plays a part in the 3ps sg plural, and it already does the right thing there - see лежат (ležat). Chernorizets (talk) 00:48, 17 December 2023 (UTC)Reply
That was supposed to be 3rd person plural :-) Chernorizets (talk) 00:49, 17 December 2023 (UTC)Reply
@Chernorizets Okay, I'm convinced by your argument for perfect rhymes. There are a number of things I didn't consider when I started writing here, such as forgetting rhyme categories existed, as well as that we can still deal with rhymes in stressed vowels using things like Rhymes:Bulgarian/ma and Rhymes:Bulgarian/ba like you posted above. This alleviates my main function concern about that kind of rule, which was the worry that it couldn't be integrated with the rhyme format on Wiktionary. I don't know why it didn't occur to me that including the consonant would also be possible, lol. Now in the case of monosyllabic words, I guess they can still retain the vowel-only rhyme suffix.
About the advanced articulation point, if we've already gone a certain way, then I support sticking to it. I doubt this particular difference is too big anyway.
If we want to directly integrate the rhyme capability into our pronunciation system, what actual technical problems do we have in doing this? As part of this generation process, I only really needed to make use of a small inventory of functions to get the job done:
  • If we want to count syllables, either we can use the syllabification function and count the syllable fence posts and add 1, or just count the vowels of the original word — is it possible for a syllable to exist in Bulgarian without a vowel? (In the current IPA module, we're using this assumption that 1 vowel = 1 syllable, e.g. in the reduction logic for monosyllabic words in the toIPA function.)
  • In order to generate the rhymes, can we just find the final vowel, calculate the suffix, and if the suffix is empty, also consider preceding consonants to determine the overall rhyme category? We would just need to delete the advanced articulation mark in this case. We can directly use the IPA primary stress mark as the position from which to reckon the rest of the suffix, I think.
For the record, my (pretty rudimentary) initial code for calculating rhymes from IPA was just this:
import re

# PRIMARY (the IPA primary stress mark) and vowels_c (a regex character class
# matching vowels) are not defined in this snippet.
def get_rhyme_ipa(ipa: str) -> str:
    stress_index = ipa.rindex(PRIMARY)
    rhyme_start_index = stress_index
    while rhyme_start_index < len(ipa) and not re.match(vowels_c, ipa[rhyme_start_index]):
        rhyme_start_index += 1  # Skip the stress mark and consonants until a vowel is found
    return ipa[rhyme_start_index:]
It would need some adaptation for fixing the vowel-final rhymes etc., but that would be minor work, I think. But maybe it's not this simple.
Thanks very much for your detailed responses, Kiril kovachev (talkcontribs) 23:29, 16 December 2023 (UTC)Reply
@Kiril kovachev Generating rhymes is indeed quite simple in things like {{es-pr}} and {{it-pr}}; the hyphenation/syllabification is the more involved process as it needs to work off of the original spelling, which may differ in some particulars from the respelling (what I do is not generate the hyphenation/syllabification at all if they're too different). There's also some extra work in implementing a user-customizable rhyme parameter (although it's rare that you need to use it) and displaying the rhymes under the pronunciation (especially if the template supports multiple respellings and multiple dialects, as {{es-pr}} does). But all in all it's not really that complex. Benwing2 (talk) 23:33, 16 December 2023 (UTC)Reply
@Benwing2 Okay, that's reassuring. Perhaps these days I'll make an effort to see what I can make in Lua for this. If we made the "rhyme" parameter an attribute of the passed-in spelling itself (like чове́к<rhyme:ɛk>) or whatever, and just fill this in internally if it's not passed in, would that be an adequate fix?
Also it's very fortunate for us that Bulgarian rarely deviates in any major way from the spelling. I think the only truly unpredictable thing we deal with here is |endschwa= like we talked about above... so syllabification somehow managed to not be too convoluted! Kiril kovachev (talkcontribs) 23:48, 16 December 2023 (UTC)Reply
@Kiril kovachev Yes, that's exactly what I would do. Benwing2 (talk) 00:04, 17 December 2023 (UTC)Reply
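The inline-modifier idea above (чове́к<rhyme:ɛk>) could be parsed roughly as follows. This is only a sketch in Python rather than Lua; the function name split_rhyme_override and the regular expression are hypothetical and are not taken from any existing module.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for an inline <rhyme:...> modifier attached to a spelling.
RHYME_MODIFIER = re.compile(r"<rhyme:([^<>]*)>")

def split_rhyme_override(arg: str) -> Tuple[str, Optional[str]]:
    """Split 'spelling<rhyme:X>' into (spelling, X); X is None when no modifier is present."""
    match = RHYME_MODIFIER.search(arg)
    if match is None:
        return arg, None
    # Remove the modifier from the spelling and return the override separately.
    spelling = arg[:match.start()] + arg[match.end():]
    return spelling, match.group(1)

print(split_rhyme_override("чове́к<rhyme:ɛk>"))  # ('чове́к', 'ɛk')
print(split_rhyme_override("чове́к"))            # ('чове́к', None)
```

When the second element is None, the caller would fall back to computing the rhyme from the IPA.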
@Kiril kovachev we shouldn't need any extra rhyme parameters - we can compute the rhyme directly from the IPA representation and the position of the stress. The only modification your code needs is in the case the vowel is word-final - then, unless it's preceded by another vowel (as in аншоа (anšoa)), you just include the preceding consonant (which could be palatalized, as in летя (letja)).
|endschwa= already tells the IPA logic to change an /a/ to a schwa, so the final IPA representation would already have that. The parameter also works for the 3rd ps. plural, which is the other place with a schwa, as in лежат (ležat). So, all in all, I don't see a need for a rhyme parameter.
To your other question - yes, the number of syllables in a Bulgarian word is equal to the number of vowels. Unlike Serbo-Croatian, we don't have syllabic liquids, so each syllable has a full vowel. Very few languages in the world have syllables without vowels or syllabic sonorants. Chernorizets (talk) 00:58, 17 December 2023 (UTC)Reply
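Since one vowel equals one syllable, the syllable count can be sketched as a plain vowel count over the spelling. The vowel set below is an assumption written out for illustration, not copied from the module; the text is normalized to NFC first so that й is never mistaken for и plus a combining breve.

```python
import unicodedata

# Assumed set of Bulgarian vowel letters (including ѝ); illustrative only.
BG_VOWELS = set("аеиоуъюяѝ")

def count_syllables(word: str) -> int:
    """Count syllables as the number of vowel letters in a Bulgarian spelling."""
    # NFC keeps й and ѝ as single letters; stress accents stay as combining marks
    # and are simply not counted.
    normalized = unicodedata.normalize("NFC", word.lower())
    return sum(1 for ch in normalized if ch in BG_VOWELS)

print(count_syllables("чове́к"))  # 2
print(count_syllables("аншоа"))   # 3
print(count_syllables("яйцо́"))   # 2
```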
@Benwing2 @Chernorizets: Module:User:Kiril kovachev/bg-pronunciation, I've had a go at making the rhyme logic. We may need some more test cases, but so far it looks like it's working (although some expectations might be wrong, I don't know).
I have two more questions re:this at this time: do the single-syllable rhymes end in their vowel, e.g. да (da) → -a? And should we make {{bg-IPA}} categorize terms by their number of syllables, like how Macedonian човек (čovek) is listed under Category:Macedonian 2-syllable words? Thanks for any criticisms, Kiril kovachev (talkcontribs) 21:25, 17 December 2023 (UTC)Reply
@Kiril kovachev Module:IPA also has support for categorizing by number of syllables. You just need to add Bulgarian to langs_to_generate_syllable_count_categories in Module:IPA/data (or to diphthongs if there are diphthongs that are transcribed with vowel symbols, but there don't appear to be). Benwing2 (talk) 21:46, 17 December 2023 (UTC)Reply
@Benwing2 I just need to add it to that table? I would do it right now, but unfortunately I need more permissions, I think. Kiril kovachev (talkcontribs) 21:58, 17 December 2023 (UTC)Reply
@Kiril kovachev I downgraded the protection to autopatroller, so you can edit it (although probably you and User:Chernorizets should be given template editor permission; if you make a request to the Beer parlour for this, I will support you). Benwing2 (talk) 22:11, 17 December 2023 (UTC)Reply
@Benwing2 Thanks again for supporting me. Based on our discussion below, we may well not edit it after all, but thanks for this. I've gone and made a request over there now. Kiril kovachev (talkcontribs) 20:57, 18 December 2023 (UTC)Reply
@Kiril kovachev I'll take a look as soon as I can. As to your question about whether we should have general categories like "Bulgarian N-syllable words", I'm torn. I've looked up what we need to do to enable this (essentially Ben's reply), but I'm not sure it's of value, and I think it introduces clutter in the category list.
Suppose we move to automatically adding rhymes on all BG entries, and suppose we also add the support for classifying BG words by syllable count. Then, for each BG word, we'd be adding these three categories:
  • Rhymes:Bulgarian/foo
  • Rhymes:Bulgarian/foo/N syllables
  • Bulgarian N-syllable words
That's on top of any topic categories we may have added, and on top of any POS-related categories e.g. "Bulgarian lemmas", "Bulgarian nouns", "Bulgarian feminine nouns". It's already quite a lot, and if the word is also present in several other Slavic languages, you end up with a huge list of categories at the bottom of the page.
TL;DR - I'm not going to oppose it if you want to add support for classifying BG words by syllable count, but I personally don't see the utility. Chernorizets (talk) 23:05, 17 December 2023 (UTC)Reply
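For concreteness, the three categories listed above could be generated like this. The function name rhyme_categories is hypothetical, and the singular form "1 syllable" for monosyllables is an assumption about the naming pattern.

```python
from typing import List

def rhyme_categories(rhyme: str, syllables: int, lang: str = "Bulgarian") -> List[str]:
    """Build the three category names discussed above for one word."""
    # Assumption: rhyme subpages use "1 syllable" in the singular.
    unit = "syllable" if syllables == 1 else "syllables"
    return [
        f"Rhymes:{lang}/{rhyme}",
        f"Rhymes:{lang}/{rhyme}/{syllables} {unit}",
        f"{lang} {syllables}-syllable words",
    ]

for name in rhyme_categories("ɛk", 2):
    print(name)
# Rhymes:Bulgarian/ɛk
# Rhymes:Bulgarian/ɛk/2 syllables
# Bulgarian 2-syllable words
```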
@Chernorizets You're right about that. I don't know how much trouble having lots of categories would be, but if the rhyme already subsumes the syllable count, there wouldn't necessarily be a need for a separate category... The only exception I can see is words for which a rhyme isn't specified for some reason, but I think those would in general not be "pronounced" like words, so syllable counts would be irrelevant for them anyway; think итн. for example. Kiril kovachev (talkcontribs) 20:50, 18 December 2023 (UTC)Reply
On reflection, I would point out that this does mean it becomes impossible to access words by syllable count in general: that information gets locked behind a specific rhyme. I don't know either how "useful" per se the information would be, but then I don't know who in general would use syllable categories. They exist for other languages, so I presume it's believed to be useful, at least by someone. I think to a degree it's not too big of a problem to introduce the new category: it's just one in a list that can get pretty big out of necessity anyway; just check out any big entry with lots of languages, or even any Chinese entry like , where there are already dozens of categories just from the 1 L2 header. Moreover, since the categories are alphabetized down there, and they're much more useful when used from the reverse direction (searching the category itself, rather than looking for the category name on an entry that's a part of it), the difficulty of navigating categories at the bottom should be much less than whatever benefit being able to search the category gives. Kiril kovachev (talkcontribs) 21:03, 18 December 2023 (UTC)Reply
@Kiril kovachev if you think it would be useful to have syllable count categories for Bulgarian, go for it. I'm not sure what value they add, but maybe they do and I just don't see it. Fun fact - a user who was randomly creating categories in various languages also created a couple for Bulgarian N-syllable words, and I had to explain to them that we didn't have that support yet :-) Chernorizets (talk) 22:29, 18 December 2023 (UTC)Reply
@Chernorizets I don't really know, to be honest. I think I'll ask in different places and see what others think about this. Also interesting fact :) Kiril kovachev (talkcontribs) 00:32, 19 December 2023 (UTC)Reply
@Kiril kovachev I took a look at your get_rhymes function and the test cases for it, and I have a couple of observations:
  • word-final vowel stress
    • the rhyme of народността is just /ta/, not /sta/ - there might be multiple consonants after the primary stress due to how we shift it, but the rhyme only contains the preceding consonant. Same issue with [izˈmra] (which btw should have |endschwa=). измра rhymes with дера and пера.
    • make sure the logic can handle words like аншоа and буржоа, which don't have any consonants before the word-final stressed vowel. In both of those cases, the rhyme is just /a/. I'd add those as test cases.
  • monosyllabic words
    • if I'm not mistaken, for monosyllabic words that end in a vowel like да, не, спри, your logic would determine the rhyme to be the vowel itself, i.e. /a/, /e/, and /i/ respectively. Is that by design? I'd also add a few such test cases to your list.
  • make sure you add some test cases for words with palatalized consonants - летя, кипя, огняр, свят, etc. We treat the palatalization superscript as a consonant, so I want to make sure we're not including it in rhymes by itself. The examples I gave should produce /tʲa/, /pʲa/, /ar/, /at/.
Thanks,
Chernorizets (talk) 22:57, 18 December 2023 (UTC)Reply
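The rules in the list above can be sketched as follows. This is a Python illustration, not the actual get_rhymes code: the vowel inventory is assumed, the sample transcriptions are simplified guesses at the module's output, monosyllables get no special treatment, and affricates written with a tie bar or other diacritics are not handled.

```python
PRIMARY = "ˈ"             # IPA primary stress mark
PALATAL = "ʲ"             # palatalization superscript
VOWELS = set("aɛiɔuɤɐo")  # assumed vowel symbols, not copied from the module

def get_rhyme(ipa: str) -> str:
    """Return the rhyme of an IPA transcription that contains a primary stress mark."""
    # The stressed vowel is the first vowel after the last primary stress mark.
    i = ipa.rindex(PRIMARY) + 1
    while i < len(ipa) and ipa[i] not in VOWELS:
        i += 1
    if i == len(ipa):
        raise ValueError(f"no vowel after the stress mark in {ipa!r}")
    rhyme = ipa[i:]
    if len(rhyme) > 1:
        return rhyme  # stressed vowel is not word-final: vowel plus everything after it
    # Word-final stressed vowel: prepend only the single preceding consonant,
    # together with its palatalization mark if it has one.
    head = ipa[:i].replace(PRIMARY, "")
    k = len(head)
    if k > 0 and head[k - 1] == PALATAL:
        k -= 1
    if k == 0 or head[k - 1] in VOWELS:
        return rhyme  # nothing, or another vowel, precedes (as in аншоа)
    return head[k - 1:] + rhyme

print(get_rhyme("nɐrodnoˈsta"))  # ta
print(get_rhyme("ɐnʃoˈa"))       # a
print(get_rhyme("lɛˈtʲa"))       # tʲa
print(get_rhyme("oɡˈnʲar"))      # ar
```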
@Chernorizets Thanks, this is the kind of thing I needed clarification on. The inclusion of all those consonants was by design, because I didn't know whether it's supposed to just be the one directly before, or all of them, but luckily that's an easy fix. I believe that same logic does work for аншоа, but indeed I should make a case for it.
For monosyllabic words, indeed that's deliberate, but I actually wondered what you think it should be. In my eyes it makes sense for the rhyme to be in the vowel, so да should also technically rhyme with буржоа.
And for palatalized words: to make sure I'm getting this right: the palatalizer should still be included in the rhyme, but it shouldn't be at the start? Either way, I'll put some test cases tomorrow. Kiril kovachev (talkcontribs) 00:39, 19 December 2023 (UTC)Reply
@Kiril kovachev a couple of thoughts:
The assignment of words to rhymes should be non-contradictory. E.g. if you believe that "сне" rhymes with "пране" (as I do), then the rhyme is /ne/. But "сне" also obviously rhymes with "не", so we have three choices:
  • treat monosyllabic and polysyllabic words the same w.r.t. how the rhyme is determined, in which case all three words rhyme with /ne/
  • for monosyllabic words specifically, take advantage of the fact that {{rhymes}} can take multiple rhyme parameters, so words like "сне" and "не" have both /e/ and /ne/ as rhymes. That way, you could claim that "не" rhymes with "пране" but also with "кое" without contradictions.
  • for monosyllabic words, define the word-final vowel to be the sole rhyme. So "не" rhymes with "спре" and "сне" via /e/, but "спре" doesn't rhyme with "оре" and "сне" doesn't rhyme with "отне".
I personally prefer the first approach because it seems cleaner to me, but I'd accept the second one as well. The third one IMO just goes against common sense and would be wrong.
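To make the difference between the three options concrete, here is a small sketch that assigns rhyme keys under each approach and checks which word pairs share one. The data layout (preceding consonant, final stressed vowel, monosyllabic flag) and all names are hypothetical, and the vowel is written ɛ as an assumption about the IPA the module emits.

```python
from typing import Set

def rhyme_keys(consonant: str, vowel: str, monosyllabic: bool, approach: int) -> Set[str]:
    """Rhyme keys of a word ending in a stressed vowel, under approach 1, 2 or 3."""
    if approach not in (1, 2, 3):
        raise ValueError("approach must be 1, 2 or 3")
    full = consonant + vowel
    if approach == 1 or not monosyllabic:
        return {full}         # same rule for every word
    if approach == 2:
        return {full, vowel}  # monosyllables get both rhymes
    return {vowel}            # approach 3: monosyllables get the bare vowel only

# (preceding consonant, final stressed vowel, monosyllabic?)
WORDS = {
    "не": ("n", "ɛ", True), "сне": ("n", "ɛ", True), "спре": ("r", "ɛ", True),
    "пране": ("n", "ɛ", False), "отне": ("n", "ɛ", False),
    "оре": ("r", "ɛ", False), "кое": ("", "ɛ", False),
}

def rhymes(a: str, b: str, approach: int) -> bool:
    """Two words rhyme if they share at least one rhyme key."""
    return bool(rhyme_keys(*WORDS[a], approach) & rhyme_keys(*WORDS[b], approach))

print(rhymes("сне", "пране", 1))  # True
print(rhymes("не", "кое", 2))     # True
print(rhymes("сне", "отне", 3))   # False
```

Under approach 1 the assignment stays consistent: не, сне and пране all share /nɛ/, while не and спре do not rhyme.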
As for the palatalization mark - the goal is to distinguish rhymes like /tʲa/ vs /ta/, as in цветя vs лета̀. In our current treatment of Bulgarian phonology, we subscribe to the traditional view that palatalized consonants are their own phonemes, so these two rhymes would be different. In recent years, the view that palatalized consonants are instead just consonant + "j" (no palatalization) has been gaining traction in academic circles, but until it becomes more widespread, I think we're justified in sticking with the status quo. In other words, /tʲa/ is a rhyme, and it's distinct from /ta/. Chernorizets (talk) 01:25, 19 December 2023 (UTC)Reply
@Chernorizets Yes, I agree with you that 1 would be the best. I don't want to claim anything that isn't certain, and I just surveyed my mother who also finds that не and спре don't rhyme :)
Following your recommendations, I've updated the logic and the test cases, which should now cover all the possibilities that we've discussed here; let me know what you think. The results for palatalized-consonant rhymes have been fixed: the logic used to detect the palatalization mark alone as the "consonant" of the rhyme when the final vowel was stressed, but now it checks for palatalization on the consonant and includes the whole palatalized consonant if there is one.
Also: if the palatalization were analyzed as /j/, would that change how rhymes need to be handled? Would the rhymes not still need to include the /j/ in that case? (If the rhyme of летя is /tʲa/, by the other analysis it would make sense for it to be /tja/, so in a way the palatal analysis makes more sense in that you don't need to make exceptions for whether the consonant is /j/ or not; if you consider /tʲ/ a single consonant, the logic of including the previous consonant in the rhyme evenly handles this already.)
Anyway, hope this is good — Kiril kovachev (talkcontribs) 20:44, 19 December 2023 (UTC)Reply
@Kiril kovachev let's not worry about the alternative treatment of palatalization for now. I only mentioned it because it might become relevant one day. Whether the rhyme in that case would technically be /tja/ or just /ja/ I'm not sure, although /tja/ makes more sense to me.
Regarding get_rhymes - is it just for prototype purposes, or do you intend it to be called as part of something else? Chernorizets (talk) 22:39, 19 December 2023 (UTC)Reply
@Chernorizets I intend for it to go into {{bg-pr}} when it's all done. Besides generating audio links, I believe this is the last component that that would need, actually. ^^ Kiril kovachev (talkcontribs) 23:07, 19 December 2023 (UTC)Reply

Respellings with ў - orthography vs respelling

@Kiril kovachev, @Benwing2, @Fenakhay: Hi,

Are you able to add the test case respelling param or make a substitution, so that e.g. "ўе́стърн" appears as "уе́стърн (uéstǎrn) (respelled ўе́стърн)"? Otherwise it appears as if [[ў]] is a Bulgarian letter. Anatoli T. (обсудить/вклад) 02:46, 22 December 2023 (UTC)Reply

Also @Chernorizets. Anatoli T. (обсудить/вклад) 02:47, 22 December 2023 (UTC)Reply
Hi @Atitarev, I'm not sure what exactly you're asking for? We recently (a few months ago) introduced the convention of using the Belarusian letter "ў" in {{bg-IPA}} as a way of conveying the /w/ sound - it exists in Bulgarian via loanwords, but doesn't have its own letter. That is also explained in the documentation of the template, where we make it clear that the letter is from a different alphabet. Chernorizets (talk) 04:10, 22 December 2023 (UTC)Reply
@Chernorizets: In test cases red-linked ўе́стърн links to a non-existent word. It should link to уе́стърн (uéstǎrn) and say "(respelled ўе́стърн)" similar to how some other words with respellings are used. Just search visually for some cases with "respelled" in Module:bg-pronunciation. Anatoli T. (обсудить/вклад) 04:16, 22 December 2023 (UTC)Reply
@Atitarev Ahh ok, you're talking specifically about test cases. We'll fix that. Chernorizets (talk) 04:24, 22 December 2023 (UTC)Reply
@Chernorizets: That's right, thanks. Notice also how Russian entries requiring respelling are displayed. Pls see ве́стерн (vɛ́stɛrn). Perhaps Bulgarian entries can use |ann=y, as in this revision (revert if inappropriate). Anatoli T. (обсудить/вклад) 04:32, 22 December 2023 (UTC)Reply
@Atitarev please no. The choice to use that letter is an internal implementation detail, and should not be shown in an annotation. This would be confusing to Bulgarian users, whether native or learners of the language. I've seen the way Russian entries requiring respelling are handled - for now, I don't see a need to introduce something like this for Bulgarian. With the exception of /w/, we don't have a lot of instances where we need to respell. I'll revert the edit. Chernorizets (talk) 04:45, 22 December 2023 (UTC)Reply
@Chernorizets: No problem, I suspected it might be the case. Anatoli T. (обсудить/вклад) 04:48, 22 December 2023 (UTC)Reply
@Atitarev mostly fixed the IPA testcases - respellings are now showing - except for one. I'll need to figure it out. Chernorizets (talk) 05:04, 22 December 2023 (UTC)Reply
@Chernorizets: Ah, "Ўо́руик"? I see, thanks. Anatoli T. (обсудить/вклад) 05:09, 22 December 2023 (UTC)Reply
@Atitarev I had a typo in the code :-) Fixed. Chernorizets (talk) 05:37, 22 December 2023 (UTC)Reply