User talk:Chernorizets/Bulgarian Lemma Improvement Project

From Wiktionary, the free dictionary

My impressions / appreciation

@Chernorizets Wow! This here is truly, incredibly detailed and visionary on your part! It is seriously stunning how much time this project draft must have taken, so I am eager to inform you I think this is a great idea, and we should definitely try as hard as we can to carry it out. It's gladdening to see you also share this vision of making this platform a genuinely worthy dictionary for Bulgarian learners/users; so far, I think we do a tremendous job in particular of detailing etymologies, but indeed, there is still room to grow in all dimensions.

First of all, what do you see as being concrete ways we can carry out this project? I will propose my own ideas here:

  • I've been thinking more again lately about our previous work on hyphenation/syllabification. I'll be honest - I forgot exactly what we were up to with respect to that - but I do recall we already principally have the backend ready for both hyphenation and syllabification, and it's just a matter of gluing these together to improve the scope of {{bg-hyph}}.
  • How will we collaborate on this project? Should we just not overthink it and each do what we can? Or should we think of something more systematic? My particular thought would be something like a big table of lemmas, where we can each sign our names next to a word to indicate we've checked that entry and brought it up to a given level. If you're familiar with Wikisource, vaguely like their verification system: users can check over certain content to ensure it's up to the standard we want.

At any rate, thanks very much for this initiative. This is brilliant, Kiril kovachev (talkcontribs) 14:20, 15 November 2023 (UTC)

Hey @Kiril kovachev, thanks for the kind words! The writing part didn't take that long, to be honest, because I had sort of been writing it in my head for the past couple of months :-)
Your question about how people collaborate on the project is a good one, and my answer is indeed to not overthink it. I call it a project, but it might more aptly be described as a call to action, and as such is just an ongoing improvement initiative. Originally, I was going to have a sign-up sheet where people could volunteer to work on words starting with specific letters of the alphabet. Then I realized people might want to work on particular groups or types of words (e.g. technical vocabulary) rather than on an alphabetic principle. So I decided not to have a sign-up sheet, and not to add extra bookkeeping burden for participants in general. I think what I might do is add a section at the end for editors to optionally mention their overall area of contribution, if for no other reason than to reduce duplication. Although, with 10k+ lemmas, I'm not so worried about duplication to be honest.
Cheers,
Chernorizets (talk) 20:56, 15 November 2023 (UTC)
I see, that sounds reasonable. Adding more bureaucracy wouldn't by itself do much good for us after all. What you've written here is a great goal to aspire to, in my opinion. I look forward to your future levels :) Kiril kovachev (talkcontribs) 21:26, 15 November 2023 (UTC)
@Kiril kovachev I did add an optional section at the end for people to list what they're working on if they feel like it.
As for your other question about hyphenation/syllabification - like you said, it shouldn't in principle be hard to retool {{bg-hyph}} to provide both. The annoying thing would be figuring out how to deal with the bullet points for those two pronunciation components - if the template returns them, then that would break existing pages, since we type * {{bg-hyph}}. IMO at that point it would probably be more profitable to revisit the {{bg-pr}} idea and let that template generate all applicable bullet points. Chernorizets (talk) 01:31, 16 November 2023 (UTC)Reply
@Chernorizets In as far as the bullet points, at least for the moment it would be possible to just make it place the second bullet point if applicable, but rely on the user for the first one; this would principally break if you needed the hyphenation to be nested, but I think so far we've not got anything like that...? That said, we would benefit for sure from finishing {{bg-pr}}. My old WIP module is kind of dissatisfactory, in that I'm not all too sure what kind of interface we want for the template to have, so I don't know what needs to be possible in the module. But we should discuss that all wherever we're developing that. All in all the linguistic stuff is mostly already done after all, so I suppose it's just a software problem now :) Kiril kovachev (talkcontribs) 06:26, 16 November 2023 (UTC)Reply
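A rough Python sketch of the workaround described above, where the editor keeps typing the first bullet and the template supplies the second one only when it is needed. The function name `hyph_output`, the label texts and the separator are assumptions made for illustration, not the behaviour of the actual {{bg-hyph}} module, and the two splits of сестра are placeholders chosen only to give differing outputs.

```python
SEP = "\u2027"  # hyphenation point, assumed here as the display separator

def hyph_output(hyphenation, syllabification=None):
    """Return the text a combined template could emit after the editor's own '* '.

    The first line carries no bullet, because existing pages already type it.
    A second, bulleted line is added only if the syllabification differs.
    """
    lines = ["Hyphenation: " + SEP.join(hyphenation)]
    if syllabification is not None and syllabification != hyphenation:
        lines.append("* Syllabification: " + SEP.join(syllabification))
    return "\n".join(lines)

# Existing pages write "* {{bg-hyph}}", so the page supplies the first bullet.
page_text = "* " + hyph_output(["сес", "тра"], ["се", "стра"])
print(page_text)
```

Because the second bullet is hard-coded as a single `*`, a page that nests the call under `**` would get mismatched list levels, which is the breakage mentioned above.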
@Chernorizets Hi, now I've gone and read the tiers again and I think everything you suggest is a good direction for editing in the future. Regrettably, I don't think I've really got much to add, except a few points:
  • I would encourage anyone else who can to also record audios. I don't think my voice and pronunciation are necessarily the most representative of the standard Bulgarian pronunciation, and also my mic is really bad and my voice peaks and screeches too much in some audios. Of course, I'm trying to ensure the quality is up to standard, but this could be alleviated if we had a better sample size of audio recorders. For anyone who sees this, I wish that they be invigorated to help with the effort.
  • Can we port over quotations from RBE? Most definitions there are already supplied with a named author, which leaves us to find the year and full author name and title, but the text is usually a very high-quality and helpful illustration of the usage. If it's not damning with regard to copyright, I would say this is an excellent springboard if we are to add quotations.
  • Similarly, would scraping Bulgarian vernacular names of plants and animals from Wikispecies be a valid strategy for filling out some missing entries? You already allude to this in level 4, but I would take it a step further by automatically scraping every Wikispecies entry with Bulgarian vernacular definitions. Theoretically this way we would have a big creation list that states the Bulgarian and English names of each organism and let editors make an entry of it quite easily. This may not save much time for a given entry, but systematizing the process may be helpful.
Thanks for the updates, Kiril kovachev (talkcontribs) 20:51, 30 November 2023 (UTC)
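The Wikispecies creation-list idea could be sketched roughly like this in Python. It assumes vernacular names are stored in a `{{VN|...}}` template with language-code parameters such as `bg=` and `en=`; the helper names `parse_vn` and `creation_row` and the sample wikitext are made up for illustration, and fetching the pages themselves is left out.

```python
import re

# Matches a {{VN ...}} block; assumes no nested templates inside it.
VN_BLOCK = re.compile(r"\{\{\s*VN\s*\|(.*?)\}\}", re.DOTALL)

def parse_vn(wikitext):
    """Return {language code: vernacular name} from the first {{VN}} block."""
    match = VN_BLOCK.search(wikitext)
    if match is None:
        return {}
    names = {}
    for part in match.group(1).split("|"):
        if "=" not in part:
            continue
        code, _, name = part.partition("=")
        code, name = code.strip(), name.strip()
        if code and name:
            names[code] = name
    return names

def creation_row(taxon, wikitext):
    """Return (Bulgarian name, English name, taxon), or None without a bg name."""
    names = parse_vn(wikitext)
    if "bg" not in names:
        return None
    return (names["bg"], names.get("en", ""), taxon)

# Illustrative page text, not copied from a real Wikispecies page.
sample = """== Vernacular names ==
{{VN
|bg=Обикновен бук
|en=European beech
|de=Rotbuche
}}"""
print(creation_row("Fagus sylvatica", sample))
```

Each row would still need a human check before an entry is created from it.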
@Kiril kovachev thanks for the comments and suggestions! Here are a few thoughts:
  • Regarding quotations from RBE, that's a good idea, and there are actually downloadable PDFs which list all the author and work abbreviations. You can find them on the main page under "Източници на използвания илюстративен материал и съкращенията им". What I'd like us to try to do is create quotation templates for commonly quoted works - e.g. {{RQ:bg:Vazov:PodIgoto}} - so that all one needs to do is provide the BG text, the English translation and ideally the page. That's a mini-project in its own right, and I'll definitely need help to make it happen.
  • The idea of scraping Bulgarian vernacular names off of Wikispecies is really interesting. The only caveat I'd provide is that, since Wikispecies is similarly community-edited, and the vernacular names aren't its main focus, it shouldn't be our only source for names. The Bulgarian Etymological Dictionary often quotes a source abbreviated as "БотР" when it gives systematic names for plants in particular - that happens to be a 1939 botanical dictionary available on Archive.org.
Your audios are great! Every word sounds optimistic :-) Having heard my fair share of tired-sounding Bulgarian audio on other sites, I think this is far better for learners. Keep up the great work! Chernorizets (talk) 21:32, 30 November 2023 (UTC)
@Chernorizets Oh, you're right, that is one massive bibliography. But how would we figure out the most commonly quoted works? Just as we go, inspecting entries?
Also, I guess it's just me being biased, but adding vernacular names is the main thing I've ever done on Wikispecies, so I thought it would be a massive untapped goldmine of uncreated entries, which nevertheless I'd still like to exploit sometime in the future - but still, I do agree we should diversify it using that dictionary, which BTW is an excellent spot! We should make that a reference template — what name would you give it? (I trust you as the Christener since you're very good at giving names I do think :'))
And thanks about my audios, lol. With respect to that, I do hope to make a big drive for audios in about 10 days or so from now — I'm let off on holiday soon and I'll have a lonngg while to chip away at the contents of Category:Bulgarian lemmas. Kiril kovachev (talkcontribs) 22:07, 30 November 2023 (UTC)Reply
@Kiril kovachev well I better make more entries for you then! lol
Regarding the botanical dictionary - yes, that's a good target for a reference template. I might call it something like {{R:bg:Botanical:1939}}, since there are more recent botanical dictionaries that someone could conceivably use (I don't know if any of them are online, but I doubt it because it would be a copyright issue). Also, I wouldn't give up on your idea of using Wikispecies - I'm just conveying what one of their editors told me about the "weight" of vernacular names vs. the core taxonomic data on there.
Regarding the quotation templates - yes, my idea was to just default to creating a quotation template whenever we'd use a quote from the dictionary. Some of those templates will get reused more than others, but that's OK. It's only slightly more work than doing {{quote-book}} directly. Chernorizets (talk) 22:35, 30 November 2023 (UTC)
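What a per-work quotation template saves can be shown with a small Python sketch that pre-fills the bibliographic fields and leaves only the passage, translation and page to the editor. The registry `WORKS`, its key and the function `quote_wikitext` are hypothetical; the {{quote-book}} parameter names used here (`author`, `title`, `year`, `page`, `passage`, `t`) and the year given for the sample work should be checked before any real use.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Work:
    author: str
    title: str
    year: str

# Hypothetical registry, one record per commonly quoted work.
WORKS = {
    "Vazov:PodIgoto": Work(author="Иван Вазов", title="Под игото", year="1894"),
}

def quote_wikitext(key, passage, translation, page=None):
    """Build a {{quote-book}} call with the work's fixed fields pre-filled."""
    work = WORKS[key]
    params = [
        "bg",
        f"author={work.author}",
        f"title={work.title}",
        f"year={work.year}",
    ]
    if page is not None:
        params.append(f"page={page}")
    params.append(f"passage={passage}")
    params.append(f"t={translation}")
    return "{{quote-book|" + "|".join(params) + "}}"

# Placeholder strings stand in for a real passage and its translation.
print(quote_wikitext("Vazov:PodIgoto", "BG TEXT", "English translation", page=12))
```

A passage containing a literal pipe character would need escaping, which this sketch does not handle.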
@Chernorizets Haha, I dare you. Anyway, that all makes sense. I'm a fan of these ideas for sure.
I just got a new idea, by the way, which probably aligns itself with Tier 4: do you think we should add pre-reform spellings as alternative forms in our entries? There may or may not be value in this, I couldn't say, because, although I massively like the idea of relating entries to their historical orthographies, I also believe it might be extraneous information for many readers. Adding an "alternative forms" header also gives undue weight to the thought that these forms are an actual alternative that one might see these days, but it would really only apply when reading old literature (unexpurgated). Even if we put {{bg-PRO}}, it may be slightly cluttersome. Nevertheless, I still feel like doing it, but I'm interested in your thoughts. Kiril kovachev (talkcontribs) 22:50, 30 November 2023 (UTC)
@Kiril kovachev historical spellings are an interesting topic - I don't currently have a good answer of how much is too much vs. enough. At a minimum, if I find myself citing {{R:bg:Gerov}}, I include the Gerov spelling in the alternative forms, because otherwise it could be confusing to a casual reader why I'm adding a reference to a seemingly "different" word. Outside of that, there are several issues to consider:
  • while we currently lump everything before 1945 as "pre-reform orthography", there were actually several orthographic reforms since the re-establishment of Bulgaria - see Категория:История на българския правопис, and more specifically (in order of introduction): Дриновски правопис, Иванчевски правопис (interrupted for two years by the Омарчевски правопис), and then finally the 1945 reform. There are a number of words that aren't spelled the same under all of these orthographic standards.
  • going even further back, the Modern Bulgarian phase of the language is said to start in the 16th century, so literature from 16th - 19th c. is also within scope for a Bulgarian entry. Spelling during that period was still heavily influenced by the Old Church Slavic tradition, with many letters that aren't used in subsequent orthographies. In the very few instances I've added such forms (see e.g. кадър) I've resorted to using {{defdate}} to provide some time marker, but it feels somewhat ad-hoc.
  • in both of the above cases, we need to think about how transliteration ought to work. E.g. in pre-reform "градъ", the final "ъ" is silent - I'm not sure what the correct way to transliterate that is. We could probably ask some of the Russian editors, since their 1918 reform chopped off some word-final silent letters IIRC. For pre-Liberation texts, it's a bit trickier, since e.g. the yers (ь ъ) could stand for a variety of vowels. Maybe it's simpler than I think - transliteration is not pronunciation, after all - but I don't know the answer OTOH. Chernorizets (talk) 23:36, 30 November 2023 (UTC)Reply
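The two options for the silent final "ъ" can be compared with a minimal Python sketch. The letter table follows a scientific-style romanisation (č, ž, š, ǎ) and is an assumption, not a copy of the actual transliteration module; the `drop_final_yer` switch is hypothetical. Pre-reform letters missing from the table (e.g. ѣ, ѫ) are passed through unchanged, since how to render them is exactly the open question.

```python
# Assumed letter table, lowercase only.
BG_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ж": "ž",
    "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f",
    "х": "h", "ц": "c", "ч": "č", "ш": "š", "щ": "št", "ъ": "ǎ", "ь": "j",
    "ю": "ju", "я": "ja",
}

def transliterate(word, drop_final_yer=False):
    """Letter-by-letter transliteration; the input is lowercased first.

    With drop_final_yer=True a word-final ъ is treated as silent and omitted.
    Letters not in the table are returned unchanged.
    """
    lower = word.lower()
    if drop_final_yer and lower.endswith("ъ"):
        lower = lower[:-1]
    return "".join(BG_TO_LATIN.get(ch, ch) for ch in lower)

print(transliterate("градъ"))                       # gradǎ
print(transliterate("градъ", drop_final_yer=True))  # grad
```

The first call is the strict script conversion; the second leans toward pronunciation, which is the trade-off raised above.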
@Chernorizets Indeed, there is certainly some kind of balance to be struck, and it's quite a lot of information for people to take in if you consider all the possible spellings over time. At some point the spellings become so rare that the average reader will never see that spelling used at all, outside of Wiktionary anyway. That said, it would definitely be good to split up the orthography qualifiers like you pointed out.
About transliteration, I encountered some trouble with perspectives on this point, because you've got to consider who it's for. For one, those who already can read the script don't need the transliteration, so it's not for them; and those who can't read it won't benefit much if the transliteration doesn't adequately convey the pronunciation; they won't benefit just from seeing a Latinised version of a script whose pronunciation they still can't understand. Whilst the purest function of transliteration is just to convert between scripts, arguably this is not so beneficial as using a transliteration that indicates pronunciation intuitively. I don't really know how the ъ at the end of words was historically pronounced; I guess it had some kind of pronunciation? But since when has that not been the case? If answering that is too hard, maybe just straight-up transliterating regardless of a specific period's pronunciation values would be best.
And, when writing out old forms, I'm entertaining an alternative idea about where to put them, such as as part of the headline: consider the practice used by Japanese, e.g. on 末子#Etymology_1, where the historical spelling is given its own superscript note if the editor specifies it. This extends to a yet-older spelling, "old" spelling, which implies this can be generalized to a number of old orthographies if we wanted to do that. In our case, "historical" would be pre-1945, and we could have the various stages before that as additional old forms perhaps. I honestly find the parallel to Japanese to be quite fitting, because it also underwent a massive reform after 1945, which clearly delineates the literature from the present day from the past. The only difference is that (AFAIK) Bulgarians don't have as strong of a consciousness of their old orthography as the Japanese, and dictionaries don't feature the old spellings anymore either.
Kiril kovachev (talkcontribs) 12:02, 1 December 2023 (UTC)

Initial thoughts

@Chernorizets What a great idea! I'm impressed that you've included goals aimed at helping Bulgarian learners. As a learner myself, the things I find most valuable are:

  1. Quality translations in Bulgarian and English entries.
  2. Examples of translations.
  3. Stress indicators.
  4. Audio, as I don't find the IPA very helpful. I don't have the inclination to learn it either; Bulgarian is enough for now! I also find the romanisation system awkward and prefer the Google approach to romanisation (simpler and more intuitive, e.g. kuche as opposed to kúče).

Next come things like gender, declensions, adverb forms of nouns and conjugations.

Then things like etymologies, related terms, synonyms etc.

Keep up the good work!

Thanks. SimonWikt (talk) 19:11, 15 November 2023 (UTC)

@SimonWikt thanks for the learner POV! More of the things you've listed will be captured under Tier 2 and 3 - I hope to have those out soon.
Cheers,
Chernorizets (talk) 20:58, 15 November 2023 (UTC)
@SimonWikt I just finished the writeup (I think). Tiers 1 to 3 capture the bulk of the work, and Tier 4 is extra credit. Lemme know what you think. Chernorizets (talk) 08:07, 27 November 2023 (UTC)
Hi @Chernorizets. Thanks for this, a lot of very good information and guidance! I hope I can do a worthy job of Tier 1 and most of Tier 2.
The labelling of dated, archaic or obsolete is challenging even with your improved guidelines! Should I default to obsolete for остар. and archaic for старин. or just leave them out?
Thanks, SimonWikt (talk) 10:02, 2 December 2023 (UTC)
@SimonWikt let me know which part of the guideline on dated/obsolete is challenging or unclear, so I can try to improve it and make it easier to follow. In the meantime, I'd recommend that you don't create entries for dated/obsolete/archaic words unless applying the guideline for them seems uncontroversial. You can always request such words on Wiktionary:Requested entries (Bulgarian). Chernorizets (talk) 10:10, 2 December 2023 (UTC)Reply
@Chernorizets I don't create entries for dated/obsolete/archaic words but I do sometimes come across dated/obsolete/archaic senses in RBE and Chitanka for common words. The challenge I have is not with your guidelines but, as a non-native speaker, with my lack of ability to apply them, such as the "Quotation Test" or the "Heard-It-Before Test" :)
SimonWikt (talk) 10:19, 2 December 2023 (UTC)

Feedback request

Hi @Benwing2,

This is a rather long read for one sitting, so there's no hurry, but I'd be curious to get your thoughts. In particular, I think there are parts of this writeup that should probably live in Wiktionary:About Bulgarian, as well as parts that simply reflect my opinionated take on how we should go about improving Bulgarian entries. I'm just not sure which is which.

I feel like the stuff in what I call "Tier 1" should be the minimum bar for new entries, and it's a higher bar than a fair amount of existing Bulgarian entries. I made a change to WT:ABG in October which brings the "basic entry" example more in line with what I describe in Tier 1, but I'm unwilling to make further changes without some discussion on potentially undue burden for newcomers.

Thanks,

Chernorizets (talk) 08:20, 27 November 2023 (UTC)

@Chernorizets I'll take a look in the next day or so ... it's about my bedtime now. If I don't get to it within two days, remind me again, thanks! Benwing2 (talk) 09:01, 27 November 2023 (UTC)
@Chernorizets I took a look at Tier 1 and Tier 2. I think overall this is very well written and thought through. A few comments:
  • I have a Python script among my bot scripts that facilitates generating Bulgarian entries. I've used this to generate some Bulgarian entries (e.g. Bulgarian прекарам (prekaram)), and similar scripts were used to generate a whole lot of Russian entries and some Ukrainian entries. One thing you might do is create an equivalent Lua module; there are such modules for a lot of languages and I partly created one for Russian in Module:User:Benwing2/ru-new that's a direct port of my Python script, but not finished. What's especially useful about this is it minimizes the repetitive typing esp. of terms that need to be in multiple places in the wikitext.
  • For Tier 1, I'm not sure hyphenation needs to be included, and ideally this should be handled through a combined {{bg-pr}} template (I think User:Kiril kovachev made a similar point).
  • For Tier 2, one thing you might add is usage examples or collocations to illustrate relational adjectives. In the Russian entries I created, I generally didn't include usage examples (since as a non-native speaker it's hard for me to create them and to be sure they're grammatical and idiomatic), but I made a point of including collocations to illustrate relational adjectives; otherwise they may seem a bit mysterious to someone who's not familiar with the concept.
  • For Russian in particular, the English distinction between archaic, obsolete and dated turned out to be rather tricky since Russian monolingual dictionaries often just write устар. without further distinction. User:Atitarev and I made the decision to gloss this as dated since this is the least restrictive of the three; not ideal but better than not including the info at all. If the same issue exists in Bulgarian, you might make note of it.
  • Another issue we ran into was terms denoted as прост. in Russian dictionaries. This stands for просторечный, which is often glossed as colloquial, but that doesn't convey the difference between прост. and разг. = разговорный, also glossed as colloquial. We chose to gloss this using {{lb|ru|low|_|colloquial}}, which is somewhat of a made-up term but conveys the basic gist that the term is somehow considered uneducated without exactly being nonstandard (in the English sense), and also categorizes into CAT:Russian colloquialisms. (We should probably add "low colloquial" as a Russian-specific label, but this hasn't been done yet.) Another such term that has no real equivalent in English is ласк. = ласковый, which we gloss using {{endearing form of}}, or {{endearing diminutive of}} if combined with уменьш. = уменьшительный = diminutive, as is common. In Czech, meanwhile, Czech monolingual dictionaries have a term that we've glossed as expressive (in this case, we did add a Czech-specific label, which appropriately links to a section of WT:About Czech). If there are similar such terms in Bulgarian, you might make note of them and how to handle them.
Benwing2 (talk) 05:57, 30 November 2023 (UTC)
@Benwing2 thanks for the feedback!
  • Usage examples and collocations are in scope for Tier 2, so definitely yes to that.
  • I made a section in Wiktionary:About Bulgarian about how to handle the dated/archaic/obsolete distinction, given a similar situation to what you've described for Russian. It was the result of a BP thread a while back. It's still not ideal, but it's workable. That section is referenced from within this doc.
  • The "colloquial" thing has been on my mind for a while. Bulgarian dictionaries also have several terms that we loosely equate with "colloquial", and in reality we'd need to have a richer set of labels to convey those distinctions. E.g. we have простонар. (prostonar.) for простонароден (prostonaroden, (approx.) traditional among or typical of "common folk"). It's distinct from "low colloquial", which we do use for certain less-than-polite colloquial terms, but простонар. (prostonar.) doesn't have that connotation. I left it out from this doc because 1) I don't have a good answer, and 2) it's a bit advanced. I had to draw a line somewhere on what to cover.
  • we have endearing diminutives too, and other flavors of diminutives. It's on my list somewhere to start a BP thread about there being a richer set of diminutives, just like we might need a richer set of "colloquial" flavors.
I've seen the concept of a quick-new-entry template in some languages (Chinese I think), where some articles start as subst-ed invocations of a template. I wasn't sure if that was used elsewhere, but good to know it's not a frowned-upon practice.
Thanks for taking a look,
Chernorizets (talk) 07:55, 30 November 2023 (UTC)
@Chernorizets Of course and I will try to look at Tiers 3 and 4 as well. In Russian I remember running into народный and нар.-поэт. and such. I think the former got rendered as popular or folk or something and the latter as folk poetic. I think it's good to come up with standard translations for these so people won't translate them ad-hoc. Even if they are made-up Wiktionary-specific terms, IMO this is better than inconsistent translations because at least we can link them to WT:About Bulgarian or whatever with a proper explanation. Benwing2 (talk) 08:31, 30 November 2023 (UTC)Reply
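One way to keep such translations consistent is a single lookup table from dictionary abbreviations to label templates. The Python sketch below uses only the Russian mappings stated in this thread (устар., разг., прост.) and records Bulgarian простонар. as undecided; the table `LABELS` and the function `label_template` are hypothetical names, not an existing module.

```python
# (language code, dictionary abbreviation) -> label parts, or None if undecided.
LABELS = {
    ("ru", "устар."): ["dated"],
    ("ru", "разг."): ["colloquial"],
    ("ru", "прост."): ["low", "_", "colloquial"],
    ("bg", "простонар."): None,  # no agreed label yet in this discussion
}

def label_template(lang, abbreviation):
    """Return the {{lb}} call for a dictionary abbreviation.

    Returns None when the abbreviation is unknown or has no agreed label,
    so the caller can flag the sense for manual review instead of guessing.
    """
    parts = LABELS.get((lang, abbreviation))
    if parts is None:
        return None
    return "{{lb|" + "|".join([lang] + parts) + "}}"

print(label_template("ru", "прост."))
print(label_template("bg", "простонар."))
```

Returning None for undecided abbreviations keeps ad-hoc translations out of entries until a standard rendering is agreed and documented.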