Wiktionary talk:About Old Polish


Regional Old Polish


@Benwing2 I wanted to get your input on the technical aspect of an idea I have. I want to somehow mark the locations of attestation of Old Polish words somewhere (I intend to add the place of creation to the quotation templates), as Old Polish was a collection of dialects. I'm not sure marking it in a label is the best idea, as that would imply the word was only used in a given location, which we just cannot be sure of. I was considering doing it in the etymology section, something like "attested in Greater Polish, Lesser Polish, Masovian, Silesian". What do you think? Vininn126 (talk) 09:30, 6 May 2024 (UTC)

@Vininn126 This sounds like a Beer Parlour discussion question. I personally think labels are probably fine and have the advantage that they auto-categorize. We can just add a proviso of sorts to labels stating that in the case of extinct languages, the labels reflect attestation rather than certainty as to where terms were used. (Even in living languages, this may be the case as well, esp. in lesser-documented languages.) Benwing2 (talk) 19:12, 6 May 2024 (UTC)
In that case, I've started a BP discussion; perhaps you're right. That could be a solution, I suppose. Will need more input. Vininn126 (talk) 19:18, 6 May 2024 (UTC)

Old Polish import and corpora


@Benwing2 I've been talking with Bożena Sieradzka-Baziur, the main editor of {{R:zlw-opl:SPJSP}}. She's said that her work there is for the common good and that she doesn't have a problem with us scanning her site and such. I have two ideas:

  • Task 1:

Make a corpus, or rather, two corpora. We could scan the text in the "Przykład w transliteracji" field and put that in one column of a spreadsheet, and then scan "Przykład w transkrypcji" and put that in another column. Also, I think we might want to ignore text in parentheses generally, as explained in the section about Przykład w transliteracji. Either that or we have two versions, one with the text in parentheses and one without. I suppose in theory we could then tell the bot to take the abbreviation (ignoring the page number after it), scan for an equivalent RQ template, and put that information in column C? I'm not sure if you've ever used AntConc, but you can make corpora with bibliographic info etc. I should be able to make two, based on the transliteration and the transcription. A few of the abbreviations will be redlinks. It should be kept in mind that RRp takes a Roman numeral after it: we have {{RQ:zlw-opl:RRp XXII}}, {{RQ:zlw-opl:RRp XXIII}}, {{RQ:zlw-opl:RRp XXIV}}, {{RQ:zlw-opl:RRp XXV}}. The same goes for WokPet I-VIII (each has its own template, but this is also an abbreviation used by Rozariusze, so we don't need to worry about these).
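To make Task 1 a bit more concrete, here is a rough Python sketch of how one corpus row could be built. It is only an illustration under assumptions: the layout of the location string (abbreviation first, then an optional Roman numeral, then the page) is a guess, the abbreviation "Abc" and the sample location are hypothetical, and all function names are made up for this sketch.

```python
import csv
import io
import re

# One parenthesised span plus the whitespace in front of it (no nesting).
PAREN = re.compile(r"\s*\([^()]*\)")
ROMAN = re.compile(r"[IVXLCDM]+")


def strip_parens(text: str) -> str:
    """Remove parenthesised spans and collapse leftover runs of spaces."""
    return re.sub(r"\s{2,}", " ", PAREN.sub("", text)).strip()


def rq_name(location: str) -> str:
    """Guess an RQ template name from a location string (assumed layout)."""
    parts = location.split()
    if not parts:
        return ""
    name = parts[0]
    # RRp is followed by a Roman numeral that belongs to the template name.
    if name == "RRp" and len(parts) > 1 and ROMAN.fullmatch(parts[1]):
        name += " " + parts[1]
    return "RQ:zlw-opl:" + name


def corpus_row(translit: str, transcr: str, location: str, keep_parens: bool = False):
    """Columns A, B, C: transliteration, transcription, guessed RQ name."""
    clean = (lambda s: s.strip()) if keep_parens else strip_parens
    return [clean(translit), clean(transcr), rq_name(location)]


def write_corpus(rows, handle) -> None:
    """Write the rows as CSV so a spreadsheet or corpus tool can load them."""
    csv.writer(handle).writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    row = corpus_row(
        "Obchodzicz bødøø masto (circuibunt civitatem)",
        "normalized text goes here",  # placeholder, not real data
        "RRp XXIV 72",  # hypothetical location string
    )
    write_corpus([row], buffer)
    print(buffer.getvalue())
```

Passing `keep_parens=True` would give the second version of the corpus, with the parenthesised text left in. Any RQ name produced this way would still need to be checked against the templates that actually exist, since some abbreviations will be redlinks.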

  • Task 2:

I believe an import of her data should be possible. Let's take B. Sieradzka-Baziur, Ewa Deptuchowa, Joanna Duska, Mariusz Frodyma, Beata Hejmo, Dorota Janeczko, Katarzyna Jasińska, Krystyna Kajtoch, Joanna Kozioł, Marian Kucała, Dorota Mika, Gabriela Niemiec, Urszula Poprawska, Elżbieta Supranowicz, Ludwika Szelachowska-Winiarzowa, Zofia Wanicowa, Piotr Szpor, Bartłomiej Borek, editors (2011–2015), “obchodzić, hobychodzić się, obchodzić się, obychodzić się”, in Słownik pojęciowy języka staropolskiego [Conceptual Dictionary of Old Polish] (in Polish), Kraków: IJP PAN, →ISBN as an example. I also would like to import data into basically a boilerplate that I'll paste below. I will determine the page name, I guess.

  1. artykuł hasłowy: if there are multiple objects separated by a comma ("obchodzić, hobychodzić się" etc.), then all those objects should be placed into the first parameter of the template {{R:zlw-opl:SPJSP}}, and the first number of the URL should be placed as parameter two. So obchodzić, hobychodzić się, obchodzić się, obychodzić się becomes {{R:zlw-opl:SPJSP|obchodzić, hobychodzić się, obchodzić się, obychodzić się|7955}}. I say first number in the URL because if you click different senses you get a forward slash and then a second number. Next, we can supply these forms in the alternative forms section. After that, we can remove any instances of się (also of course removing the space before it, so obchodzić się > obchodzić) and then remove any duplicates created this way. So the alternative forms section becomes {{alt|zlw-opl|hobychodzić|obychodzić}}. In theory we could also check whether any of these forms differ only in diacritics and supply that to {{also}}, but I think that might be a pain to code.
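The headword handling described above could be sketched roughly like this in Python. This is only an illustration: the URL in the test data is hypothetical (only the "first number in the path" rule comes from the description), the page name is assumed to be supplied separately, and the function names are invented for the sketch.

```python
import re
from urllib.parse import urlparse


def first_url_number(url: str) -> str:
    """Return the first run of digits in the URL path (the entry number)."""
    match = re.search(r"\d+", urlparse(url).path)
    if match is None:
        raise ValueError("no number found in URL path: " + url)
    return match.group(0)


def split_forms(field: str) -> list:
    """Split the comma-separated headword field into individual forms."""
    return [part.strip() for part in field.split(",") if part.strip()]


def strip_sie(form: str) -> str:
    """Drop a trailing 'się' together with the space before it."""
    return re.sub(r"\s+się$", "", form)


def reference_template(field: str, url: str) -> str:
    """Build the SPJSP reference: all forms as parameter 1, number as 2."""
    return "{{R:zlw-opl:SPJSP|" + field + "|" + first_url_number(url) + "}}"


def alt_template(field: str, page_name: str) -> str:
    """Build {{alt}} from the forms, minus 'się', duplicates and the page name."""
    seen = {page_name}
    forms = []
    for form in split_forms(field):
        bare = strip_sie(form)
        if bare not in seen:
            seen.add(bare)
            forms.append(bare)
    if not forms:
        return ""
    return "{{alt|zlw-opl|" + "|".join(forms) + "}}"
```

With the field "obchodzić, hobychodzić się, obchodzić się, obychodzić się" and the page name "obchodzić", `alt_template` yields `{{alt|zlw-opl|hobychodzić|obychodzić}}`, matching the example above. The diacritics-only comparison for {{also}} is left out of the sketch.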

Next is "jednostki", which each contain on the right "szczegóły". Each "unit" has a separate "details" page. We mostly need to focus on details.

  1. Szczegóły
    1. Typ - ignorable (I guess scan to see what exists here but it's never been relevant to an entry)
    2. Rodzaj - ignorable (I guess scan to see what exists here but it's never been relevant to an entry). The types are autosemantic and synsemantic. I'm not sure we have a system for this. You can explore the types here: types. She has them organized also by multiword phrases and a few other categories. But again, I'm not sure we can make use of this at the moment.
    3. Numer - the sense number. First will be a number in parentheses, which is the number of the quote. Then there might be either a ~, which tells you this belongs to the sense above, or a second number not in parentheses, which tells you this is a new definition, which we can put on a new definition line. I believe inna is for words with uncertain meanings, and niepełna A is for collocations, I think. niepełna B is used when there is only a quote supporting an above definition.
    4. Definicja - the definition. It would be nice to put this after {{lb|zlw-opl|attested in|}}. Maybe what we can do is put the Polish definition there and then after that put a Google Translate version? This should speed up my process of checking. The definition will be in italics. Sometimes before the definition, you have text not in italics, i.e. [1]. This might also be participles for verbs. I think just placing everything on the definition line will work. I will have to check verbs.
    5. Gramatyka - Part of speech. There might be multiple parts of speech per page, so multiple headers will be needed. The parts of speech used on the site are as follows:
      1. rzeczownik - {{zlw-opl-noun}}. Gender is not supplied. We can make a guess by checking the last letter of the entry, but there will be cases where this is wrong. If it ends in -e, -ę, or -o, it should be neuter; if it ends in -a, let's supply feminine; and if it ends in a consonant, let's supply m without -pr, -anml, -in, or -an!. Nouns ending in consonants may also be feminine, but since I have to check for animacy anyway, I have to check all nouns ending in a consonant.
      2. czasownik lub forma czasownikowa - {{zlw-opl-verb}}. Sometimes participles are also here, as I mentioned. I do not think it will be possible to supply aspect, as the way to tell is by checking if an imperfective or perfective verb is used in the definition, but it's not marked in the code or anything.
      3. przymiotnik, zadiektywizowany imiesłów lub imiesłów - {{zlw-opl-adj}}. Sometimes participles are here, but generally are semantically adjectives. I do not think it will be possible to supply comparatives or adverbs, but that's okay.
      4. przysłówek - {{zlw-opl-adv}}. As I said, I don't think the comparative will be automatically suppliable, but that's okay.
      5. zaimek - {{h|zlw-opl|pronoun}}. Not everything here is a pronoun and I will have to check.
      6. liczebnik - {{h|zlw-opl|numeral}}. Not everything here is a numeral and I will have to check.
      7. przyimek - {{h|zlw-opl|preposition}}.
      8. particle - {{h|zlw-opl|particle}}.
      9. spójnik - {{h|zlw-opl|conjunction}}.
      10. wykrzyknik - {{h|zlw-opl|interjection}}.
    6. Semantyka - SPJSP is an onomasiological dictionary, and as such words are categorized. I'm really not sure we can import the categories easily, despite that being the main point of the dictionary (she even wrote a book about making an onomasiological dictionary, which she has asked me to read). The categories used on the site can be found here.
    7. Przykład w transliteracji - This equates to the quote and will equate to different parameters for different templates. See "Lokalizacja" below. I also hope it will be possible to see the underlined words on her site (Obchodzicz bødøø masto (circuibunt civitatem)) and convert that to Obchodzicz bødøø masto (circuibunt civitatem). Also, text in parentheses may often be deleted, as this is often the translation of the original document. Or maybe it's the wrong decision to delete that?
    8. Przykład w transkrypcji - This equates to the normalization. In the future I (along with @Silmethule) have agreed that a different, more phonemic normalization should be applied, but this is a project unto itself. After this, the quotation template should be closed with |-}}, as I am not supplying translations yet. This is also a project unto itself. Bolding and deletion of content in parentheses apply as above.
    9. Lokalizacja - this equates to an RQ template. Each RQ template works a little differently, and I'm not sure how much you can gather from scanning their parameters or uses. If it's not possible to figure out where parameters (which determine volume and page) should go, then perhaps the transliteration should be parameter one, the transcription parameter two, and the nullified translation parameter three. I do think many should be possible to automate. There are a few typos on her website; if you could check for any abbreviations that do not match an RQ template, that would be appreciated, so I can fix those and also let her know. Quotations might be the most complicated aspect of the import. However, in theory, if done correctly, we can also supply some other fields. We should be able to make a guess and supply etydates by taking the earliest date supplied by a quote and putting it in the template. This might be a pain to code. The same goes for regional information. Also, considering I want to make a corpus to supply yet more quotes, it might be best to skip this, and I should maybe do it manually.
    10. There are other fields, I can't think of them all at the moment. It would probably behoove us to do a scan to just see what exists. I know this is a huge undertaking. I hope that this information covers most cases.
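As a rough illustration of the "Gramatyka" step in the list above, the part-of-speech labels could be mapped to headword templates with a small table, and noun gender guessed from the last letter. This is a sketch under assumptions: the labels are copied as written in the list (including "particle"), the feminine ending is assumed to be -a, lemmas ending in other vowels are left unguessed, and the names below are invented for the example.

```python
from typing import Optional

# Site label -> headword template text, as listed under "Gramatyka".
HEADWORD_TEMPLATES = {
    "rzeczownik": "{{zlw-opl-noun}}",
    "czasownik lub forma czasownikowa": "{{zlw-opl-verb}}",
    "przymiotnik, zadiektywizowany imiesłów lub imiesłów": "{{zlw-opl-adj}}",
    "przysłówek": "{{zlw-opl-adv}}",
    "zaimek": "{{h|zlw-opl|pronoun}}",
    "liczebnik": "{{h|zlw-opl|numeral}}",
    "przyimek": "{{h|zlw-opl|preposition}}",
    "particle": "{{h|zlw-opl|particle}}",
    "spójnik": "{{h|zlw-opl|conjunction}}",
    "wykrzyknik": "{{h|zlw-opl|interjection}}",
}

VOWELS = set("aąeęioóuy")


def headword_template(label: str) -> Optional[str]:
    """Look up the headword template; None flags a label to review by hand."""
    return HEADWORD_TEMPLATES.get(label.strip())


def guess_gender(lemma: str) -> Optional[str]:
    """Guess noun gender from the final letter; None means no guess."""
    if not lemma:
        return None
    last = lemma[-1].lower()
    if last in "eęo":
        return "n"
    if last == "a":  # assumed feminine ending
        return "f"
    if last.isalpha() and last not in VOWELS:
        # Bare "m": animacy still has to be checked manually.
        return "m"
    return None
```

Returning `None` for unknown labels and unclear endings keeps those cases visible for manual checking instead of silently guessing. How the gender is passed to {{zlw-opl-noun}} is left open here, since that depends on the template's parameters.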

Other fields, such as etymology, sometimes IPA, descendants, references, and categories, will clearly have to be supplied manually. You can see "Rozariusze" in the references; this is a different site, much smaller. I honestly think it would be possible to import that information manually. Perhaps not including Boryś, Mańczak, or Sławski is also a good idea? My workflow tends to be about removing them as opposed to adding them, however. I suppose in theory we could automatically remove Bańkowski for any entries after r (so up through r, but removed starting at s), and remove Sławski for any entries after ł (so through ł, but removed starting at m). I think it should be possible to generate derived/related terms via the etymology section once this is "complete", so I think we can maybe skip that for now. I also wouldn't be opposed to automatically adding {{etymon}} once this is complete.
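The letter-based removal of Bańkowski and Sławski could be sketched as follows in Python. It assumes the page name's first letter is compared in Polish alphabetical order (with q, v and x slotted in), and that a page starting with any other character keeps both references for manual review; the constant and function names are made up for the sketch.

```python
# Polish alphabetical order, used instead of plain code-point comparison
# so that letters like "ł" and "ś" sort where a dictionary puts them.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"

# Template name -> first letter at which the template is dropped.
CUTOFFS = {
    "R:pl:Bańkowski": "s",  # kept through r
    "R:pl:Sławski": "m",    # kept through ł
}


def keep_reference(template: str, page_name: str) -> bool:
    """Decide whether a reference template stays on the given page."""
    cutoff = CUTOFFS.get(template)
    if cutoff is None:
        return True  # no letter rule for this template
    first = page_name[:1].lower()
    if not first or first not in POLISH_ALPHABET:
        return True  # unknown first character: leave for manual review
    return POLISH_ALPHABET.index(first) < POLISH_ALPHABET.index(cutoff)


def prune_references(templates, page_name: str) -> list:
    """Filter a list of reference template names for one page."""
    return [t for t in templates if keep_reference(t, page_name)]
```

For the page obchodzić this would keep Bańkowski (o comes before s) and drop Sławski (o comes after ł).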

==Old Polish==

===Alternative forms===
* {{alt|zlw-opl|}}

===Etymology===
{{bor+|zlw-opl|}}. {{etydate|}}.
{{dercat|zlw-opl|ine-bsl-pro|ine-pro|inh=2}}
{{inh+|zlw-opl|sla-pro|*}}. {{surf|zlw-opl|}}. {{etydate|}}.
From {{af|zlw-opl|}}. {{etydate|}}.

===Pronunciation===
* {{zlw-opl-IPA}}

===POS===
{{zlw-opl|}}

# {{lb|zlw-opl|attested in|}} [[]]
#* {{RQ:zlw-opl:||-}}

====Descendants====
* {{desc|zlw-mas|}}
* {{desc|pl|}}
* {{desc|szl|}}

===References===
* {{R:pl:Boryś}}
* {{R:pl:Mańczak}}
* {{R:pl:Bańkowski}}
* {{R:pl:Sławski}}
* {{R:zlw-opl:SPJSP}}
* {{R:zlw-opl:Rozariusze|+|}}

{{C|zlw-opl|}}

Vininn126 (talk) 10:45, 23 July 2024 (UTC)

@Sławobóg, @KamiruPL, @Tashi as the most active Old Polish editors. Vininn126 (talk) 10:49, 23 July 2024 (UTC)