User talk:Jberkel


Catalan pronunciations

Hi, just a note to be careful when adding Catalan pronunciations. For example, you added a pronunciation of ê to esquetx, which is wrong (it should be é) and unlikely in any case, since ê generally only occurs with inheritances and some old borrowings, and esquetx is a recent borrowing from English. I have documented the sources of pronunciation in the documentation to {{ca-IPA}}; in particular, only trust the DCVB for Balearic pronunciations and don't trust cawikt at all. Benwing2 (talk) 02:34, 28 January 2024 (UTC)

@Benwing2: Ok, I thought cawikt was fairly reliable. Btw, thanks for your great work on the Catalan corner! Jberkel 10:42, 28 January 2024 (UTC)

Statistics

Hi Jberkel, are you planning to do another update of the statistics? Your last one dates back to 1 July. Yes, I know it takes a lot of time and computing power, but I think we would all simply like to know the numbers again. :) Steinbach (talk) 17:18, 22 February 2024 (UTC)

@Steinbach Hi, I'd like to do it regularly, but there are still data problems with the HTML dumps: phab:T305407. The last reasonably complete data is from last July. The WMF people are working on it, but somehow it is taking forever; I keep asking about it :( Jberkel 17:42, 22 February 2024 (UTC)

HTML Dump

Hi, I saw your posts complaining about the lack of HTML dumps as I had the same issue. I ended up creating my own HTML dump using the API to rapidly download millions of entries. I used the 20240220 XML dump as a base so that the two dumps would include exactly the same revisions. Note that the same wikitext can produce different HTML code at different points in time, so I can't guarantee that the page looks exactly as it did at the time of the XML dump.

  • Pages included: non-redirects in namespaces 0 (main) and 118 (reconstruction)
  • Number of lines: 7,952,575
  • Time generated: February 20, 2024, 7:49:52 PM to February 22, 2024, 1:16:18 AM (EST)
  • Uncompressed size: 112,213,194,308 bytes
  • Compressed size: 5,482,140,342 bytes

Would you be interested in the code or the dump itself?

Ioaxxere (talk) 20:05, 22 February 2024 (UTC)
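
The figures in the list above imply a run time, a download rate and a compression ratio that are easy to derive. The following is a small sketch of that arithmetic, assuming that each line of the dump holds exactly one page (the post does not say so) and that both timestamps are in the same time zone; the name `dump_stats` is made up for illustration.

```python
from datetime import datetime

# Figures copied from the list above.
PAGES = 7_952_575                    # "Number of lines" (assumption: one line per page)
UNCOMPRESSED_BYTES = 112_213_194_308
COMPRESSED_BYTES = 5_482_140_342
START = datetime(2024, 2, 20, 19, 49, 52)  # 7:49:52 PM (EST)
END = datetime(2024, 2, 22, 1, 16, 18)     # 1:16:18 AM (EST)


def dump_stats():
    """Derive run time, throughput and size ratios from the posted figures."""
    elapsed = (END - START).total_seconds()
    return {
        "hours": elapsed / 3600,
        "pages_per_second": PAGES / elapsed,
        "avg_bytes_per_page": UNCOMPRESSED_BYTES / PAGES,
        "compression_ratio": UNCOMPRESSED_BYTES / COMPRESSED_BYTES,
    }


if __name__ == "__main__":
    for name, value in dump_stats().items():
        print(f"{name}: {value:.2f}")
```

Under those assumptions the run took about 29.4 hours at roughly 75 pages per second, with about 14 KB of HTML per page and a compression ratio of roughly 20:1.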

@Ioaxxere Lol, I'm close to starting a project myself, given the glacial progress on the WMF side. Yes, I'm interested, how did you get the HTML, how long does it take? Is it the Parsoid rendered version which is used in the HTML dumps? If you want we can join forces and run it as a community project. Jberkel 09:44, 23 February 2024 (UTC)

The script works by grabbing HTML data using a revision ID. For example: https://en.wiktionary.org/w/api.php?action=parse&oldid=65853771&format=json. I'm not sure what parser is used but it seems to correspond with "view page source" in my browser. Here is the code:

Then I verified the output with this code:

Which produced:

These correspond with pages in the XML dump that have recently been deleted.

I don't have the time/resources to generate these on a regular basis, but you're welcome to adapt this code for your purposes!

Ioaxxere (talk) 19:56, 23 February 2024 (UTC)

Oh god, I just realized that adding &parsoid=true to the API query gives *far* better data. Time to rerun... Ioaxxere (talk) 20:09, 23 February 2024 (UTC)
Cool, thanks! We could run it on WMF infrastructure. Great to see that 50 lines of Python yield better results than the WMF's buzzword soup of Kafka, DAGs and what have you… How long does it take to do a full run? Jberkel 15:20, 26 February 2024 (UTC)
nm, you already had it in your post, almost 2 days… :) Jberkel 15:57, 26 February 2024 (UTC)
Even if the WMF some day manage to produce useful dumps again, we'll still need wiki-specific namespaces such as Reconstruction, so it'll be useful to have some way of generating them ourselves. Jberkel 15:58, 26 February 2024 (UTC)
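
For reference, the request described in this thread (rendered HTML fetched by revision ID through action=parse, optionally with the parsoid flag) might look roughly like the sketch below. It is not Ioaxxere's script: the function names, the User-Agent string and the error handling are assumptions made for illustration, and only the endpoint and the query parameters come from the posts above.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"  # endpoint from the example URL above


def build_parse_url(rev_id, parsoid=False):
    """Build an action=parse request URL for a single revision ID."""
    params = {"action": "parse", "oldid": str(rev_id), "format": "json"}
    if parsoid:
        params["parsoid"] = "true"  # flag mentioned later in the thread
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_html(payload):
    """Return the rendered HTML from a decoded response, or None on an API error."""
    if "error" in payload:
        # Assumption: treat any API error (for example a revision that no
        # longer exists) as "no HTML available" rather than raising.
        return None
    text = payload["parse"]["text"]
    # The default JSON format wraps the HTML as {"*": "..."};
    # formatversion=2 returns a plain string instead.
    return text["*"] if isinstance(text, dict) else text


def fetch_html(rev_id, parsoid=False, user_agent="html-dump-sketch/0.1 (hypothetical)"):
    """Fetch one revision over HTTP; the User-Agent value is a placeholder."""
    request = urllib.request.Request(
        build_parse_url(rev_id, parsoid), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return extract_html(json.load(response))
```

Calling `fetch_html(65853771, parsoid=True)` would request the revision from the example URL with the Parsoid flag set. A full run would loop over the revision IDs taken from the XML dump and would need rate limiting and retries, which this sketch leaves out.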

ScribuntoUnit vs. UnitTests

I just discovered there are two unit testing frameworks here, Module:UnitTests used by everyone but you, and Module:ScribuntoUnit used by you. The former is older than the latter, so I'm not sure why you imported the latter from Wikipedia, but I think we should consolidate. Can you think about converting your unit tests to use Module:UnitTests? Benwing2 (talk) 20:34, 10 March 2024 (UTC)

Hi, just wondering if you got my msg. Can you at least clarify why you imported and started using Module:ScribuntoUnit in preference to our own module? BTW I just discovered a third unit test framework, Module:QFQ/UnitTests, used only on Module:mnw-translit. Benwing2 (talk) 07:43, 14 March 2024 (UTC)
Hi @Benwing2, sorry, I had a short Wiktionary hiatus. It's been a long time (~ 10 years), but I think when I first looked at Module:UnitTests it was a spaghetti mess and didn't have the features I wanted. That's probably no longer the case, and I agree it's better to standardize on one framework. Jberkel 09:27, 15 March 2024 (UTC)

Wwoww, Jberkel, you're fast. Wanted to cite the same Guardian passage here, and it was already there ... MistaPPPP (talk) 12:55, 19 March 2024 (UTC)

Apologies

I need to apologise to you too: my simple edit to my archaic paragraph about certain 'etymologies that discredit Wiktionary' completely disrupted the edit section, including your post. There should really be a mechanism in place to stop this from happening, since any innocent editor could well make a similar mistake, and if it were not detected quickly, as this one was by both Surjection and me, it could cause linguistic mayhem! Regards, Andrew Andrew H. Gray 11:40, 29 March 2024 (UTC)

On ass...

What Doyle said was about this:

https://en.m.wiktionary.org/wiki/arse#English

Here, ass is another way of spelling arse (as in dumb). Lunatone3000 (talk) 22:24, 4 April 2024 (UTC)