User talk:DCDuring/1000EnglishEntries

From Wiktionary, the free dictionary

The code

time \
    bzip2 -d < enwiktionary-20120626-pages-articles.xml.bz2 \
  | perl -w -Mstrict -e \
      'use constant SAMPLE_SIZE => 1000;
       
       # Read one <page>...</page> element per input record.
       $/ = "</page>\n";
       
       my $count;
       my @sample;
       
       while(<>)
       { last unless m/<\/page>/;                   # trailing non-page content
         next unless m/<ns>0<\/ns>/;                # mainspace pages only
         next if m/<text xml:space="preserve" \/>/; # skip pages with empty text
         die unless m/<text xml:space="preserve">([^<]+)<\/text>/;
         next unless $1 =~ m/^== *English *== *$/mg; # must have an ==English== section
         die unless m/<title>([^<]+)<\/title>/;
         my $title = $1;
         ++$count;
         if($count <= SAMPLE_SIZE)
           { push @sample, $title; }                # fill the reservoir first
         else
         { my $tmp = int rand $count;               # uniform on 0 .. $count - 1
           if($tmp < SAMPLE_SIZE)
             { $sample[$tmp] = $title; }            # displace a random member
         }
       }
       
       # Print the sample as a four-column list of section links.
       foreach my $column (0 .. 3)
       { print $column == 0 ? "{{top4}}\n" : "{{mid4}}\n";
         my $min_index = int($column * SAMPLE_SIZE / 4);
         my $max_index = int(($column + 1) * SAMPLE_SIZE / 4) - 1;
         foreach my $i ($min_index .. $max_index)
           { print "* [[$sample[$i]#English|$sample[$i]]]\n"; }
       }
       print "{{bottom}}\n";
      ' \
  > sample-of-1000-English-entry-titles-20120626.txt

RuakhTALK 19:07, 4 July 2012 (UTC)

Wouldn't this tend to get more pages on the list from toward the end of the XML file? The 10000th page had a one in ten (1000/10000) chance of getting into @sample (though it might later be ousted); the 1st page, however, would almost certainly be ousted by the time you got to the end of the XML file.
...unless the XML file itself is in pseudorandom order (is it? I have no idea), in which case this works, though then you can just use last if $count > SAMPLE_SIZE; push @sample, $title without the else.​—msh210 (talk) 01:35, 5 July 2012 (UTC)
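The simplification msh210 describes can be sketched as follows (in Python rather than the thread's Perl, purely for illustration; the function name and the example titles are mine): if the input order is already (pseudo)random, keeping the first SAMPLE_SIZE titles seen is itself a uniform sample, so the displacement branch is unnecessary.

```python
import random

def first_k(items, k):
    """Take the first k items, stopping early; the equivalent of the
    suggested Perl: last if $count > SAMPLE_SIZE; push @sample, $title."""
    sample = []
    for item in items:
        if len(sample) >= k:   # last if $count > SAMPLE_SIZE
            break
        sample.append(item)    # push @sample, $title
    return sample

# Simulating "the file is in random order" with an explicit shuffle:
titles = [f"title{i}" for i in range(10000)]
random.shuffle(titles)
sample = first_k(titles, 1000)
```

Since the real dump is in page-ID order, not random order, this variant would simply return the 1000 oldest English entries, which is the bias the question is about.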
The XML dump is not in pseudorandom order, no; it seems to be in order by page-ID, which is, to a close approximation, order by when the page was initially created. But I think this algorithm is sound; if there are 1000+n English entries, then each one should have a 1000/(1000+n) chance of being in the sample. Here's a sketch of a proof by induction:
  • If there are 1000+0 = 1000 English entries, then each occurs once in the sample; that is, each has probability 100% = 1000/(1000+0) of being in the sample.
  • Assume that the assertion is true for a specific value of n. Then, what happens if there are 1000+n+1 English entries? Well, first we use the algorithm to create a 1000-member sample S_n out of the first 1000+n of them, such that (by assumption) each of those has probability 1000/(1000+n) of being in S_n. Then, according to the algorithm, we create our sample S_{n+1} by giving the last English entry a 1000/(1000+n+1) chance of displacing a randomly selected member of S_n; or, put another way, we give each member of S_n a 1/(1000+n+1) chance of being displaced, that is, a (1000+n)/(1000+n+1) chance of not being displaced. To recap: each of the first 1000+n entries has a 1000/(1000+n) chance of making it into S_n, and then, if it makes it there, a (1000+n)/(1000+n+1) chance of making it into S_{n+1}. (And of course it has no chance of making it into S_{n+1} unless it makes it into S_n.) So its total chance of being included in S_{n+1} is 1000/(1000+n) × (1000+n)/(1000+n+1) = 1000/(1000+n+1), Q.E.D.
(Full disclosure: This algorithm is not original with me. I was already familiar with a well-known algorithm for selecting a random line from a file, and all I did was extend it to selecting a fixed-size random subset. I'm certain I'm not the first person to do that; in fact, Google suggests that even this "extended" version might be in Donald Knuth's The Art of Computer Programming, in which case it too is very well known.)
RuakhTALK 02:14, 5 July 2012 (UTC)
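For readers who want to poke at the sampling step outside the pipeline, here is a sketch of the same reservoir-sampling idea in Python (names and sizes are illustrative, not from the thread), together with a quick Monte Carlo check of the 1000/(1000+n) inclusion probability, shrunk to k = 5 out of n = 20 so it runs quickly.

```python
import random

def reservoir_sample(items, k, rng=random):
    """One-pass uniform k-subset of items (the well-known Algorithm R)."""
    sample = []
    count = 0
    for item in items:
        count += 1
        if count <= k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(count)   # uniform on 0 .. count-1, like Perl int rand $count
            if j < k:
                sample[j] = item       # new item displaces a random member
    return sample

# Monte Carlo check of the proof above, scaled down from 1000/(1000+n):
# with k = 5 and n = 20, every item should be included with probability 5/20.
rng = random.Random(0)
hits = [0] * 20
trials = 20000
for _ in range(trials):
    for x in reservoir_sample(range(20), 5, rng):
        hits[x] += 1
freqs = [h / trials for h in hits]
```

Python's rng.randrange(count) and Perl's int rand $count both yield a uniform integer in 0 .. count-1, so the displacement step matches the pipeline's exactly.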
The proof looks correct (and elegant) :-). Sorry for wasting your time.​—msh210 (talk) 04:31, 5 July 2012 (UTC)
No worries. :-)   —RuakhTALK 11:59, 5 July 2012 (UTC)