User talk:DCDuring/1000EnglishEntries

From Wiktionary, the free dictionary

The code

time \
    bzip2 -d < enwiktionary-20120626-pages-articles.xml.bz2 \
  | perl -w -Mstrict -e \
      'use constant SAMPLE_SIZE => 1000;
       
       # Read one <page>...</page> element per input record.
       $/ = "</page>\n";
       
       my $count;
       my @sample;
       
       while(<>)
       { last unless m/<\/page>/;                   # trailing non-page content
         next unless m/<ns>0<\/ns>/;                # mainspace pages only
         next if m/<text xml:space="preserve" \/>/; # skip pages with empty text
         die unless m/<text xml:space="preserve">([^<]+)<\/text>/;
         next unless $1 =~ m/^== *English *== *$/mg; # must have an ==English== section
         die unless m/<title>([^<]+)<\/title>/;
         my $title = $1;
         ++$count;
         if($count <= SAMPLE_SIZE)
           { push @sample, $title; }                # fill the reservoir first
         else
         { my $tmp = int rand $count;               # uniform on 0 .. $count - 1
           if($tmp < SAMPLE_SIZE)
             { $sample[$tmp] = $title; }            # displace a random member
         }
       }
       
       # Print the sample as a four-column list of section links.
       foreach my $column (0 .. 3)
       { print $column == 0 ? "{{top4}}\n" : "{{mid4}}\n";
         my $min_index = int($column * SAMPLE_SIZE / 4);
         my $max_index = int(($column + 1) * SAMPLE_SIZE / 4) - 1;
         foreach my $i ($min_index .. $max_index)
           { print "* [[$sample[$i]#English|$sample[$i]]]\n"; }
       }
       print "{{bottom}}\n";
      ' \
  > sample-of-1000-English-entry-titles-20120626.txt

RuakhTALK 19:07, 4 July 2012 (UTC)

Wouldn't this tend to get more pages on the list from toward the end of the XML file? The 10000th page had a one in ten (1000/10000) chance of getting into @sample (though it might later be ousted); the 1st page, however, would almost certainly be ousted by the time you got to the end of the XML file.
...unless the XML file itself is in pseudorandom order (is it? I have no idea), in which case this works, though then you can just use last if $count > SAMPLE_SIZE; push @sample, $title without the else.​—msh210 (talk) 01:35, 5 July 2012 (UTC)
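The simplification msh210 describes can be sketched as follows (in Python rather than the thread's Perl, purely for illustration; the function name and the example titles are mine): if the input order is already (pseudo)random, keeping the first SAMPLE_SIZE titles seen is itself a uniform sample, so the displacement branch is unnecessary.

```python
import random

def first_k(items, k):
    """Take the first k items, stopping early; the equivalent of the
    suggested Perl: last if $count > SAMPLE_SIZE; push @sample, $title."""
    sample = []
    for item in items:
        if len(sample) >= k:   # last if $count > SAMPLE_SIZE
            break
        sample.append(item)    # push @sample, $title
    return sample

# Simulating "the file is in random order" with an explicit shuffle:
titles = [f"title{i}" for i in range(10000)]
random.shuffle(titles)
sample = first_k(titles, 1000)
```

Since the real dump is in page-ID order, not random order, this variant would simply return the 1000 oldest English entries, which is the bias the question is about.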
The XML dump is not in pseudorandom order, no; it seems to be in order by page-ID, which is, to a close approximation, order by when the page was initially created. But I think this algorithm is sound; if there are 1000+n English entries, then each one should have a 1000/(1000+n) chance of being in the sample. Here's a sketch of a proof by induction:
  • If there are 1000+0 = 1000 English entries, then each occurs once in the sample; that is, each has probability 100% = 1000/(1000+0) of being in the sample.
  • Assume that the assertion is true for a specific value of n. Then, what happens if there are 1000+n+1 English entries? Well, first we use the algorithm to create a 1000-member sample S_n out of the first 1000+n of them, such that (by assumption) each of those has probability 1000/(1000+n) of being in S_n. Then, according to the algorithm, we create our sample S_{n+1} by giving the last English entry a 1000/(1000+n+1) chance of displacing a randomly selected member of S_n; or, put another way, we give each member of S_n a 1/(1000+n+1) chance of being displaced, that is, a (1000+n)/(1000+n+1) chance of not being displaced. To recap: each of the first 1000+n entries has a 1000/(1000+n) chance of making it into S_n, and then, if it makes it there, a (1000+n)/(1000+n+1) chance of making it into S_{n+1}. (And of course it has no chance of making it into S_{n+1} unless it makes it into S_n.) So its total chance of being included in S_{n+1} is 1000/(1000+n) × (1000+n)/(1000+n+1) = 1000/(1000+n+1), Q.E.D.
(Full disclosure: This algorithm is not original with me. I was already familiar with a well-known algorithm for selecting a random line from a file, and all I did was extend it to selecting a fixed-size random subset. I'm certain I'm not the first person to do that; in fact, Google suggests that even this "extended" version might be in Donald Knuth's The Art of Computer Programming, in which case it too is very well known.)
RuakhTALK 02:14, 5 July 2012 (UTC)
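For readers who want to poke at the sampling step outside the pipeline, here is a sketch of the same reservoir-sampling idea in Python (names and sizes are illustrative, not from the thread), together with a quick Monte Carlo check of the 1000/(1000+n) inclusion probability, shrunk to k = 5 out of n = 20 so it runs quickly.

```python
import random

def reservoir_sample(items, k, rng=random):
    """One-pass uniform k-subset of items (the well-known Algorithm R)."""
    sample = []
    count = 0
    for item in items:
        count += 1
        if count <= k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(count)   # uniform on 0 .. count-1, like Perl int rand $count
            if j < k:
                sample[j] = item       # new item displaces a random member
    return sample

# Monte Carlo check of the proof above, scaled down from 1000/(1000+n):
# with k = 5 and n = 20, every item should be included with probability 5/20.
rng = random.Random(0)
hits = [0] * 20
trials = 20000
for _ in range(trials):
    for x in reservoir_sample(range(20), 5, rng):
        hits[x] += 1
freqs = [h / trials for h in hits]
```

Python's rng.randrange(count) and Perl's int rand $count both yield a uniform integer in 0 .. count-1, so the displacement step matches the pipeline's exactly.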
The proof looks correct (and elegant) :-). Sorry for wasting your time.​—msh210 (talk) 04:31, 5 July 2012 (UTC)
No worries. :-)   —RuakhTALK 11:59, 5 July 2012 (UTC)