# Wiki Data
Here, we prepare training data from an English Wikipedia data dump.
Nothing too interesting, just a series of tedious steps because I don't know any better tools for this.

## Steps
### Download
Download a dump of the English Wikipedia in XML format.
This can be done, for example, here:
[https://dumps.wikimedia.org/backup-index.html](https://dumps.wikimedia.org/backup-index.html).
What we are interested in is a file like `enwiki-latest-pages-articles.xml.bz2`.

The dump that we experiment with here is named `enwiki-20190201-pages-articles.xml` and has a
compressed size of 15GB.

### Split
I use [xmldump2files.py](https://github.com/adamwulf/wikipedia2text/blob/master/xmldump2files.py)
to split the XML dump into individual files, one per document:
```
bzcat enwiki-latest-pages-articles.xml.bz2 | ./xmldump2files.py /dev/stdin docs
```
Documents are saved in a two-level, hash-based directory tree under `docs/`.
For example, there will be the file `docs/2f/7c/Abraham_Lincoln.txt`.
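
I haven't looked at how `xmldump2files.py` chooses these directories, so if you need to find the
file for a particular article, the simplest thing I can think of is to build an index by walking
the tree. A throwaway sketch (hypothetical helper, not part of the repo; file names appear to be
percent-encoded titles with `_` for spaces):
```
#!/usr/bin/env python3
# Throwaway helper (not part of the repo): map article titles back to their
# files without knowing how xmldump2files.py picks the hash directories.
import os
import urllib.parse

def build_index(root='docs'):
    index = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.txt'):
                # File names look like percent-encoded titles with '_' for spaces.
                title = urllib.parse.unquote(name[:-len('.txt')]).replace('_', ' ')
                index[title] = os.path.join(dirpath, name)
    return index

# build_index()['Abraham Lincoln']  # e.g. 'docs/2f/7c/Abraham_Lincoln.txt'
```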

### Filter Documents
I get about 10M extracted documents, as `xmldump2files.log` shows:
```
Redirects 8465477 Deleted 0 Disambigs 271019 Lists 231905 Skipped 0 Wrote 10200000 50.01GiB Total 10200000 50.01GiB (185%)
```
The official
[Wikipedia statistics](https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#Annual_growth_rate_for_the_English_Wikipedia)
say that there are currently about 5.8M articles.
The dump that I downloaded contains a lot of non-articles.
Note that [our version of xmldump2files.py](xmldump2files.py) already filters out redirects and disambiguations.
The most serious remaining offenders can be found like this:
```
find docs/ -name '*.txt' \
  | grep -o '/[^/%]*%3' \
  | sort \
  | uniq -c \
  | awk '{print $1,$2}' \
  | sort -k1,1 -n \
  | tail -n20
```
I've collected them in `doc-list-filter.grep` to filter the document list:
```
find docs -name '*.txt' | grep -vF -f doc-list-filter.grep > docs.txt
```
I'm left with 5.78M of 10.2M documents.

### Convert to Markdown
We convert from the MediaWiki markup to Markdown using [pandoc](https://pandoc.org/),
together with a custom filter [filter_markdown.py](filter_markdown.py) written with
[panflute](http://scorreia.com/software/panflute/)
that removes content that is not useful for us.
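
The filter itself is not reproduced here, but to give an idea of what such a panflute filter looks
like, here is a minimal sketch (for illustration it drops images and tables; the actual rules in
`filter_markdown.py` are different):
```
#!/usr/bin/env python3
# Minimal illustrative panflute filter -- NOT the actual filter_markdown.py.
# It shows the general shape: walk the pandoc AST and drop unwanted elements.
import panflute as pf

def action(elem, doc):
    # Returning an empty list removes the element from the document;
    # returning None keeps it unchanged.
    if isinstance(elem, (pf.Image, pf.Table)):
        return []
    return None

def main(doc=None):
    return pf.run_filter(action, doc=doc)

if __name__ == '__main__':
    main()
```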

Here's how to convert and filter one document:
```
pandoc --wrap=none -f mediawiki -t markdown < test/out/7a/77/Astronomer.txt \
  | pandoc --filter filter_markdown.py -t markdown
```
The script `convert-doc.sh` applies this conversion to stdin.
We can use [GNU Parallel](http://www.gnu.org/s/parallel) to apply it to all articles,
writing the output to the filename suffixed by `.md`:
```
parallel --verbose -j 8 ./convert-doc.sh '<' {} '>' {.}.md \
  < wiki/docs.txt \
  2>&1 | tee convert.log
```
This may take a few days.

There are some downsides to using pandoc here: it does not handle Wikipedia template
references and instead seems to drop them from the output. This leaves some sentences
with words missing in the middle. It is a relatively rare occurrence, so it should not be much
of a problem.

Also, the conversion crashes sometimes:
```
/home/leod/src/hncynic/data-wiki/convert-doc.sh < docs/85/Munich%E2%80%93Augsburg_railway.txt > docs/85/Munich%E2%80%93Augsburg_railway.md
Traceback (most recent call last):
  File "/home/leod/src/hncynic/data-wiki/filter_markdown.py", line 114, in <module>
    main()
  File "/home/leod/src/hncynic/data-wiki/filter_markdown.py", line 98, in main
    return run_filter(action, prepare=prepare, doc=doc)
  File "/home/leod/.local/lib/python3.6/site-packages/panflute/io.py", line 260, in run_filter
    return run_filters([action], *args, **kwargs)
...
  File "/home/leod/.local/lib/python3.6/site-packages/panflute/elements.py", line 1061, in __init__
    self.header = header
  File "/home/leod/.local/lib/python3.6/site-packages/panflute/elements.py", line 1097, in header
    raise IndexError(msg)
IndexError: table header has an incorrect number of cols: 6 rows but expected 8
pandoc: Error running filter /home/leod/src/hncynic/data-wiki/filter_markdown.py
Filter returned error status 1
```
How often?
```
$ grep "pandoc: Error running filter" convert.log | wc -l
208757
```
This means we'll lose about 3.6% of the articles when converting to Markdown.
Not cool, but I can live with it.

### Convert to TSV
We use each section of an article as an individual training example.

```
find docs -name '*.md' > docs.md.txt
parallel --verbose -j 8 \
  ./clean_text.sh \
  '<' {} \
  '|' ./md_to_tsv.py {} \
  '>' {.}.tsv \
  < docs.md.txt \
  > convert.tsv.log 2>&1
```
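
For context, the idea behind `md_to_tsv.py` is to recover the article title from the
(percent-encoded) file name, split the cleaned Markdown into sections, and emit one
`title<TAB>text` line per section. The following is only a rough sketch of that idea, handling
only `#`-style headers; the actual script differs (and, as noted under Issues below, still
misdetects some headers):
```
#!/usr/bin/env python3
# Rough sketch of the idea behind md_to_tsv.py -- not the actual script.
# Reads cleaned Markdown on stdin, takes the file name as argv[1] to recover
# the article title, and writes one "title<TAB>section text" line per section.
import re
import sys
import urllib.parse
from pathlib import Path

def article_title(path):
    # File names look like percent-encoded titles with '_' for spaces.
    return urllib.parse.unquote(Path(path).stem).replace('_', ' ')

def sections(markdown_text):
    """Yield (header, text) pairs, splitting on '#'-style headers only."""
    header, lines = None, []
    for line in markdown_text.splitlines():
        m = re.match(r'#+\s*(.*?)\s*#*\s*$', line)
        if m:
            if lines:
                yield header, ' '.join(lines)
            header, lines = m.group(1), []
        elif line.strip():
            lines.append(line.strip())
    if lines:
        yield header, ' '.join(lines)

if __name__ == '__main__':
    title = article_title(sys.argv[1])
    for header, text in sections(sys.stdin.read()):
        full_title = title if not header else title + ': ' + header
        print(full_title + '\t' + text)
```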

The resulting data is far from perfect; for example, it still contains some leftover Wiki markup.

### Concatenate
Split into train/dev/test (this time it's easier because we have one file per title):
```
find docs -name '*.tsv' > docs.tsv.txt
shuf docs.tsv.txt > docs.tsv.shuf.txt
awk 'NR <= 2000' docs.tsv.shuf.txt > docs.tsv.dev.txt
awk 'NR > 2000 && NR <= 4000' docs.tsv.shuf.txt > docs.tsv.test.txt
awk 'NR > 4000' docs.tsv.shuf.txt > docs.tsv.train.txt
```
Sanity check:
```
$ sort -u docs.tsv.txt | wc -l
5801101
$ cat docs.tsv.{train,dev,test}.txt | sort -u | wc -l
5801101
```
Concatenate:
```
cat $(cat docs.tsv.dev.txt) > dev.tsv
cat $(cat docs.tsv.test.txt) > test.tsv

# SLOW:
while read file; do cat $file; done < docs.tsv.train.txt > train.tsv
```
I found the last command for concatenating the training data to be quite slow (I estimated
that it would take more than 1 day to complete). Maybe this is because of the overhead of
starting a new `cat` process for each of the almost 6M files. I've written a small Python
utility that makes this step run significantly faster:
```
./cat_stdin.py < docs.tsv.train.txt > train.tsv
```
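
Conceptually, all it needs to do is read file names from stdin and stream each file's contents to
stdout from a single process. A minimal sketch of that idea (the actual `cat_stdin.py` in this
directory may differ in details):
```
#!/usr/bin/env python3
# Minimal sketch: read one file name per line from stdin and copy each file's
# contents to stdout, all within a single process (no per-file cat overhead).
import shutil
import sys

for line in sys.stdin:
    path = line.rstrip('\n')
    if not path:
        continue
    with open(path, 'rb') as f:
        shutil.copyfileobj(f, sys.stdout.buffer)
```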

### Normalization?
Now, we could again apply Moses preprocessing etc., but I'm not sure that is the right way to go,
due to all the special symbols such as LaTeX code and the triple backticks in Markdown. Also,
Wikipedia text is already pretty well normalized, so we can probably get away without
tokenization.

Okay, so for now the only normalization we do here is to lowercase the titles.
```
./preprocess_tsv.sh train
./preprocess_tsv.sh dev
./preprocess_tsv.sh test
```
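
Judging by the file names used in the BPE step below, this produces `*.pp.titles` and
`*.pp.comments` files with lowercased titles. I haven't reproduced `preprocess_tsv.sh` here, but
under that assumption the gist could look roughly like this (a sketch, not the actual script):
```
#!/usr/bin/env python3
# Sketch only -- assumes preprocess_tsv.sh splits <prefix>.tsv into
# <prefix>.pp.titles (lowercased) and <prefix>.pp.comments. The real script
# may do more or use different names.
import sys

prefix = sys.argv[1]  # e.g. "train"
with open(prefix + '.tsv') as tsv, \
     open(prefix + '.pp.titles', 'w') as titles, \
     open(prefix + '.pp.comments', 'w') as texts:
    for line in tsv:
        title, _, text = line.rstrip('\n').partition('\t')
        titles.write(title.lower() + '\n')
        texts.write(text + '\n')
```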

### Issues
After all this, there are still a bunch of issues with the data. Here's what I know of:
- [`md_to_tsv.py`](md_to_tsv.py) occasionally outputs a title like this:
  ```
  2015 africa cup of nations qualification group e: ------------------------------------------------------------------------
  ```
  This is probably due to failed header detection. It happens in only 9908 of the 16593956
  titles in the training data.
- The articles sometimes still contain table markup.
- Even though I filtered many redirects, the training data still contains some.
  I count 15768 (0.1% of all examples) in the final training data.
- As mentioned above, some Wikipedia template markup such as automatic unit conversion is not handled,
  resulting in incomplete sentences.

### BPE
Similar to the [Hacker News data](../data), we learn a BPE word segmentation on the training data.
We have a lot more training data here than before, so we use [fastBPE](https://github.com/glample/fastBPE),
which is a faster implementation.

Still, learning BPE on the full data takes a long time, so let's just use a subsample:
```
paste train.pp.{titles,comments} | shuf > train.pp.shuf.titles-comments
cut -f1 train.pp.shuf.titles-comments | head -n 2000000 > bpetrain.pp.titles
cut -f2 train.pp.shuf.titles-comments | head -n 2000000 > bpetrain.pp.comments

fastBPE/fast learnbpe 32000 bpetrain.pp.titles bpetrain.pp.comments > bpecodes
```
Apply segmentation to data:
```
for i in {test,dev,train}; do
  for j in {comments,titles}; do
    fastBPE/fast applybpe $i.pp.bpe.$j $i.pp.$j bpecodes
  done
done
```

### Training the model
See [../train-wiki](../train-wiki).

## Appendix
### Section Lengths
How long are the texts from the training examples?
```
../data/length-distr.awk < train.pp.comments > length-distr.train.pp.comments
gnuplot \
  ../data/length-distr.plot \
  -e "set ylabel 'p(length)'; plot '../data/length-distr.data.train.pp.comments' t 'Hacker News comments' w l ls 1, 'length-distr.train.pp.comments' t 'Wikipedia sections' w l ls 2" \
  > length-distr.train.pp.comments.svg
```
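
For reference, `../data/length-distr.awk` essentially computes a normalized histogram of
whitespace-token counts per line (that's my understanding of the script); a rough Python
equivalent would be:
```
#!/usr/bin/env python3
# Rough Python equivalent of ../data/length-distr.awk as I understand it:
# print "length p(length)" pairs, i.e. a normalized histogram of
# whitespace-token counts per input line.
import sys
from collections import Counter

counts = Counter(len(line.split()) for line in sys.stdin)
total = sum(counts.values())
for length in sorted(counts):
    print(length, counts[length] / total)
```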

![](length-distr.train.pp.comments.svg)
Interestingly, the distribution of Wikipedia sections is a lot less smooth than that of Hacker News comments.
Is this an effect of our data processing, or something inherent in the data?