# Wiki Data

Here, we prepare training data from an English Wikipedia data dump.
Nothing too interesting, just a series of tedious steps because I don't know any better tools for this.

## Steps

### Download

Download a dump of the English Wikipedia in XML format.
This can be done here for example:
[https://dumps.wikimedia.org/backup-index.html](https://dumps.wikimedia.org/backup-index.html).
What we are interested in is a file like `enwiki-latest-pages-articles.xml.bz2`.

The dump that we experiment with here is named `enwiki-20190201-pages-articles.xml` and has a
compressed size of 15GB.
### Split

I use [xmldump2files.py](https://github.com/adamwulf/wikipedia2text/blob/master/xmldump2files.py)
to split the XML dump into individual files, one per document:
```
bzcat enwiki-latest-pages-articles.xml.bz2 | ./xmldump2files.py /dev/stdin docs
```
Documents will be saved in some kind of hash tree in the `docs/` directory.
For example, there will be the file `docs/2f/7c/Abraham_Lincoln.txt`.
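The exact hash scheme is an implementation detail of `xmldump2files.py`, so for spot checks it is easiest to just glob for a title. A minimal sketch, not part of the actual pipeline (it assumes spaces become underscores in file names, while other characters may be percent-escaped):
```
#!/usr/bin/env python3
# Minimal sketch: locate an extracted article by title in the docs/ hash tree,
# without knowing the hash scheme used by xmldump2files.py.
import glob
import sys

def find_article(title, root='docs'):
    name = glob.escape(title.replace(' ', '_'))
    return glob.glob('{}/*/*/{}.txt'.format(root, name))

if __name__ == '__main__':
    for path in find_article(' '.join(sys.argv[1:])):
        print(path)
```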
### Filter Documents

I get about 10M extracted documents, as `xmldump2files.log` shows:
```
Redirects 8465477 Deleted 0 Disambigs 271019 Lists 231905 Skipped 0 Wrote 10200000 50.01GiB Total 10200000 50.01GiB (185%)
```
The official
[Wikipedia statistics](https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#Annual_growth_rate_for_the_English_Wikipedia)
say that there are currently about 5.8M articles.
The dump that I downloaded contains a lot of non-articles.
Note that [our version of xmldump2files.py](xmldump2files.py) already filters out redirects and disambiguations.
The most serious remaining offenders can be found like this:
```
find docs/ -name '*.txt' \
  | grep -o '/[^/%]*%3' \
  | sort \
  | uniq -c \
  | awk '{print $1,$2}' \
  | sort -k1,1 -n \
  | tail -n20
```
I've collected them in `doc-list-filter.grep` to filter the document list:
```
find docs -name '*.txt' | grep -vF -f doc-list-filter.grep > docs.txt
```
I'm left with 5.78M of 10.2M documents.
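For reference, the same filtering step as a minimal Python sketch (the shell pipeline above is what was actually used; `filter_docs.py` is a hypothetical name):
```
#!/usr/bin/env python3
# Minimal sketch of the document-list filtering step: drop paths containing
# any of the substrings listed in doc-list-filter.grep (grep -vF -f equivalent).
# Usage: find docs -name '*.txt' | ./filter_docs.py > docs.txt
import sys

def load_patterns(path='doc-list-filter.grep'):
    with open(path) as f:
        return [line.rstrip('\n') for line in f if line.strip()]

def main():
    patterns = load_patterns()
    for line in sys.stdin:
        path = line.rstrip('\n')
        if not any(p in path for p in patterns):
            print(path)

if __name__ == '__main__':
    main()
```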
### Convert to Markdown

We convert from the MediaWiki markup to Markdown using [pandoc](https://pandoc.org/),
together with a custom filter, [filter_markdown.py](filter_markdown.py), written with
[panflute](http://scorreia.com/software/panflute/),
that removes content that is not useful for us.

Here's how to convert and filter one document:
```
pandoc --wrap=none -f mediawiki -t markdown < test/out/7a/77/Astronomer.txt \
  | pandoc --filter filter_markdown.py -t markdown
```
The script `convert-doc.sh` applies this conversion to stdin.
We can use [GNU Parallel](http://www.gnu.org/s/parallel) to apply it to all articles,
writing the output to the filename suffixed by `.md`:
```
parallel --verbose -j 8 ./convert-doc.sh '<' {} '>' {.}.md \
  < wiki/docs.txt \
  2>&1 | tee convert.log
```
This may take a few days.

There are some downsides to using pandoc here: it does not handle Wikipedia template
references and instead seems to remove them from the output. This leads to a few sentences
missing words in the middle. This is relatively rare, though, so it should not be much
of a problem.
Also, the conversion crashes sometimes:
```
/home/leod/src/hncynic/data-wiki/convert-doc.sh < docs/85/Munich%E2%80%93Augsburg_railway.txt > docs/85/Munich%E2%80%93Augsburg_railway.md
Traceback (most recent call last):
  File "/home/leod/src/hncynic/data-wiki/filter_markdown.py", line 114, in <module>
    main()
  File "/home/leod/src/hncynic/data-wiki/filter_markdown.py", line 98, in main
    return run_filter(action, prepare=prepare, doc=doc)
  File "/home/leod/.local/lib/python3.6/site-packages/panflute/io.py", line 260, in run_filter
    return run_filters([action], *args, **kwargs)
  ...
  File "/home/leod/.local/lib/python3.6/site-packages/panflute/elements.py", line 1061, in __init__
    self.header = header
  File "/home/leod/.local/lib/python3.6/site-packages/panflute/elements.py", line 1097, in header
    raise IndexError(msg)
IndexError: table header has an incorrect number of cols: 6 rows but expected 8
pandoc: Error running filter /home/leod/src/hncynic/data-wiki/filter_markdown.py
Filter returned error status 1
```
How often?
```
$ grep "pandoc: Error running filter" convert.log | wc -l
208757
```
This means we'll lose about 3.6\% of the articles (208757 of the 5.78M documents) while converting to Markdown.
Not cool, but I can live with it.
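To get a rough idea of what makes the filter crash, one can tally the exception lines in `convert.log`. A minimal sketch, assuming the tracebacks look like the excerpt above (`tally_filter_errors.py` is a hypothetical name):
```
#!/usr/bin/env python3
# Tally the exception types that made filter_markdown.py crash, based on the
# final "SomethingError: ..." line of each traceback in convert.log.
# Usage: ./tally_filter_errors.py < convert.log
from collections import Counter
import re
import sys

counts = Counter()
for line in sys.stdin:
    m = re.match(r'(\w+Error): ', line)
    if m:
        counts[m.group(1)] += 1
for name, n in counts.most_common():
    print(n, name)
```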
### Convert to TSV

We use each section of an article as an individual training example.

```
find docs -name '*.md' > docs.md.txt
parallel --verbose -j 8 \
  ./clean_text.sh \
  '<' {} \
  '|' ./md_to_tsv.py {} \
  '>' {.}.tsv \
  < docs.md.txt \
  > convert.tsv.log 2>&1
```

The resulting data is far from perfect; e.g., it still contains some leftover wiki markup.
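The actual [`md_to_tsv.py`](md_to_tsv.py) is not shown here; the following is only a minimal sketch of the idea, assuming one tab-separated line per section with `<article title>: <section heading>` as the title field and the flattened section text as the example text (inferred from the examples in the Issues section below):
```
#!/usr/bin/env python3
# Minimal sketch (not the actual md_to_tsv.py): split a Markdown article read
# from stdin into one TSV row per section. Assumes ATX-style '#' headings start
# a new section, the article title comes from the file name given as argv[1],
# and tabs/newlines inside the section text are collapsed to spaces.
import os
import re
import sys
import urllib.parse

def sections(text):
    heading, lines = None, []
    for line in text.splitlines():
        m = re.match(r'#+\s*(.+)', line)
        if m:
            if lines:
                yield heading, ' '.join(lines)
            heading, lines = m.group(1).strip(), []
        elif line.strip():
            lines.append(line.strip())
    if lines:
        yield heading, ' '.join(lines)

def main(path):
    name = re.sub(r'\.(md|txt)$', '', os.path.basename(path))
    title = urllib.parse.unquote(name).replace('_', ' ')
    for heading, body in sections(sys.stdin.read()):
        left = title if heading is None else '{}: {}'.format(title, heading)
        print('{}\t{}'.format(left, body.replace('\t', ' ')))

if __name__ == '__main__':
    main(sys.argv[1])
```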
### Concatenate

Split into train/dev/test (this time it's easier because we have one file per title):
```
find docs -name '*.tsv' > docs.tsv.txt
shuf docs.tsv.txt > docs.tsv.shuf.txt
awk 'NR <= 2000' docs.tsv.shuf.txt > docs.tsv.dev.txt
awk 'NR > 2000 && NR <= 4000' docs.tsv.shuf.txt > docs.tsv.test.txt
awk 'NR > 4000' docs.tsv.shuf.txt > docs.tsv.train.txt
```
Sanity check:
```
$ sort -u docs.tsv.txt | wc -l
5801101
$ cat docs.tsv.{train,dev,test}.txt | sort -u | wc -l
5801101
```
Concatenate:
```
cat $(cat docs.tsv.dev.txt) > dev.tsv
cat $(cat docs.tsv.test.txt) > test.tsv

# SLOW:
while read file; do cat $file; done < docs.tsv.train.txt > train.tsv
```
I found the last command for concatenating the training data to be quite slow (I estimated
that it would take more than 1 day to complete). Maybe this is because of the overhead of
starting a new `cat` process for each of the almost 6M files. I've written a small Python
utility that makes this step run significantly faster:
```
./cat_stdin.py < docs.tsv.train.txt > train.tsv
```
### Normalization?

Now, we could again apply Moses preprocessing etc., but I'm not sure it is the right way to go,
due to all the special symbols such as LaTeX code and the triple backticks in Markdown. Also,
Wikipedia text itself is already pretty well normalized, so we can probably get away without
tokenization.

Okay, so for now the only normalization we do here is to lowercase the titles:
```
./preprocess_tsv.sh train
./preprocess_tsv.sh dev
./preprocess_tsv.sh test
```
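`preprocess_tsv.sh` itself is not part of this commit; the following is only a hedged sketch of the described step, assuming it splits `<prefix>.tsv` into the `<prefix>.pp.titles` and `<prefix>.pp.comments` files used by the BPE step below, lowercasing the titles:
```
#!/usr/bin/env python3
# Sketch of the normalization step (an assumption, not the actual
# preprocess_tsv.sh): split a two-column TSV into a .pp.titles and a
# .pp.comments file, lowercasing the title column.
# Usage: ./preprocess_tsv_sketch.py train
import sys

def main(prefix):
    with open(prefix + '.tsv') as tsv, \
         open(prefix + '.pp.titles', 'w') as titles, \
         open(prefix + '.pp.comments', 'w') as comments:
        for line in tsv:
            fields = line.rstrip('\n').split('\t')
            if len(fields) != 2:
                continue  # skip malformed lines
            titles.write(fields[0].lower() + '\n')
            comments.write(fields[1] + '\n')

if __name__ == '__main__':
    main(sys.argv[1])
```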
### Issues

After all this, there are still a bunch of issues with the data. Here's what I know of:
- [`md_to_tsv.py`](md_to_tsv.py) occasionally outputs a title like this:
  ```
  2015 africa cup of nations qualification group e: ------------------------------------------------------------------------
  ```
  This is probably due to failed header detection. It happens in only 9908 of the 16593956
  titles in the training data (see the counting sketch after this list).
- The articles sometimes still contain table markup.
- Even though I filtered many redirects, the training data still contains some.
  I count 15768 (0.1% of all examples) in the final training data.
- As mentioned above, some Wikipedia template markup, such as automatic unit conversion, is not handled,
  resulting in incomplete sentences.
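A minimal sketch of how such degenerate titles can be counted (how the 9908 figure above was obtained is not shown; this simply looks for titles ending in a long run of dashes, i.e. a leftover heading underline):
```
#!/usr/bin/env python3
# Count training titles that look like failed header detection, i.e. titles
# ending in a long run of dashes.
# Usage: ./count_bad_titles.py < train.pp.titles
import re
import sys

bad = total = 0
for line in sys.stdin:
    total += 1
    if re.search(r'-{10,}\s*$', line):
        bad += 1
print('{} of {} titles look degenerate'.format(bad, total))
```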
### BPE

Similar to the [Hacker News data](../data), we learn a BPE word segmentation on the training data.
We have a lot more training data here than before, so we use [fastBPE](https://github.com/glample/fastBPE),
which is a faster implementation.

Still, learning BPE on the full data takes a long time, so let's just use a subsample:
```
paste train.pp.{titles,comments} | shuf > train.pp.shuf.titles-comments
cut -f1 train.pp.shuf.titles-comments | head -n 2000000 > bpetrain.pp.titles
cut -f2 train.pp.shuf.titles-comments | head -n 2000000 > bpetrain.pp.comments

fastBPE/fast learnbpe 32000 bpetrain.pp.titles bpetrain.pp.comments > bpecodes
```
Apply segmentation to data:
```
for i in {test,dev,train}; do
  for j in {comments,titles}; do
    fastBPE/fast applybpe $i.pp.bpe.$j $i.pp.$j bpecodes
  done
done
```
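To eyeball the segmented files, it helps to be able to undo the segmentation again. A minimal sketch, assuming fastBPE's usual `@@ ` continuation marker (a line-final `@@` right before the newline is not handled here):
```
#!/usr/bin/env python3
# Undo BPE segmentation by re-joining subword units, assuming the "@@ "
# continuation marker that fastBPE appends to non-final subwords.
# Usage: ./debpe.py < train.pp.bpe.comments
import sys

for line in sys.stdin:
    sys.stdout.write(line.replace('@@ ', ''))
```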
### Training the model

See [../train-wiki](../train-wiki).

## Appendix

### Section Lengths

How long are the texts from the training examples?
```
../data/length-distr.awk < train.pp.comments > length-distr.train.pp.comments
gnuplot \
  ../data/length-distr.plot \
  -e "set ylabel 'p(length)'; plot '../data/length-distr.data.train.pp.comments' t 'Hacker News comments' w l ls 1, 'length-distr.train.pp.comments' t 'Wikipedia sections' w l ls 2" \
  > length-distr.train.pp.comments.svg
```
![length distribution](length-distr.train.pp.comments.svg)

Interestingly, the length distribution of Wikipedia sections is a lot less smooth than that of Hacker News comments.
Is this an effect of our data processing, or something inherent in the data?
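For reference, the empirical length distribution itself is easy to recompute; a minimal sketch, assuming whitespace-tokenized lengths and `length p(length)` output similar to what `../data/length-distr.awk` presumably produces:
```
#!/usr/bin/env python3
# Minimal sketch of an empirical length distribution: count whitespace tokens
# per line on stdin, then print "length p(length)" pairs.
# Usage: ./length_distr.py < train.pp.comments
from collections import Counter
import sys

counts = Counter(len(line.split()) for line in sys.stdin)
total = sum(counts.values())
for length in sorted(counts):
    print(length, counts[length] / total)
```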
`cat_stdin.py`, the small concatenation utility referenced in the Concatenate step above:
```
#!/usr/bin/env python3
# Concatenate the contents of all files whose names are read from stdin,
# one file name per line, writing everything to stdout.

import sys

for fname in sys.stdin:
    with open(fname[:-1]) as f:  # fname[:-1] strips the trailing newline
        for line in f:
            sys.stdout.write(line)
```
`clean_text.sh`, the Markdown cleanup pipeline applied before `md_to_tsv.py` in the Convert to TSV step (it removes leftover `{{...}}` templates, empty parentheses and HTML comments, and rewrites bold/italics to `_..._`):
```
perl -pe 'BEGIN{undef $/;} s#\{\{.*\}\}##s' \
  | sed -e 's/()//g' -e 's/\*\*\([^*]*\)\*\*/_\1_/g' -e 's/\*\([^*]*\)\*/_\1_/g' -e 's/<!--[^-]*-->//g'
```
`convert-doc.sh`, which converts a single MediaWiki document on stdin to filtered Markdown; it strips galleries and timelines and resolves a few common templates (`{{abbr}}`, `{{as of}}`, `{{convert}}`) before handing the text to pandoc and `filter_markdown.py`:
```
perl -pe 'BEGIN{undef $/;} s#<gallery>.*</gallery>##s' \
  | perl -pe 'BEGIN{undef $/;} s#<timeline>.*</timeline>##s' \
  | sed -e 's/{{abbr|[^|]*|\([^}]\+\)}}/\1/g' \
        -e 's/{{as of|[^|}]*|alt=\([^|}]*\)}}/\1/g' \
        -e 's/{{as of|\([^|}]*\)}}/as of \1/g' \
        -e 's/{{as of|\([^|]*\)|[^}]*}}/as of \1/g' \
        -e 's/{{convert|\([^|]*\)|\([a-zA-Z]*\)[^}]*}}/\1 \2/g' \
  | pandoc --wrap=none -f mediawiki --filter $(dirname $0)/filter_markdown.py -t markdown
```
`doc-list-filter.grep`, the fixed-string patterns used above to drop non-article namespaces (`%3` is the beginning of a URL-escaped `:`):
```
/Category%3
/File%3
/Template%3
/Wikipedia%3
/Help%3
/MediaWiki%3
/Book%3
/Module%3
/Draft%3
/Portal%3
/Index_of_
```
`filter_markdown.py`, the panflute filter used by `convert-doc.sh`:
```
#!/usr/bin/env python3
"""
Panflute filter that removes Wikipedia content that is not useful for us:
ignored sections, images, tables, footnotes and raw markup. Links are
replaced by their description text, and Strong is rewritten to Emph.
"""

import sys

from panflute import *

# Sections that we filter out completely
SECTION_IGNORE = [
    'See also',
    'Notes',
    'References',
    'External links',
    'Gallery',
    'Works cited',
    'External links and suggested reading',
    'Editions',
    'Discography',
    'Filmography',
    'Notes and references',
    'Further reading',
    'Works',
    'Awards',
    'Patents & awards',
]

# Elements that we ignore
ELEMENT_IGNORE = [
    Image,
    Note,  # footnotes and endnotes
    Table,
    RawBlock,
    RawInline,
    SmallCaps,
    Strikeout,
    Subscript,
    Superscript,
]

def prepare(doc):
    doc.my_current_section = None

def action(elem, doc):
    #sys.stderr.write(repr(elem) + '\n')

    # We must not filter out Doc, that leads to errors
    if isinstance(elem, Doc):
        return elem

    # Filter certain sections
    if doc.my_current_section is not None and doc.my_current_section in SECTION_IGNORE:
        return []

    if isinstance(elem, Link):
        # For links, only keep the description text
        descr = elem.content.list

        # Filter links that don't have any description
        if len(descr) == 0:
            return []

        # Filter strange thumbnail links
        #
        # E.g. there is a link whose description is:
        # [Str(thumb|upright=1.1|),
        #  Emph(Str([The) Space Str(Astronomer](The_Astronomer_(Vermeer)) Space Str("wikilink"))),
        #  Space, Str(by), Space, Str([Johannes), Space, Str(Vermeer](Johannes_Vermeer), Space,
        #  Str("wikilink"))]
        link_str = stringify(elem)
        if (isinstance(descr[0], Str) and 'thumb|' in descr[0].text) or 'thumb|' in link_str:
            return []

        # Also ignore links to Wikipedia media
        if link_str.startswith('File:') or link_str.startswith('Category:'):
            return []

        return descr
    elif any([isinstance(elem, t) for t in ELEMENT_IGNORE]):
        return []
    elif isinstance(elem, Header):
        #if elem.level == 2:
        doc.my_current_section = stringify(elem)
        if doc.my_current_section in SECTION_IGNORE:
            return []
    elif isinstance(elem, Strong):
        # Hacker News only has Emph
        return Emph(*elem.content)
    elif isinstance(elem, Emph):
        # Hacker News only has Emph
        if len(elem.content) == 1 and isinstance(elem.content[0], Emph):
            return elem.content[0]
        else:
            return elem

def main(doc=None):
    return run_filter(action, prepare=prepare, doc=doc)

if __name__ == '__main__':
    # If I don't do the following, I get:
    #
    # $ pandoc --wrap=none -f mediawiki -t markdown --filter filter_markdown.py < wiki/docs/00/New_Holland%2C_South_Dakota.txt
    # [...]
    #   File "[...]/panflute/elements.py", line 695, in __init__
    #     self.format = check_group(format, RAW_FORMATS)
    #   File "[...]/panflute/utils.py", line 34, in check_group
    #     raise TypeError(msg)
    # TypeError: element str not in group {'rtf', 'noteref', 'openxml', 'opendocument', 'latex', 'icml', 'html', 'context', 'tex'}
    # pandoc: Error running filter filter_markdown.py
    # Filter returned error status 1

    elements.RAW_FORMATS.add('mediawiki')
    main()
```
