# Wiki Data

Here, we prepare training data from an English Wikipedia data dump.
Nothing too interesting, just a series of tedious steps because I don't know any better tools for this.

## Steps

### Download

Download a dump of the English Wikipedia in XML format.
This can be done here for example:
[https://dumps.wikimedia.org/backup-index.html](https://dumps.wikimedia.org/backup-index.html).
What we are interested in is a file like `enwiki-latest-pages-articles.xml.bz2`.

The dump that we experiment with here is named `enwiki-20190201-pages-articles.xml` and has a
compressed size of 15GB.
### Split

I use [xmldump2files.py](https://github.com/adamwulf/wikipedia2text/blob/master/xmldump2files.py)
to split the XML dump into individual files, one per document:
```
bzcat enwiki-latest-pages-articles.xml.bz2 | ./xmldump2files.py /dev/stdin docs
```
Documents will be saved in some kind of hash tree in the `docs/` directory.
For example, there will be the file `docs/2f/7c/Abraham_Lincoln.txt`.
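The exact hash scheme is an implementation detail of `xmldump2files.py`, so for spot checks it is easiest to just glob for a title. A minimal sketch, not part of the actual pipeline (it assumes spaces become underscores in file names, while other characters may be percent-escaped):
```
#!/usr/bin/env python3
# Minimal sketch: locate an extracted article by title in the docs/ hash tree,
# without knowing the hash scheme used by xmldump2files.py.
import glob
import sys

def find_article(title, root='docs'):
    name = glob.escape(title.replace(' ', '_'))
    return glob.glob('{}/*/*/{}.txt'.format(root, name))

if __name__ == '__main__':
    for path in find_article(' '.join(sys.argv[1:])):
        print(path)
```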
### Filter Documents

I get about 10M extracted documents, as `xmldump2files.log` shows:
```
Redirects 8465477 Deleted 0 Disambigs 271019 Lists 231905 Skipped 0 Wrote 10200000 50.01GiB Total 10200000 50.01GiB (185%)
```
The official
[Wikipedia statistics](https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#Annual_growth_rate_for_the_English_Wikipedia)
say that there are currently about 5.8M articles.
The dump that I downloaded contains a lot of non-articles.
Note that [our version of xmldump2files.py](xmldump2files.py) already filters out redirects and disambiguations.
The most serious remaining offenders can be found like this:
```
find docs/ -name '*.txt' \
  | grep -o '/[^/%]*%3' \
  | sort \
  | uniq -c \
  | awk '{print $1,$2}' \
  | sort -k1,1 -n \
  | tail -n20
```
I've collected them in `doc-list-filter.grep` to filter the document list:
```
find docs -name '*.txt' | grep -vF -f doc-list-filter.grep > docs.txt
```
I'm left with 5.78M of 10.2M documents.
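For reference, the same filtering step as a minimal Python sketch (the shell pipeline above is what was actually used; `filter_docs.py` is a hypothetical name):
```
#!/usr/bin/env python3
# Minimal sketch of the document-list filtering step: drop paths containing
# any of the substrings listed in doc-list-filter.grep (grep -vF -f equivalent).
# Usage: find docs -name '*.txt' | ./filter_docs.py > docs.txt
import sys

def load_patterns(path='doc-list-filter.grep'):
    with open(path) as f:
        return [line.rstrip('\n') for line in f if line.strip()]

def main():
    patterns = load_patterns()
    for line in sys.stdin:
        path = line.rstrip('\n')
        if not any(p in path for p in patterns):
            print(path)

if __name__ == '__main__':
    main()
```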
### Convert to Markdown

We convert from the MediaWiki markup to Markdown using [pandoc](https://pandoc.org/),
together with a custom filter, [filter_markdown.py](filter_markdown.py), written with
[panflute](http://scorreia.com/software/panflute/),
that removes content that is not useful for us.

Here's how to convert and filter one document:
```
pandoc --wrap=none -f mediawiki -t markdown < test/out/7a/77/Astronomer.txt \
  | pandoc --filter filter_markdown.py -t markdown
```
The script `convert-doc.sh` applies this conversion to stdin.
We can use [GNU Parallel](http://www.gnu.org/s/parallel) to apply it to all articles,
writing the output to the filename suffixed by `.md`:
```
parallel --verbose -j 8 ./convert-doc.sh '<' {} '>' {.}.md \
  < wiki/docs.txt \
  2>&1 | tee convert.log
```
This may take a few days.

There are some downsides to using pandoc here: it does not handle Wikipedia template
references and instead seems to remove them from the output. This leads to a few sentences
missing words in the middle. This is relatively rare, though, so it should not be much
of a problem.
Also, the conversion crashes sometimes:
```
/home/leod/src/hncynic/data-wiki/convert-doc.sh < docs/85/Munich%E2%80%93Augsburg_railway.txt > docs/85/Munich%E2%80%93Augsburg_railway.md
Traceback (most recent call last):
  File "/home/leod/src/hncynic/data-wiki/filter_markdown.py", line 114, in <module>
    main()
  File "/home/leod/src/hncynic/data-wiki/filter_markdown.py", line 98, in main
    return run_filter(action, prepare=prepare, doc=doc)
  File "/home/leod/.local/lib/python3.6/site-packages/panflute/io.py", line 260, in run_filter
    return run_filters([action], *args, **kwargs)
  ...
  File "/home/leod/.local/lib/python3.6/site-packages/panflute/elements.py", line 1061, in __init__
    self.header = header
  File "/home/leod/.local/lib/python3.6/site-packages/panflute/elements.py", line 1097, in header
    raise IndexError(msg)
IndexError: table header has an incorrect number of cols: 6 rows but expected 8
pandoc: Error running filter /home/leod/src/hncynic/data-wiki/filter_markdown.py
Filter returned error status 1
```
How often?
```
$ grep "pandoc: Error running filter" convert.log | wc -l
208757
```
This means we'll lose about 3.6\% of the articles (208757 of the 5.78M documents) while converting to Markdown.
Not cool, but I can live with it.
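To get a rough idea of what makes the filter crash, one can tally the exception lines in `convert.log`. A minimal sketch, assuming the tracebacks look like the excerpt above (`tally_filter_errors.py` is a hypothetical name):
```
#!/usr/bin/env python3
# Tally the exception types that made filter_markdown.py crash, based on the
# final "SomethingError: ..." line of each traceback in convert.log.
# Usage: ./tally_filter_errors.py < convert.log
from collections import Counter
import re
import sys

counts = Counter()
for line in sys.stdin:
    m = re.match(r'(\w+Error): ', line)
    if m:
        counts[m.group(1)] += 1
for name, n in counts.most_common():
    print(n, name)
```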
### Convert to TSV

We use each section of an article as an individual training example.

```
find docs -name '*.md' > docs.md.txt
parallel --verbose -j 8 \
  ./clean_text.sh \
  '<' {} \
  '|' ./md_to_tsv.py {} \
  '>' {.}.tsv \
  < docs.md.txt \
  > convert.tsv.log 2>&1
```

The resulting data is far from perfect; e.g., it still contains some leftover wiki markup.
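The actual [`md_to_tsv.py`](md_to_tsv.py) is not shown here; the following is only a minimal sketch of the idea, assuming one tab-separated line per section with `<article title>: <section heading>` as the title field and the flattened section text as the example text (inferred from the examples in the Issues section below):
```
#!/usr/bin/env python3
# Minimal sketch (not the actual md_to_tsv.py): split a Markdown article read
# from stdin into one TSV row per section. Assumes ATX-style '#' headings start
# a new section, the article title comes from the file name given as argv[1],
# and tabs/newlines inside the section text are collapsed to spaces.
import os
import re
import sys
import urllib.parse

def sections(text):
    heading, lines = None, []
    for line in text.splitlines():
        m = re.match(r'#+\s*(.+)', line)
        if m:
            if lines:
                yield heading, ' '.join(lines)
            heading, lines = m.group(1).strip(), []
        elif line.strip():
            lines.append(line.strip())
    if lines:
        yield heading, ' '.join(lines)

def main(path):
    name = re.sub(r'\.(md|txt)$', '', os.path.basename(path))
    title = urllib.parse.unquote(name).replace('_', ' ')
    for heading, body in sections(sys.stdin.read()):
        left = title if heading is None else '{}: {}'.format(title, heading)
        print('{}\t{}'.format(left, body.replace('\t', ' ')))

if __name__ == '__main__':
    main(sys.argv[1])
```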
### Concatenate

Split into train/dev/test (this time it's easier because we have one file per title):
```
find docs -name '*.tsv' > docs.tsv.txt
shuf docs.tsv.txt > docs.tsv.shuf.txt
awk 'NR <= 2000' docs.tsv.shuf.txt > docs.tsv.dev.txt
awk 'NR > 2000 && NR <= 4000' docs.tsv.shuf.txt > docs.tsv.test.txt
awk 'NR > 4000' docs.tsv.shuf.txt > docs.tsv.train.txt
```
Sanity check:
```
$ sort -u docs.tsv.txt | wc -l
5801101
$ cat docs.tsv.{train,dev,test}.txt | sort -u | wc -l
5801101
```
Concatenate:
```
cat $(cat docs.tsv.dev.txt) > dev.tsv
cat $(cat docs.tsv.test.txt) > test.tsv

# SLOW:
while read file; do cat $file; done < docs.tsv.train.txt > train.tsv
```
I found the last command for concatenating the training data to be quite slow (I estimated
that it would take more than 1 day to complete). Maybe this is because of the overhead of
starting a new `cat` process for each of the almost 6M files. I've written a small Python
utility that makes this step run significantly faster:
```
./cat_stdin.py < docs.tsv.train.txt > train.tsv
```
### Normalization?

Now, we could again apply Moses preprocessing etc., but I'm not sure it is the right way to go,
due to all the special symbols such as LaTeX code and the triple backticks in Markdown. Also,
Wikipedia text itself is already pretty well normalized, so we can probably get away without
tokenization.

Okay, so for now the only normalization we do here is to lowercase the titles:
```
./preprocess_tsv.sh train
./preprocess_tsv.sh dev
./preprocess_tsv.sh test
```
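`preprocess_tsv.sh` itself is not part of this commit; the following is only a hedged sketch of the described step, assuming it splits `<prefix>.tsv` into the `<prefix>.pp.titles` and `<prefix>.pp.comments` files used by the BPE step below, lowercasing the titles:
```
#!/usr/bin/env python3
# Sketch of the normalization step (an assumption, not the actual
# preprocess_tsv.sh): split a two-column TSV into a .pp.titles and a
# .pp.comments file, lowercasing the title column.
# Usage: ./preprocess_tsv_sketch.py train
import sys

def main(prefix):
    with open(prefix + '.tsv') as tsv, \
         open(prefix + '.pp.titles', 'w') as titles, \
         open(prefix + '.pp.comments', 'w') as comments:
        for line in tsv:
            fields = line.rstrip('\n').split('\t')
            if len(fields) != 2:
                continue  # skip malformed lines
            titles.write(fields[0].lower() + '\n')
            comments.write(fields[1] + '\n')

if __name__ == '__main__':
    main(sys.argv[1])
```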
### Issues

After all this, there are still a bunch of issues with the data. Here's what I know of:
- [`md_to_tsv.py`](md_to_tsv.py) occasionally outputs a title like this:
  ```
  2015 africa cup of nations qualification group e: ------------------------------------------------------------------------
  ```
  This is probably due to failed header detection. It happens in only 9908 of the 16593956
  titles in the training data (see the counting sketch after this list).
- The articles sometimes still contain table markup.
- Even though I filtered many redirects, the training data still contains some.
  I count 15768 (0.1% of all examples) in the final training data.
- As mentioned above, some Wikipedia template markup, such as automatic unit conversion, is not handled,
  resulting in incomplete sentences.
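A minimal sketch of how such degenerate titles can be counted (how the 9908 figure above was obtained is not shown; this simply looks for titles ending in a long run of dashes, i.e. a leftover heading underline):
```
#!/usr/bin/env python3
# Count training titles that look like failed header detection, i.e. titles
# ending in a long run of dashes.
# Usage: ./count_bad_titles.py < train.pp.titles
import re
import sys

bad = total = 0
for line in sys.stdin:
    total += 1
    if re.search(r'-{10,}\s*$', line):
        bad += 1
print('{} of {} titles look degenerate'.format(bad, total))
```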
### BPE

Similar to the [Hacker News data](../data), we learn a BPE word segmentation on the training data.
We have a lot more training data here than before, so we use [fastBPE](https://github.com/glample/fastBPE),
which is a faster implementation.

Still, learning BPE on the full data takes a long time, so let's just use a subsample:
```
paste train.pp.{titles,comments} | shuf > train.pp.shuf.titles-comments
cut -f1 train.pp.shuf.titles-comments | head -n 2000000 > bpetrain.pp.titles
cut -f2 train.pp.shuf.titles-comments | head -n 2000000 > bpetrain.pp.comments

fastBPE/fast learnbpe 32000 bpetrain.pp.titles bpetrain.pp.comments > bpecodes
```
Apply segmentation to data:
```
for i in {test,dev,train}; do
  for j in {comments,titles}; do
    fastBPE/fast applybpe $i.pp.bpe.$j $i.pp.$j bpecodes
  done
done
```
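To eyeball the segmented files, it helps to be able to undo the segmentation again. A minimal sketch, assuming fastBPE's usual `@@ ` continuation marker (a line-final `@@` right before the newline is not handled here):
```
#!/usr/bin/env python3
# Undo BPE segmentation by re-joining subword units, assuming the "@@ "
# continuation marker that fastBPE appends to non-final subwords.
# Usage: ./debpe.py < train.pp.bpe.comments
import sys

for line in sys.stdin:
    sys.stdout.write(line.replace('@@ ', ''))
```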
### Training the model

See [../train-wiki](../train-wiki).

## Appendix

### Section Lengths

How long are the texts from the training examples?
```
../data/length-distr.awk < train.pp.comments > length-distr.train.pp.comments
gnuplot \
  ../data/length-distr.plot \
  -e "set ylabel 'p(length)'; plot '../data/length-distr.data.train.pp.comments' t 'Hacker News comments' w l ls 1, 'length-distr.train.pp.comments' t 'Wikipedia sections' w l ls 2" \
  > length-distr.train.pp.comments.svg
```
![length distribution](length-distr.train.pp.comments.svg)

Interestingly, the length distribution of Wikipedia sections is a lot less smooth than that of Hacker News comments.
Is this an effect of our data processing, or something inherent in the data?
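For reference, the empirical length distribution itself is easy to recompute; a minimal sketch, assuming whitespace-tokenized lengths and `length p(length)` output similar to what `../data/length-distr.awk` presumably produces:
```
#!/usr/bin/env python3
# Minimal sketch of an empirical length distribution: count whitespace tokens
# per line on stdin, then print "length p(length)" pairs.
# Usage: ./length_distr.py < train.pp.comments
from collections import Counter
import sys

counts = Counter(len(line.split()) for line in sys.stdin)
total = sum(counts.values())
for length in sorted(counts):
    print(length, counts[length] / total)
```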
`cat_stdin.py`, the small concatenation utility referenced in the Concatenate step above:
```
#!/usr/bin/env python3
# Concatenate the contents of all files whose names are read from stdin,
# one file name per line, writing everything to stdout.

import sys

for fname in sys.stdin:
    with open(fname[:-1]) as f:  # fname[:-1] strips the trailing newline
        for line in f:
            sys.stdout.write(line)
```
`clean_text.sh`, the Markdown cleanup pipeline applied before `md_to_tsv.py` in the Convert to TSV step (it removes leftover `{{...}}` templates, empty parentheses and HTML comments, and rewrites bold/italics to `_..._`):
```
perl -pe 'BEGIN{undef $/;} s#\{\{.*\}\}##s' \
  | sed -e 's/()//g' -e 's/\*\*\([^*]*\)\*\*/_\1_/g' -e 's/\*\([^*]*\)\*/_\1_/g' -e 's/<!--[^-]*-->//g'
```
`convert-doc.sh`, which converts a single MediaWiki document on stdin to filtered Markdown; it strips galleries and timelines and resolves a few common templates (`{{abbr}}`, `{{as of}}`, `{{convert}}`) before handing the text to pandoc and `filter_markdown.py`:
```
perl -pe 'BEGIN{undef $/;} s#<gallery>.*</gallery>##s' \
  | perl -pe 'BEGIN{undef $/;} s#<timeline>.*</timeline>##s' \
  | sed -e 's/{{abbr|[^|]*|\([^}]\+\)}}/\1/g' \
        -e 's/{{as of|[^|}]*|alt=\([^|}]*\)}}/\1/g' \
        -e 's/{{as of|\([^|}]*\)}}/as of \1/g' \
        -e 's/{{as of|\([^|]*\)|[^}]*}}/as of \1/g' \
        -e 's/{{convert|\([^|]*\)|\([a-zA-Z]*\)[^}]*}}/\1 \2/g' \
  | pandoc --wrap=none -f mediawiki --filter $(dirname $0)/filter_markdown.py -t markdown
```
`doc-list-filter.grep`, the fixed-string patterns used above to drop non-article namespaces (`%3` is the beginning of a URL-escaped `:`):
```
/Category%3
/File%3
/Template%3
/Wikipedia%3
/Help%3
/MediaWiki%3
/Book%3
/Module%3
/Draft%3
/Portal%3
/Index_of_
```
`filter_markdown.py`, the panflute filter used by `convert-doc.sh`:
```
#!/usr/bin/env python3
"""
Panflute filter that removes Wikipedia content that is not useful for us:
ignored sections, images, tables, footnotes and raw markup. Links are
replaced by their description text, and Strong is rewritten to Emph.
"""

import sys

from panflute import *

# Sections that we filter out completely
SECTION_IGNORE = [
    'See also',
    'Notes',
    'References',
    'External links',
    'Gallery',
    'Works cited',
    'External links and suggested reading',
    'Editions',
    'Discography',
    'Filmography',
    'Notes and references',
    'Further reading',
    'Works',
    'Awards',
    'Patents & awards',
]

# Elements that we ignore
ELEMENT_IGNORE = [
    Image,
    Note,  # footnotes and endnotes
    Table,
    RawBlock,
    RawInline,
    SmallCaps,
    Strikeout,
    Subscript,
    Superscript,
]

def prepare(doc):
    doc.my_current_section = None

def action(elem, doc):
    #sys.stderr.write(repr(elem) + '\n')

    # We must not filter out Doc, that leads to errors
    if isinstance(elem, Doc):
        return elem

    # Filter certain sections
    if doc.my_current_section is not None and doc.my_current_section in SECTION_IGNORE:
        return []

    if isinstance(elem, Link):
        # For links, only keep the description text
        descr = elem.content.list

        # Filter links that don't have any description
        if len(descr) == 0:
            return []

        # Filter strange thumbnail links
        #
        # E.g. there is a link whose description is:
        # [Str(thumb|upright=1.1|),
        #  Emph(Str([The) Space Str(Astronomer](The_Astronomer_(Vermeer)) Space Str("wikilink"))),
        #  Space, Str(by), Space, Str([Johannes), Space, Str(Vermeer](Johannes_Vermeer), Space,
        #  Str("wikilink"))]
        link_str = stringify(elem)
        if (isinstance(descr[0], Str) and 'thumb|' in descr[0].text) or 'thumb|' in link_str:
            return []

        # Also ignore links to Wikipedia media
        if link_str.startswith('File:') or link_str.startswith('Category:'):
            return []

        return descr
    elif any([isinstance(elem, t) for t in ELEMENT_IGNORE]):
        return []
    elif isinstance(elem, Header):
        #if elem.level == 2:
        doc.my_current_section = stringify(elem)
        if doc.my_current_section in SECTION_IGNORE:
            return []
    elif isinstance(elem, Strong):
        # Hacker News only has Emph
        return Emph(*elem.content)
    elif isinstance(elem, Emph):
        # Hacker News only has Emph
        if len(elem.content) == 1 and isinstance(elem.content[0], Emph):
            return elem.content[0]
        else:
            return elem

def main(doc=None):
    return run_filter(action, prepare=prepare, doc=doc)

if __name__ == '__main__':
    # If I don't do the following, I get:
    #
    # $ pandoc --wrap=none -f mediawiki -t markdown --filter filter_markdown.py < wiki/docs/00/New_Holland%2C_South_Dakota.txt
    # [...]
    #   File "[...]/panflute/elements.py", line 695, in __init__
    #     self.format = check_group(format, RAW_FORMATS)
    #   File "[...]/panflute/utils.py", line 34, in check_group
    #     raise TypeError(msg)
    # TypeError: element str not in group {'rtf', 'noteref', 'openxml', 'opendocument', 'latex', 'icml', 'html', 'context', 'tex'}
    # pandoc: Error running filter filter_markdown.py
    # Filter returned error status 1

    elements.RAW_FORMATS.add('mediawiki')
    main()
```
