Aligning Sentences from Standard Wikipedia to Simple Wikipedia
This work improves monolingual sentence alignment for text simplification, specifically for text in standard and simple Wikipedia. We introduce a method that improves over past efforts by using a greedy (vs. ordered) search over the document and a word-level semantic similarity score based on Wiktionary (vs. WordNet) that also accounts for structural similarity through syntactic dependencies. Experiments show improved performance on a hand-aligned set, with the largest gain coming from structural similarity. Resulting datasets of manually and automatically aligned sentence pairs are made available.
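As a rough illustration of the greedy (unordered) search described above, the following sketch pairs each simple sentence with its best-scoring standard sentence anywhere in the article. It makes several assumptions: the `jaccard` word-overlap function is only a stand-in for the WikNet-based structural similarity, and `greedy_align` and the `threshold` value are hypothetical names and settings chosen for this example, not taken from the paper.

```python
def jaccard(a: str, b: str) -> float:
    """Stand-in similarity: word-overlap Jaccard (NOT the WikNet score)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def greedy_align(simple_sents, standard_sents, sim=jaccard, threshold=0.3):
    """Pair each simple sentence with its best-scoring standard sentence.

    Every standard sentence in the document is a candidate, so no
    ordering constraint is imposed on the resulting alignment.
    """
    pairs = []
    for i, s in enumerate(simple_sents):
        best_j, best_score = None, 0.0
        for j, t in enumerate(standard_sents):
            score = sim(s, t)
            if score > best_score:
                best_j, best_score = j, score
        # Keep the pair only if the best match clears the (assumed) threshold.
        if best_j is not None and best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


simple = ["the cat sat on the mat", "dogs bark loudly"]
standard = ["large dogs often bark loudly at night",
            "the cat sat quietly on the mat"]
alignment = greedy_align(simple, standard)
```

In this toy input the sentence order is reversed between the two documents, which is the situation where an unordered search can still recover both pairs.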
The full paper is available here. The WikNet scores can be found here. The manually and automatically (good, good partial, and uncategorized) aligned datasets are available here. The manually and automatically aligned sentence pairs were extracted from a total of 68 K and 52 M sentence pairs, respectively. The original processed Wikipedia (both the English and the simplified version) is here. The number of good and good partial matches is as follows:
Our method, the greedy search strategy with structural WikNet similarity, significantly outperforms the other approaches, especially at high recall.
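Comparing methods at a given recall level relies on scoring predicted alignments against the hand-aligned set. A minimal sketch of that computation follows, assuming alignments are represented as (simple index, standard index) pairs; the function name `precision_recall` and the toy data are hypothetical and not part of the released datasets.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted sentence pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example: pairs are (simple_sentence_index, standard_sentence_index).
predicted_pairs = [(0, 1), (1, 0), (2, 2)]
gold_pairs = [(0, 1), (1, 0), (3, 4), (5, 5)]
p, r = precision_recall(predicted_pairs, gold_pairs)
```

Sweeping the similarity threshold used to accept a pair and recomputing these two values at each setting would trace out a precision-recall curve.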
This research was supported by the National Science Foundation grant numbers IIS-0916951 and IIS-1352249.
William Hwang, Hannaneh Hajishirzi, Mari Ostendorf, and Wei Wu. Aligning Sentences from Standard Wikipedia to Simple Wikipedia. NAACL-HLT, 2015.