Aligning Sentences from Standard Wikipedia to Simple Wikipedia

William Hwang, Hannaneh Hajishirzi, Mari Ostendorf, and Wei Wu.

Abstract

This work improves monolingual sentence alignment for text simplification, specifically for text in standard and simple Wikipedia. We introduce a method that improves over past efforts by using a greedy (vs. ordered) search over the document and a word-level semantic similarity score based on Wiktionary (vs. WordNet) that also accounts for structural similarity through syntactic dependencies. Experiments show improved performance on a hand-aligned set, with the largest gain coming from structural similarity. Resulting datasets of manually and automatically aligned sentence pairs are made available.
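To illustrate what a greedy (rather than ordered) search means here, the following is a minimal sketch and not the implementation from the paper. It assumes that each simple sentence is matched to its highest-scoring standard sentence anywhere in the document, with no constraint that matches preserve sentence order. The similarity function `jaccard` is a plain word-overlap stand-in for the Wiktionary-based structural score, and the names `greedy_align` and `threshold` and the value 0.3 are hypothetical.

```python
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a placeholder, NOT the paper's WikNet score."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def greedy_align(
    simple: List[str],
    standard: List[str],
    sim: Callable[[str, str], float] = jaccard,
    threshold: float = 0.3,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Match each simple sentence to its best standard sentence.

    Candidates are searched over the whole document, so the resulting
    alignments may cross (no monotonic ordering is enforced).
    """
    pairs = []
    if not standard:
        return pairs
    for i, s in enumerate(simple):
        scores = [sim(s, t) for t in standard]
        j = max(range(len(standard)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


if __name__ == "__main__":
    simple_doc = ["The cat sat on the mat", "Dogs bark loudly"]
    standard_doc = [
        "Dogs often bark loudly at night",
        "The cat sat on the mat yesterday",
        "Unrelated text here",
    ]
    for i, j, score in greedy_align(simple_doc, standard_doc):
        print(f"simple[{i}] <-> standard[{j}]  score={score:.2f}")
```

In this toy input the two matches cross each other, which an ordered search would not allow. Swapping in a different `sim` function changes the scoring without touching the search.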

The full paper is available here.

Datasets

The WikNet scores can be found here. The manually and automatically (good, good partial, and uncategorized) aligned datasets are available here. The manually and automatically aligned sentence pairs were extracted from a total of 68K and 52M sentence pairs, respectively. The number of good and good partial matches is as follows:

Results

Our method, the greedy search strategy with structural WikNet similarity, significantly outperforms the other approaches, especially at high recall.

Support

This research was supported by the National Science Foundation under grant numbers IIS-0916951 and IIS-1352249.

Citation

William Hwang, Hannaneh Hajishirzi, Mari Ostendorf, and Wei Wu. Aligning Sentences from Standard Wikipedia to Simple Wikipedia. NAACL-HLT, 2015.