Automatic Text Difficulty Classifier Assisting the Selection Of Adequate Reading Materials For European Portuguese Teaching --paper

The system uses existing Natural Language Processing (NLP) tools, a parser and a hyphenator, and two corpora previously annotated by readability level.

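Hyphenator example (English):

from hyphen import Hyphenator  # assumes the PyHyphen package
h_en = Hyphenator('en_US')
h_en.pairs('beautiful')
# [['beau', 'tiful'], ['beauti', 'ful']]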

The system extracts 52 features, organized into 7 groups: parts-of-speech (POS), syllables, words, chunks and phrases, averages and frequencies, and some extra features.

Language: Portuguese

Two experiments: one based on a five-level scale
(A1, A2, B1, B2, C1),
and a second based on a simplified
three-level scale (A, B and C).
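
A minimal sketch of how labels on the five-level scale could be collapsed onto the three-level one. The mapping (each sublevel goes to its leading letter, so A1 and A2 become A) is an assumption of these notes, and `to_three_level` is a hypothetical helper name.

```python
# Assumed mapping: each sublevel collapses to its leading letter.
FIVE_LEVELS = ["A1", "A2", "B1", "B2", "C1"]


def to_three_level(label: str) -> str:
    """Collapse a five-level label (e.g. 'B2') to the three-level scale ('B')."""
    if label not in FIVE_LEVELS:
        raise ValueError(f"unknown level: {label!r}")
    return label[0]


labels = ["A1", "B2", "C1", "A2"]
collapsed = [to_three_level(label) for label in labels]
print(collapsed)
```

Rejecting unknown labels explicitly keeps annotation typos from silently becoming a new class.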

3 NLP tools
STRING: roughly the Portuguese counterpart of NLTK
The YAH Hyphenator: a rule-based system that applies
various word-division rules.
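
A rough sketch of how hyphenator output could feed syllable-based metrics. YAH's actual interface is not shown in these notes, so the code takes any `syllabify` callable that returns a list of syllables; the lookup table below is a hand-written stand-in, not YAH, and all names are hypothetical.

```python
def syllable_stats(words, syllabify):
    """Return total syllables and average syllables per word.

    syllabify: any callable mapping a word to a list of syllables.
    """
    counts = [len(syllabify(word)) for word in words]
    if not counts:
        return {"total_syllables": 0, "avg_syllables_per_word": 0.0}
    total = sum(counts)
    return {
        "total_syllables": total,
        "avg_syllables_per_word": total / len(counts),
    }


# Stand-in for a real hyphenator: hand-written divisions for three words.
TOY_DIVISIONS = {
    "casa": ["ca", "sa"],
    "bonita": ["bo", "ni", "ta"],
    "sol": ["sol"],
}

stats = syllable_stats(["casa", "bonita", "sol"], TOY_DIVISIONS.__getitem__)
print(stats)
```

Passing the hyphenator in as a callable means the metric code does not need to change if a different tool is used.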

hypotaxis: subordinate structure (subordination)
parataxis: coordinate structure (coordination)

4 Features
The set of 52 features extracted by the system consists
of: (i) part-of-speech (POS) tags, chunks, words
and sentence features; (ii) verb features and different
metrics involving averages and frequencies; (iii)
several metrics involving syllables; and (iv) extra features.
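
A minimal sketch of a few simple features of this kind: verb and noun proportions and average sentence length, computed from POS-tagged sentences. The input format and the tag names `"VERB"` and `"NOUN"` are placeholders chosen for illustration, not STRING's tagset, and the paper's exact feature definitions may differ.

```python
def pos_and_length_features(tagged_sentences):
    """Compute verb ratio, noun ratio, and average sentence length.

    tagged_sentences: list of sentences, each a list of (token, tag) pairs.
    The tag names "VERB" and "NOUN" are placeholders, not STRING's tagset.
    """
    n_sentences = len(tagged_sentences)
    tags = [tag for sentence in tagged_sentences for _, tag in sentence]
    n_tokens = len(tags)
    if n_tokens == 0:
        return {"verb_ratio": 0.0, "noun_ratio": 0.0, "avg_sentence_length": 0.0}
    return {
        "verb_ratio": tags.count("VERB") / n_tokens,
        "noun_ratio": tags.count("NOUN") / n_tokens,
        "avg_sentence_length": n_tokens / n_sentences,
    }


sample = [
    [("O", "DET"), ("gato", "NOUN"), ("dorme", "VERB")],
    [("A", "DET"), ("menina", "NOUN"), ("leu", "VERB"), ("o", "DET"), ("livro", "NOUN")],
]
features = pos_and_length_features(sample)
print(features)
```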

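Nouns and named entity recognition are important for text comprehension
Syntactic structure: noun phrases, prepositional phrases
Auxiliary verbs can form longer, more complex verb chains
Hypotaxis (subordinate structures)
Parataxis (coordinate structures)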
Word frequency: unigram-based, with Laplace smoothing
Verb and noun ratios, sentence length
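
A small sketch of a unigram word-frequency model with Laplace (add-one) smoothing, as mentioned in the list above. The function names, the toy corpus, and the choice to reserve one extra vocabulary slot for unseen words are assumptions of this sketch; the paper's exact formulation is not given in these notes.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Return (counts, total token count, vocabulary size)."""
    counts = Counter(tokens)
    return counts, len(tokens), len(counts)


def laplace_prob(word, counts, total, vocab_size, alpha=1.0):
    """P(word) = (count + alpha) / (total + alpha * (V + 1)).

    The extra +1 reserves one slot for all unseen words (sketch assumption).
    """
    return (counts.get(word, 0) + alpha) / (total + alpha * (vocab_size + 1))


def mean_log_prob(tokens, counts, total, vocab_size):
    """Average log-probability of a text; lower values suggest rarer words."""
    if not tokens:
        return 0.0
    logs = [math.log(laplace_prob(t, counts, total, vocab_size)) for t in tokens]
    return sum(logs) / len(logs)


corpus = "o gato viu o rato".split()
counts, total, vocab_size = train_unigram(corpus)
p_seen = laplace_prob("o", counts, total, vocab_size)
p_unseen = laplace_prob("cão", counts, total, vocab_size)
print(p_seen, p_unseen)
```

Smoothing ensures that a word absent from the reference corpus still gets a nonzero probability, so the log in `mean_log_prob` is always defined.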