David Porter's Chinese Text Sampler as a Single, Convenient Download

Discussion in 'Document Exchange' started by Captain Planet, May 30, 2015.

  1. There's this wonderful website by David Porter at the University of Michigan that collects samples of Chinese text at different levels of proficiency for easy practice. Each text is introduced with a brief paragraph in English.

    I adapted this collection for offline use with Pleco Reader. Namely:
    • Downloaded all of the text samples. A few download links were wrong or broken; where I could guess the actual filename I used it, and where I couldn't, I copied the text from the online HTML version, so nothing is missing in the end. I left out the lists of most frequent characters and surnames, since there are better ways to study those, as well as the Three Character Classic and the Thousand Character Classic, which are already part of my collection posted in another thread. I did, however, include the excerpts from The Analects and the other texts that my other collection has in full. Apart from these minor differences, everything from the website is here.
    • Converted all the files to UTF-8 with BOM (the original encoding was GB 2312) and Windows (CR+LF) line endings.
    • Converted the text into full-form (traditional) characters using OpenCC, profile s2tw.json. I left a few files in simplified characters; their names are prefixed with [S].
    • Did a little cleanup using regular expressions (unwrap text, separate paragraphs, remove indents, remove unnecessary newlines, fix punctuation, other fixes where necessary) and added some finishing touches manually.
    • For each of the texts I copied the introductory descriptive paragraph from the website and pasted it at the top so that when you open any of them, you'll see a brief description in English first.
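
For anyone who wants to repeat the encoding, cleanup, and description steps on their own files, here is a rough Python sketch. It is not the exact procedure used for this collection: the function name `convert_text` and the specific regular expressions are assumptions for illustration, it only handles indents, trailing whitespace, and extra blank lines (unwrapping and punctuation fixes need per-file judgment), and the OpenCC conversion step is left out.

```python
import codecs
import re


def convert_text(raw: bytes, description: str = "") -> bytes:
    """Re-encode GB 2312 bytes as UTF-8 with BOM and CR+LF line endings.

    Hypothetical helper: the cleanup rules below are illustrative only.
    """
    # gb18030 is a superset of GB 2312, so it also tolerates stray
    # characters that fall outside the base GB 2312 set.
    text = raw.decode("gb18030")

    # Normalize every line-ending style to "\n" before cleaning up.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Remove indents (spaces, tabs, full-width spaces) and trailing whitespace.
    text = re.sub(r"(?m)^[ \t\u3000]+", "", text)
    text = re.sub(r"(?m)[ \t\u3000]+$", "", text)

    # Collapse runs of blank lines so paragraphs are separated by exactly one.
    text = re.sub(r"\n{3,}", "\n\n", text).strip("\n")

    # Put the English description (if any) at the top of the file.
    if description:
        text = description.strip() + "\n\n" + text

    # Switch to Windows line endings and prepend the UTF-8 byte order mark.
    text = text.replace("\n", "\r\n") + "\r\n"
    return codecs.BOM_UTF8 + text.encode("utf-8")
```

To use it, read a file in binary mode, pass the bytes (plus the description copied from the website) to `convert_text`, and write the result back in binary mode so Python does not touch the line endings again.
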
    Categorization. The files are split into seven categories: 古典經文 Old Classics, 政治社會 Politics & Society, 故事傳說 Stories & Legends, 生活環境 Living Environment, 當代文學 Contemporary Literature, 電影劇稿 Film Scripts, and 音樂歌詞 Song Lyrics. Note that this is different from the way the texts are arranged on the website.

    Naming convention. File names are prefixed with [n.n], where n.n is the difficulty grade as indicated on the website, theoretically ranging from 1.0 to 7.0 (1.6 to 5.4 among the included texts). The grade appears to be based on the number and rarity of the characters in a text, not on how hard it actually is to understand. In other words, with a classical text you might well know every character and still not be able to make much sense of it, and the number does not account for that kind of difficulty. Still, it's better than nothing, so the files are named so that they sort from least to most challenging within their respective category.
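
Because the grade always has the form n.n, a plain alphabetical sort of the file names already gives the intended order. If you ever need the grade as a number (say, to pick only texts below a certain level), something like the following Python sketch should work. The helper names and the example file names are made up for illustration, and files without a grade prefix are simply sorted last.

```python
import re

# Matches a leading "[n.n]" difficulty grade, e.g. "[3.2]".
GRADE_PREFIX = re.compile(r"^\[(\d\.\d)\]")


def grade_of(filename: str) -> float:
    """Return the grade from the prefix, or infinity if there is none."""
    match = GRADE_PREFIX.match(filename)
    return float(match.group(1)) if match else float("inf")


def sort_by_grade(filenames):
    """Sort by numeric grade, using the name itself to break ties."""
    return sorted(filenames, key=lambda name: (grade_of(name), name))


# Hypothetical file names, not taken from the actual collection.
names = ["[5.4] c.txt", "[1.6] a.txt", "[3.0] b.txt"]
easy = [n for n in sort_by_grade(names) if grade_of(n) <= 3.0]
```
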

    What's included:
    Preview: 茉莉花 (most texts are much longer)
    Download below. The collection reflects what was on the website as of May 2015. Enjoy!

    Attached Files:

    Last edited: May 31, 2015
