Media-related vocabulary gathering project

Shun

状元
Many thanks, your lists look very clean and realistic! Now if we tagged each occurrence of a word with a date and scraped articles for a year, we could find some "words of the month" for each month or draw interesting frequency graphs. :) (There may even be a seasonality for some expressions?)
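As a rough illustration of the "words of the month" idea, here is a minimal sketch. It assumes each word occurrence has already been segmented and tagged with an ISO date string; the function name `words_of_the_month`, its parameters, and the scoring rule (a word's share of the month divided by its share of the whole corpus) are illustrative choices, not something already used in this project.

```python
from collections import Counter, defaultdict


def words_of_the_month(dated_tokens, top_n=3, min_count=2):
    """Return {month: [words]} for words over-represented in that month.

    dated_tokens: iterable of (iso_date, word) pairs, e.g. ("2020-03-14", "英雄").
    The scoring rule is an assumption: relative frequency within the month
    divided by relative frequency over the whole corpus.
    """
    monthly = defaultdict(Counter)
    overall = Counter()
    for date, word in dated_tokens:
        month = date[:7]  # "2020-03-14" -> "2020-03"
        monthly[month][word] += 1
        overall[word] += 1

    total = sum(overall.values())
    result = {}
    for month, counts in sorted(monthly.items()):
        month_total = sum(counts.values())
        scored = []
        for word, n in counts.items():
            if n < min_count:  # skip rare words, whose ratios are noisy
                continue
            ratio = (n / month_total) / (overall[word] / total)
            scored.append((ratio, word))
        # Highest ratio first; ties broken alphabetically for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[month] = [word for _, word in scored[:top_n]]
    return result
```

Plotting the per-month counts of a single word from `monthly` would give the frequency graphs, and a word that scores highly in the same calendar month across several years would hint at seasonality.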

Thanks also for your example from an article. Of course, an important function of such articles is to boost readers' morale, see them through this dark stretch, and unite them. I see the word 英雄 (hero) is quite frequent, ranking 1603rd.
 

BenJackson

秀才
I'm actually scraping the data into a database, which is very helpful for de-duping articles. Besides the natural duplication I see due to checking the whole feed each day, I also see them posting the same article on more than one of their sites, and even occasionally on multiple days (sometimes attributed to different authors!). I do keep all of the metadata (like dates), so more analysis will be possible later.
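For anyone curious what this kind of de-duplication can look like, here is a minimal sketch using SQLite. The schema, the table names `articles` and `sightings`, and the choice to fingerprint the whitespace-stripped body with SHA-256 are all assumptions made for illustration, not a description of the actual database. Keying on the body rather than the URL or author is what lets the same article be recognized across sites, days, and bylines.

```python
import hashlib
import sqlite3


def open_db(path=":memory:"):
    """Create the two hypothetical tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               body_hash TEXT PRIMARY KEY,
               title TEXT, body TEXT, first_seen TEXT)"""
    )
    conn.execute(
        """CREATE TABLE IF NOT EXISTS sightings (
               body_hash TEXT, url TEXT, author TEXT, seen_on TEXT,
               UNIQUE (body_hash, url, seen_on))"""
    )
    return conn


def fingerprint(body):
    """Hash the body with all whitespace removed, so layout changes do not matter."""
    normalized = "".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def store(conn, url, title, body, author, seen_on):
    """Record one scraped item; return True only if the body text is new."""
    h = fingerprint(body)
    already_known = conn.execute(
        "SELECT 1 FROM articles WHERE body_hash = ?", (h,)
    ).fetchone() is not None
    if not already_known:
        conn.execute(
            "INSERT INTO articles VALUES (?, ?, ?, ?)", (h, title, body, seen_on)
        )
    # Every appearance is kept as metadata, even for duplicate bodies.
    conn.execute(
        "INSERT OR IGNORE INTO sightings VALUES (?, ?, ?, ?)",
        (h, url, author, seen_on),
    )
    conn.commit()
    return not already_known
```

An exact hash only catches identical text; articles republished with small edits would need a fuzzier comparison.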

In fact, pretty soon I will need to get smarter about the analysis step. Right now I just extract all of the titles and article bodies into a giant flat text file and treat it like a giant novel. If I keep going for a year I'm definitely going to have to switch to analyzing only the new articles incrementally and keeping partially tabulated data in the DB, which will naturally produce the time series you are thinking of.

Then I start daydreaming about using AWS Lambda to do the scraping and updating, keeping the DB in the cloud, etc. Then I realize why people charge money for pre-existing corpora and I consider just buying one!

Oh BTW, 英雄 is actually even more common in SUBTLEX at 1079. People love their heroes!
 

Shun

状元
Interesting, nice project design! An incremental DB sounds good. I think whether you should go all-out with AWS depends on whether you see it as enough of a challenge for yourself.

Thanks for your observation on the frequency of 英雄 in SUBTLEX. This gives me the feeling that the Chinese think in terms of heroes more than most Westerners do, perhaps due to their regard for their many legends (such as, more recently, Lei Feng).
 