Categories
Text Mining

Installing MeCab and RMeCab on a Mac

Since my last post, you may have played around with WebChamame a bit. The site is a powerful way to see what is possible in natural language processing, a set of methods used in linguistics and other social sciences to create statistics from texts. Although WebChamame lets you adjust a number of settings and export your data as downloadable files, the site has its limitations. Most prominent among them: it takes a long time to find the bits of information you might be interested in.

Statisticians created the computer language R and the editor RStudio to overcome that problem because they need to wrangle data quickly to produce useful information. It is also possible to tokenize premodern Japanese texts in R and RStudio, but it requires installing a program for morphological analysis on your own computer. As you might have noticed in the last post, WebChamame uses MeCab for its morphological analyses.

This is an illustration from Tsutsumi and Ogiso (2015) of WebChamame’s workflow.

Breaking Digital Premodern Japanese Texts Down into Words using WebChamame

If you want to analyze Japanese texts digitally, the first problem you might run up against is that Japanese does not use spaces between words. A computer needs those spaces to know when one word ends and the next begins. So you first need to “tokenize” the text, that is, split it into individual words. Deciding what counts as a word is difficult. Is a verb ending a word or a part of a word? Linguists discuss these kinds of questions for us literary scholars and have created the necessary tools so that we don’t have to insert spaces manually into a text. Imagine how much work that would be!
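To make the idea of tokenizing a little more concrete, here is a minimal sketch in Python. It is only a toy: the three-word dictionary `TOY_LEXICON` and the function `tokenize` are made up for this illustration, and the "always take the longest dictionary match" rule is a deliberate simplification. Real analyzers such as MeCab rely on large dictionaries and statistical models to choose between competing segmentations, so please don't read this as a description of how MeCab works internally.

```python
# Toy dictionary, invented for this example. A real morphological analyzer
# uses a lexicon with many thousands of entries.
TOY_LEXICON = {"春", "は", "あけぼの"}


def tokenize(text, lexicon):
    """Split unspaced text into tokens with greedy longest-match lookup."""
    # Length of the longest dictionary entry (1 if the dictionary is empty).
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                break
        else:
            # Nothing matched: treat the single character as its own token.
            candidate = text[i]
        tokens.append(candidate)
        i += len(candidate)
    return tokens


# Join the tokens with spaces, which is what a computer needs to see.
print(" ".join(tokenize("春はあけぼの", TOY_LEXICON)))
```

Even this small example hints at why the problem is hard: a greedy rule like this one has no way of deciding whether a verb ending belongs to the word before it, which is exactly the kind of question the linguists' tools are built to handle.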

This is an image from Den et al. (2007) illustrating the different levels at which words can be distinguished in modern Japanese, taking into account complex compound nouns and verbs, in this case verbs that combine a noun with the irregular verb “to do” (suru).

One way to see how computers can tokenize words (without installing anything on your own computer) is to use WebChamame. This site was built by researchers at the National Institute of Japanese Language and Linguistics. To get started, type a Japanese sentence into the left window on the WebChamame site.

This is a screengrab of the WebChamame website on June 28, 2022. Type your text into the box under 「テキストを入力」 to see what this site can do.