Since my last post, you may have played around with WebChamame a bit. That site is a powerful tool for seeing what’s possible in natural language processing, a method used in linguistics and other social sciences to create statistics from texts. Although WebChamame lets you adjust a number of settings and export your data as downloadable files, the site has its limitations. Most prominent among them: it takes a long time to find the bits of information you might be interested in.
Statisticians created the computer language R and the editor RStudio to overcome that problem because they need to wrangle data quickly to produce useful information. It is also possible to tokenize premodern Japanese texts in R and RStudio, but it requires installing a program for morphological analysis on your own computer. As you might have noticed in the last post, WebChamame uses MeCab for its morphological analyses.
If you haven’t already, you will need to install R from the Comprehensive R Archive Network (CRAN). This gives you the basic packages for R, called “base R,” which means you will have the necessary software to run a range of statistical analyses. For more niche computing, you will need to later install additional packages. RMeCab is one such package.
To use MeCab with R, you will also need RMeCab, but I am getting ahead of myself.
Then, for convenience, you will need to install RStudio for your operating system, in our case macOS. For beginners, the free license allows you to do everything you need. RStudio is an integrated development environment (IDE), which means you can open scripts and run code within one computer program. By the end of this post, you’ll know what a script is, and I’ll give you one for installing RMeCab and getting it to run as well.
Now for the difficult part. MeCab is both state-of-the-art and rather old. Old means two very different things in computer science and literary studies: The newest update is from 2013, which is almost a decade ago already! That means installing MeCab requires doing some things by hand.
Next, open a Terminal window on your Mac.
Then type or paste the code below (without the dollar sign) after the percent symbol in your open Terminal. It installs Apple’s command line developer tools, which you need in order to build MeCab from its source code. I found this code on the GitHub site of Ishida Motohiro, professor of linguistics at Tokushima University and the author of RMeCab.
$ xcode-select --install
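If you want to double-check that the tools are in place before moving on, here is a minimal sketch of such a check. It is not part of Prof. Ishida’s instructions: the function name missing_tools is made up for this example, and the list of programs (curl, tar, make, and the C++ compiler c++) is my assumption about what the installation steps rely on.

```shell
#!/bin/sh
# Hypothetical helper: prints every given command name that is NOT on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed list of programs needed to download and build MeCab.
missing=$(missing_tools curl tar make c++)
if [ -z "$missing" ]; then
  echo "all build tools found"
else
  echo "missing: $missing"
fi
```

Because the default shell on recent Macs does not accept comment lines pasted at the prompt, it is probably easiest to save this in a file (say, check-tools.sh) and run it with sh check-tools.sh.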
Continue by typing or pasting the following code line by line (again without the dollar signs marking each line) into your Terminal. It will download the software and install it on your computer.
You won’t be able to go back and edit a line once you’ve inserted it, but you can delete and redo it or try again if you get an error.
$ cd ~/Downloads
$ curl -fsSL 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7cENtOXlicTFaRUE' -o mecab-0.996.tar.gz
$ tar xf mecab-0.996.tar.gz
$ cd mecab-0.996
$ ./configure --with-charset=utf8
$ make
$ sudo make install
$ cd ~/Downloads
$ curl -fsSL 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM' -o mecab-ipadic-2.7.0-20070801.tar.gz
$ tar zvxf mecab-ipadic-2.7.0-20070801.tar.gz
$ tar xf mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
$ ./configure --with-charset=utf-8
$ make
$ sudo make install
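If one of the tar commands fails, one possible cause is that the download did not deliver a real archive (for instance, an error page saved under the archive’s name). As a minimal sketch, assuming that is what went wrong, you could test the file before unpacking it. The helper name is_gzip_archive is hypothetical; gzip -t is gzip’s own integrity test.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original instructions):
# succeeds only if the file exists and passes gzip's integrity test.
is_gzip_archive() {
  [ -f "$1" ] && gzip -t "$1" 2>/dev/null
}

# Only report on the MeCab archive if it has actually been downloaded.
archive="$HOME/Downloads/mecab-0.996.tar.gz"
if [ -f "$archive" ]; then
  if is_gzip_archive "$archive"; then
    echo "looks like a valid gzip archive: $archive"
  else
    echo "not a gzip archive (the download may have failed): $archive"
  fi
fi
```

If the check fails, deleting the file and repeating the curl line is a reasonable first thing to try.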
In effect this installs two things on your computer: MeCab and the IPA dictionary that it recommends. (I will write another post on how to install the UniDic dictionaries for premodern Japanese.) You can now check if it’s working by inputting the following test:
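$ mecab
すもももももももものうち
Type the first line (without the dollar sign) and hit enter to start MeCab, then type or paste the sentence on the second line. MeCab will keep waiting for more sentences until you press “control”+“D.” The sentence is a tongue twister that means, “Japanese prunes (Prunus salicina) and peaches (Prunus persica) are both plants of the prune genus (Prunus).” If you hit enter again, you will hopefully get something like the following: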
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
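Each line of this output holds one token: the word as it appears in the text, a tab, and then a comma-separated list of features that starts with the part of speech. EOS marks the end of the sentence. To make that structure concrete, here is a small sketch that keeps only the word and its part of speech. The function name surface_and_pos is made up for this example, and the sample input is copied from the output above so that the snippet runs even without MeCab.

```shell
#!/bin/sh
# Hypothetical helper: reads MeCab's default output on standard input and
# prints "word<TAB>part of speech" for each token, skipping the EOS marker.
surface_and_pos() {
  awk -F '\t' '$1 != "EOS" && NF >= 2 { split($2, feat, ","); print $1 "\t" feat[1] }'
}

# Two sample tokens from the output above, followed by the EOS line.
printf 'すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\nも\t助詞,係助詞,*,*,*,*,も,モ,モ\nEOS\n' | surface_and_pos
```

With MeCab installed, the same function should work on live output, for example with echo "すもももももももものうち" | mecab | surface_and_pos.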
Congratulations! You successfully installed MeCab! Do you feel like a coder yet? You will once you switch to another computer language: R.
The last step is to install RMeCab so that you can use MeCab in the R environment. RMeCab is an R package that works as an interface between R and MeCab and was written by Prof. Ishida. Open R or RStudio and paste the following line of code in the console (after the “>” sign) and hit “return.”
install.packages("RMeCab", repos = "https://rmecab.jp/R", type = "source")
Now RMeCab is installed in your version of R. After installation, there is one more step before you can use an R package: you need to load RMeCab for this project, like this:
library(RMeCab)
Now RMeCab is loaded. Whenever you use it in a new project, you will not have to install the package again, but you will have to load it this way.
How do you get R to output the same results as our test in the Terminal? You have to tell R to parse the sentence, like this:
res <- RMeCabC("すもももももももものうち")
I used “res” here to mean “result,” but you could use whatever variable name you like. Finally, you have to tell R to print out the results, like this:
print(res)
Did you see the tongue twister parsed in your console? Try this with another sentence in modern Japanese. And there you go. You can now parse sentences in modern Japanese in R!
I have saved this R code in a script entitled “StartingWithRMeCab.r” that you can find on a GitHub repository I made for this blog. Instead of copy-pasting each line into your console, you can open the script in RStudio and execute each line individually by clicking somewhere on it and typing (on a Mac) “command”+“return.”
Please let me know if you’re successful! I’m sharing this as I learn, so I would love to hear what works and what doesn’t.
Tsutsumi, Tomoaki, and Toshinobu Ogiso. 2015. “Rekishiteki shiryō wo taisho to shita fukusū no UniDic jisho ni yoru keitaisokaiseki shien tsūru ‘WebChamame.’” In Jinbun kagaku to konpyūta shinpojiumu. http://id.nii.ac.jp/1001/00146542/.