Text Mining

Breaking Digital Premodern Japanese Texts Down into Words using WebChamame

If you want to analyze Japanese texts digitally, the first problem you might run up against is that Japanese does not use spaces between words. A computer needs those spaces to know when one word ends and the next begins. So, you first need to “tokenize” the text, that is, determine where each word begins and ends. Deciding what is a word and what is not is difficult. Is a verb ending a word or a part of a word? Linguists discuss these kinds of questions for us literary scholars and have created the necessary tools so that we don’t have to insert spaces manually into a text. Imagine how much work that would be!
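To see the problem concretely, here is a minimal Python sketch. The two sample sentences and the hand-made segmentation are my own illustrations, not output from any tool. Splitting on whitespace works for the English sentence but returns the Japanese sentence as one unbroken "word."

```python
# Splitting on spaces is enough for English but not for Japanese.
english = "the cat sat on the mat"
japanese = "私は本を読む"  # "I read a book," written without spaces

english_tokens = english.split()
japanese_tokens = japanese.split()

print(len(english_tokens), english_tokens)
print(len(japanese_tokens), japanese_tokens)

# One possible segmentation, done by hand for illustration only.
# A morphological analyzer has to find these boundaries automatically.
hand_segmented = ["私", "は", "本", "を", "読む"]
print("".join(hand_segmented) == japanese)
```

The last line only checks that the hand-made pieces add up to the original sentence; it says nothing about whether a linguist would draw the boundaries in the same places.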

This is an image from Den et al. (2007), illustrating the different levels of differentiation between words in modern Japanese, taking into account complex composite nouns and verbs, in this case verbs combining a noun with the irregular verb “to do” (suru).

One way to see how computers can tokenize words (without installing anything on your own computer) is to use WebChamame. This site was built by researchers at the National Institute for Japanese Language and Linguistics. To get started, type a Japanese sentence into the left window on the WebChamame site.

This is a screengrab of the WebChamame website on June 28, 2022. Type your text into the box under 「テキストを入力」 to see what this site can do.

I used this first line from the Tale of Genji by Murasaki Shikibu:


Source: JTI

Then, where it says “dictionary selection” 辞書選択 (jisho sentaku), choose the type of Japanese your text is written in from a range of options. The dictionaries WebChamame uses are the UniDic digital dictionaries compiled by the National Institute for Japanese Language and Linguistics. They include dictionaries for different time periods but also for written and spoken language. In this case, using text from the Tale of Genji, we have to choose the dictionary for Japanese of the Heian period labelled 中古和文 (chūko wabun). For now, leave the rest of the settings as they are and click on “analyze” 解析する (kaiseki suru).

You will land on a page that shows which dictionary you selected and your text followed by a large table breaking your text down into words and grammatical elements. It also provides perhaps more information than you need about each item.

This is a screen grab of the results page on the WebChamame website on June 28, 2022.

Starting from the left, it tells you which dictionary was used to analyze the item, in this case the 中古和文 dictionary. In the third column with the title “written form (=surface form)” 書字形(=表層形)shojikei (=hyōsōkei), you will see the word as it appears in the source text. The next two columns give you the dictionary form called “lexeme” 語彙素 goiso and its “reading” 語彙素読み goiso yomi. These are followed by a column listing the “part of speech” 品詞 hinshi, that is noun, pronoun, verb, etc. One column entitled “conjugation pattern” 活用型 katsuyōgata even gives you specific information about each verb conjugation. Another interesting column is the third from the right entitled “word classification” 語種 goshu, which tells you if the word is of Japanese or Chinese origin.
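The difference between the written form and the lexeme becomes clearer if you think of each table row as a small record. The following Python sketch is only an illustration: the `Row` class, its English field names, and the three sample rows are my own assumptions, typed by hand rather than copied from WebChamame. It groups conjugated forms under their dictionary form.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Row:
    """A hypothetical, simplified stand-in for one row of the results table."""
    surface: str  # written form, 書字形(=表層形)
    lexeme: str   # dictionary form, 語彙素
    pos: str      # part of speech, 品詞 (given in English here for readability)


# Hand-typed sample rows, not actual WebChamame output.
rows = [
    Row("読み", "読む", "verb"),
    Row("書き", "書く", "verb"),
    Row("読め", "読む", "verb"),
]


def group_by_lexeme(rows):
    """Collect every written form that belongs to the same lexeme."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.lexeme].append(row.surface)
    return dict(groups)


surfaces_by_lexeme = group_by_lexeme(rows)
print(surfaces_by_lexeme)
```

Counting by lexeme rather than by written form is what lets you treat 読み and 読め as two occurrences of the same verb.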

Try putting in other texts from the Japanese Text Initiative or another source. To make sure your text is parsed properly, though, be sure to choose the right dictionary. If you’re not sure which dictionary is right, you can select a few and compare the results. You can also have WebChamame produce your results in the form of an Excel file, which will automatically download to your computer.

This is an illustration from Tsutsumi and Ogiso (2015) of what WebChamame does behind the scenes to produce the results you see on your screen. Note that it uses MeCab to do the morphological analysis.

WebChamame does all of the analysis for you, but that does not mean it is the right tool for every task. For those of us interested in natural language processing to compute statistics about Japanese texts, it is rather inconvenient to pore over Excel files.
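If you would rather not pore over a spreadsheet, a few lines of code can do the counting. The sketch below is in Python and rests on several assumptions: that you have saved the results as CSV text, that the header row uses the short column names shown here, and that the part-of-speech labels are as simple as in my hand-typed sample. The real export may use longer headers and more fine-grained labels, so treat the column name as a parameter to adjust.

```python
import csv
import io
from collections import Counter

# Hand-typed sample standing in for an exported results file (an assumption,
# not real WebChamame output). Columns: written form, lexeme, part of speech.
sample_csv = """書字形,語彙素,品詞
私,私,代名詞
は,は,助詞
本,本,名詞
を,を,助詞
読む,読む,動詞
"""


def count_parts_of_speech(csv_text, column="品詞"):
    """Count how often each value appears in the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


pos_counts = count_parts_of_speech(sample_csv)
for pos, n in pos_counts.most_common():
    print(pos, n)
```

To count lexemes instead of parts of speech, pass `column="語彙素"` to the same function.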

A lot of statisticians use the computer language R to analyze data, and it is also possible to analyze Japanese texts using R. To do that, however, you need not only to know R but also to install MeCab and RMeCab on your computer, and that will require another post.


Den, Yasuharu, Toshinobu Ogiso, Hideki Oguro, Atsushi Yamada, Nobuaki Minematsu, Kiyotaka Uchimoto, and Hanae Koiso. 2007. “Kōpasu Nihongo no tame no gengo shigen: Keitaisokaisekiyō denshika jisho no kaihatsu to sono ōyō.” Nihongo kagaku 22 (October): 101–23.

Tsutsumi, Tomoaki, and Toshinobu Ogiso. 2015. “Rekishiteki shiryō wo taisho to shita fukusū no UniDic jisho ni yoru keitaisokaiseki shien tsūru ‘WebChamame.’” In Jinbun kagaku to konpyūta shinpojiumu.


The Maternity Shrine

This post caught my attention now that I am a mother. I certainly would no longer simply consider this shrine creepy like I did in 2011. Something else must have fascinated and saddened me about it then, too. There is so much hope and longing for a child and for a safe delivery out there. I wish I had visited again when I was pregnant.

Edited on January 26, 2022.

Photo by H. McGaughey

Kyoto, Japan I took this picture on a neighborhood tour near Kamigamo Shrine in Kyoto. This shrine is located in what looks like the garden of a private home. I did not catch the whole explanation, and I can’t find any information online, because I don’t know the name of this shrine. So, here is the story as I remember it told by the guide.


Farewell to a Friend

Taken near the Komaba campus of the University of Tokyo. Photo by H. McGaughey

Tokyo, Japan Reading an academic article today, I came across this poem by Retired Emperor Go-toba written when his loyal courtier Fujiwara no Ietaka was about to leave the island where Go-toba was exiled.


Sweets Wrapped in Oak Leaves

Edited on January 31, 2022.

Kashiwa mochi, sweets wrapped in oak leaves for children’s day. Photo by H. McGaughey

Yokohama, Japan Today, May 5 or 5/5, is Children’s Day, one of the string of national holidays this week. These holidays are collectively known as Golden Week, which doesn’t mean much to me, because my academic work doesn’t end, but it’s really nice to see people enjoying themselves at the neighborhood park or among the crowds in Shibuya, where I ran errands yesterday.

I didn’t realize until I looked it up just now that today is called Children’s Day and is supposed to be for both genders as of 1948. I thought that strange, because I somehow thought it was Boys’ Day, considering the images of Kintaro (the golden boy) and kabuto (samurai helmets) that contrast with the dolls of Girls’ Day, celebrated on March 3 or 3/3. Politicians can change the name, but they can’t change traditional festivals, I guess. It’s just unfortunate that in effect, “children” means boys today. I’m all for the boys having a festival, but call it what it is!


A Pilgrimage to Kumano

Lightly edited on January 31, 2022.

A stretch of the Kumano pilgrimage trails near Nachi Taisha. Photo by H. McGaughey

Yokohama, Japan At the end of last summer, which ended in late September for me on the Japanese academic calendar, I realized I had not taken advantage of my free time and decided to leave the Tokyo metropolis on a little trip. Photographs by a friend of mine who had been to Kumano earlier in the year had caught my fancy, and combined with the significance of Kumano as a pilgrimage destination in the Japanese middle ages, I thought it a suitable place to go.

I went for a total of two nights, staying at an onsen resort on an off-season, no-meals attached rate. The complex was in a small valley surrounded by greenery, which was a beautiful respite after a hot summer in the city. The day I arrived, the weather was rainy, and the forests and mountains were interwoven with low clouds that snaked through valleys and between trees like dragons.