Text Mining

Troubleshooting an Install with ChatGPT

Thank you, ChatGPT!!!

Screenshot of my conversation with ChatGPT troubleshooting my installation of the Neologd dictionary for MeCab. Please don’t judge me on how elementary my question is! 😆

I had been suspicious of ChatGPT, given the concerns that Chomsky and his co-authors raised in a New York Times article.

Then, two weeks ago, I got really stuck using Terminal on my Mac to install neologd, a neologism dictionary for processing Japanese. I may be a digital humanist, but I really prefer the theory to the actual computing. It can be so hard to get a computer to work the way I want it to! 😅

Googling all my questions didn’t produce answers specific to my problem, and asking for help in the GitHub Issues forum for the dictionary didn’t get much of a response. Then, one night as I was falling asleep, I remembered reports of people using ChatGPT to check code.

Success! I could ask ChatGPT all of my newbie questions and didn’t have to worry about taking up someone’s time. Sure, not all of the answers were accurate, but working with ChatGPT eventually helped me identify my problem and fix it.

To follow my frustrating process and see a summary of the results, see the GitHub Issue I opened for neologd two weeks ago and just recently closed.

While some might say IT helpdesk personnel could be out of a job, for computing problems with open-source software like this dictionary, ChatGPT can be a really useful tool for troubleshooting, as well as for learning computing structures and code.


Installing MeCab and RMeCab on a Mac

Since my last post, you’ve maybe played around with WebChamame a bit. That site is a powerful way to see what’s possible with natural language processing, a set of methods used in linguistics and other social sciences to derive statistics from texts. Although WebChamame lets you adjust a number of settings and export your data as downloadable files, the site has its limitations. Most prominent among them: it takes a long time to find the bits of information you might be interested in.

Statisticians created the programming language R, and the RStudio environment for working in it, to overcome that problem, because they need to wrangle data quickly to produce useful information. It is also possible to tokenize premodern Japanese texts in R and RStudio, but doing so requires installing a program for morphological analysis on your own computer. As you might have noticed in the last post, WebChamame uses MeCab for its morphological analyses.
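If you want to try this on your own Mac, the install might look something like the sketch below. This assumes you already have Homebrew; the RMeCab repository URL in the comment is the one its author documents, but check the current RMeCab instructions before relying on it.

```shell
# Install the MeCab morphological analyzer and its standard IPA dictionary
# (assumes Homebrew is already set up on your Mac).
brew install mecab mecab-ipadic

# Quick check: tokenize a sentence from the command line.
echo "すもももももももものうち" | mecab

# Then, inside R, RMeCab can be installed from its author's repository
# (URL as documented on the RMeCab site; verify before use):
#   install.packages("RMeCab", repos = "https://rmecab.jp/R")
```

If `mecab` prints a table of tokens and their parts of speech, the analyzer itself is working, and any remaining trouble is on the R side.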

This is an illustration from Tsutsumi and Ogiso (2015) of WebChamame’s workflow.

Breaking Digital Premodern Japanese Texts Down into Words using WebChamame

If you want to analyze Japanese texts digitally, the first problem you might run up against is that Japanese does not use spaces between words. A computer needs those spaces to know where one word ends and the next begins. So you first need to “tokenize” the text, that is, determine the word boundaries. Deciding what is and is not a word is difficult. Is a verb ending a word or a part of a word? Linguists discuss these kinds of questions for us literary scholars and have created the necessary tools so that we don’t have to insert spaces into a text manually. Imagine how much work that would be!
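To get a feel for the problem these tools solve, here is a toy illustration in Python. It is a hypothetical longest-match dictionary lookup with a tiny made-up word list, not how MeCab actually works: real morphological analyzers combine huge lexicons with statistical models to choose the best segmentation.

```python
# Toy longest-match tokenizer: a crude sketch of the problem MeCab solves.
# The tiny dictionary below is hypothetical; real analyzers use large
# lexicons plus statistical scoring to pick among competing segmentations.

DICTIONARY = {"私", "は", "本", "を", "読む", "読", "む"}

def tokenize(text, dictionary):
    """Greedily match the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible substring starting at position i first.
        for length in range(len(text) - i, 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("私は本を読む", DICTIONARY))
# ['私', 'は', '本', 'を', '読む']
```

Even this toy shows why the problem is hard: greedy longest-match can pick a wrong segmentation when a long dictionary entry happens to span a real word boundary, which is exactly why tools like MeCab score whole segmentations instead of matching greedily.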

This is an image from Den et al. (2007) illustrating the different levels of differentiation between words in modern Japanese, taking into account complex compound nouns and verbs, in this case verbs combining a noun with the irregular verb “to do” (suru).

One way to see how computers can tokenize words (without installing anything on your own computer) is to use WebChamame. This site was built by researchers at the National Institute of Japanese Language and Linguistics. To get started, type a Japanese sentence into the left window on the WebChamame site.

This is a screengrab of the WebChamame website on June 28, 2022. Type your text into the box under 「テキストを入力」 (“Enter text”) to see what this site can do.