If you want to analyze Japanese texts digitally, the first problem you might run up against is that Japanese does not use spaces between words. A computer relies on those spaces to know where one word ends and the next begins. So, you first need to “tokenize” the text, that is, split it into individual words. Deciding what counts as a word is difficult. Is a verb ending a word or part of a word? Linguists have debated these kinds of questions and have created the necessary tools, so that we literary scholars don’t have to insert spaces manually into a text. Imagine how much work that would be!
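To make the problem concrete, here is a minimal Python sketch. It is only a toy illustration, not how any real tool works: the word list `TOY_DICTIONARY` and the function `tokenize` are made up for this example, and the dictionary covers just one sample sentence. The sketch first shows that splitting on spaces finds a single "word" in a Japanese sentence, and then splits the same sentence by always taking the longest dictionary entry that matches at the current position.

```python
# A hypothetical, hand-picked word list that only covers the sample sentence.
TOY_DICTIONARY = {"私", "は", "本", "を", "読み", "まし", "た"}


def tokenize(text, dictionary=TOY_DICTIONARY):
    """Split text by greedily taking the longest dictionary match."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            # Nothing matched: keep the unknown character as its own token.
            candidate = text[i]
        tokens.append(candidate)
        i += len(candidate)
    return tokens


sentence = "私は本を読みました"  # "I read a book."

# Splitting on spaces returns the whole sentence as one item.
print(sentence.split())

# The dictionary-based approach separates the sentence into smaller units.
print(tokenize(sentence))
```

Note that this toy dictionary treats the verb endings まし and た as separate tokens. That is one possible answer to the question of whether a verb ending is a word, and a different word list would give a different answer. Real analyzers rely on much larger dictionaries and on statistical models rather than a simple longest-match rule.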
One way to see how computers can tokenize words (without installing anything on your own computer) is to use WebChamame. This site was built by researchers at the National Institute for Japanese Language and Linguistics. To get started, type a Japanese sentence into the left window on the WebChamame site.