UD_Japanese-CEJC: Dependency Relation Annotation on Corpus of Everyday Japanese Conversation
Mai Omura, Hiroshi Matsuda, Masayuki Asahara, Aya Wakasa
Abstract:
In this study, we have developed Universal Dependencies (UD) resources for spoken Japanese in the Corpus of Everyday Japanese Conversation (CEJC). The CEJC is a large corpus of spoken language that encompasses various everyday conversations in Japanese, and includes word delimitation and part-of-speech annotation. We have newly annotated Long Word Unit delimitation and Bunsetsu (Japanese phrase)-based dependencies, including Bunsetsu boundaries, for CEJC. The UD of Japanese resources was constructed in accordance with hand-maintained conversion rules from the CEJC with two types of word delimitation, part-of-speech tags and Bunsetsu-based syntactic dependency relations. Furthermore, we examined various issues pertaining to the construction of UD in the CEJC by comparing it with the written Japanese corpus and evaluating UD parsing accuracy.