DKDS: A Benchmark Dataset of Degraded Kuzushiji Documents with Seals for Detection and Binarization

Kyoto University
The DKDS dataset is the first collection of degraded ancient Japanese document images designed specifically to address the problem of Kuzushiji characters overlapping with seals. Based on the dataset, we define two benchmark tracks: (1) Text and Seal Detection, and (2) Document Binarization.

Motivation

In ancient Japanese documents, seals often appear as red marks that overlap with Kuzushiji characters. Such overlap can obscure characters and substantially reduce OCR accuracy. From left to right, the observed OCR errors are recognition of an extra character, recognition of an incorrect character, and misclassification of a seal inscription as text. OCR was conducted using the “miwo” application [Clanuwat et al. 2021].

Challenges in Track 1: Text and Seal Detection

Text and seal detection serves as a crucial preliminary step for subsequent Kuzushiji OCR and seal analysis. However, this task is challenging because seals may (a) suffer from ink fading or (b) overlap with Kuzushiji characters or other seals, which often leads to reduced detection accuracy.
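
As a rough illustration of case (b), the sketch below lists which character boxes a given seal box overlaps. The box format `(x_min, y_min, x_max, y_max)` and the function names `intersection_area` and `overlapped_characters` are assumptions made for this example; they are not taken from the dataset's annotation files or tooling.

```python
def intersection_area(a, b):
    """Area shared by two boxes given as (x_min, y_min, x_max, y_max)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    # A negative width or height means the boxes do not intersect.
    return max(width, 0) * max(height, 0)


def overlapped_characters(seal_box, char_boxes):
    """Indices of character boxes that share any area with the seal box."""
    return [
        i for i, char_box in enumerate(char_boxes)
        if intersection_area(seal_box, char_box) > 0
    ]
```

Boxes that only touch along an edge have zero shared area and are not counted as overlapping here.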

Challenges in Track 2: Document Binarization

Document binarization aims to improve the accuracy of downstream OCR systems. In this task, the objective is to remove seals while preserving, or even restoring, Kuzushiji characters as much as possible. This process becomes particularly challenging when the Kuzushiji characters overlap with seals.
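
To make the task concrete, here is a minimal sketch of one naive baseline, not the method used to build or evaluate DKDS. It thresholds only the red channel, on the assumption that red seal ink stays bright in that channel while black ink is dark in all channels. The function name `red_channel_binarize` and the fixed threshold of 128 are illustrative assumptions.

```python
import numpy as np


def red_channel_binarize(rgb, threshold=128):
    """Binarize an RGB page: 0 = ink, 255 = background.

    rgb is an HxWx3 uint8 array in RGB channel order.
    """
    red = rgb[..., 0]
    # Pixels that are dark in the red channel are treated as ink;
    # bright-red seal pixels fall on the background side of the threshold.
    return np.where(red < threshold, 0, 255).astype(np.uint8)
```

A fixed threshold like this is unlikely to hold up on degraded paper, faded ink, or dark seal impressions, which is part of what makes the overlapping cases difficult.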

Workflow

The overall workflow for constructing the proposed DKDS dataset includes detection annotations, initial binarization ground-truth generation, verification, and manual correction. For detection annotations, bounding box information for Kuzushiji characters was sourced from the OCR annotations provided by CODH [国文学研究資料館], while bounding boxes for the seals were recorded as the seals were randomly added to the Kuzushiji document images. The initial binarization ground truth was generated following the method of [Ju et al. 2024], and verification was performed by a trained Kuzushiji expert.
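
The seal-annotation step can be sketched as follows: a seal image is alpha-blended onto a page at a random position, and the pasted region is recorded as its bounding box. This is a minimal sketch under stated assumptions, not the authors' pipeline. The function name `paste_seal`, the RGBA seal format, the uniform placement, and the `(x_min, y_min, x_max, y_max)` box convention are all hypothetical.

```python
import numpy as np


def paste_seal(page, seal_rgba, rng):
    """Alpha-blend a seal at a random position; return (image, bbox).

    page: HxWx3 uint8 RGB array. seal_rgba: hxwx4 uint8 RGBA array.
    bbox is (x_min, y_min, x_max, y_max) in pixels, max values exclusive.
    """
    page_h, page_w = page.shape[:2]
    seal_h, seal_w = seal_rgba.shape[:2]
    if seal_h > page_h or seal_w > page_w:
        raise ValueError("seal must fit inside the page")

    # Pick a top-left corner that keeps the seal fully inside the page.
    y0 = int(rng.integers(0, page_h - seal_h + 1))
    x0 = int(rng.integers(0, page_w - seal_w + 1))

    out = page.astype(np.float32)  # astype returns a copy, so page is untouched
    alpha = seal_rgba[..., 3:4].astype(np.float32) / 255.0
    region = out[y0:y0 + seal_h, x0:x0 + seal_w]
    blended = alpha * seal_rgba[..., :3] + (1.0 - alpha) * region
    out[y0:y0 + seal_h, x0:x0 + seal_w] = blended

    image = np.clip(np.rint(out), 0, 255).astype(np.uint8)
    return image, (x0, y0, x0 + seal_w, y0 + seal_h)
```

Passing a seeded generator such as `np.random.default_rng(0)` makes the placement reproducible. Because the box is produced at paste time, no separate manual annotation pass is needed for the seals in this sketch.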

References

[Clanuwat et al. 2021] Clanuwat, Tarin, and Asanobu Kitamoto. “miwo” AI Kuzushiji Recognition Application for Document Examination. Proceedings of IPSJ Humanities and Computer Symposium. 2021.

[国文学研究資料館] National Institute of Japanese Literature (国文学研究資料館). Kuzushiji Dataset of Pre-Modern Japanese Texts (日本古典籍くずし字データセット). https://doi.org/10.20676/00000340.

BibTeX

If you use our dataset, please cite the paper below: