DKDS Dataset

Motivation

In ancient Japanese documents, seals often appear as red marks that overlap with Kuzushiji characters. Such overlap can obscure the characters and substantially reduce OCR accuracy. From left to right, the OCR errors observed in the examples are recognition of an extra character, recognition of an incorrect character, and misclassification of a seal inscription as text. OCR was conducted using the “miwo” application [Clanuwat et al. 2021].

Challenges in Track 1: Kuzushiji Character and Seal Detection

Kuzushiji character and seal detection is a crucial preliminary step for subsequent Kuzushiji OCR and seal analysis. The task is challenging because seals may suffer from ink fading, or may overlap with Kuzushiji characters or with other seals, all of which tend to reduce detection accuracy.
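To make "overlap" concrete, the sketch below measures how much a seal box covers a character box. It is only an illustration: the `(x_min, y_min, x_max, y_max)` box format, the function names, and the example coordinates are assumptions, not the annotation format or the evaluation protocol of the benchmark.

```python
from typing import Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box (0.0 for degenerate boxes)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two axis-aligned boxes (0.0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def covered_fraction(character: Box, seal: Box) -> float:
    """Fraction of the character box that lies under the seal box."""
    area = box_area(character)
    return intersection_area(character, seal) / area if area > 0 else 0.0


# Illustrative coordinates only.
character_box = (10.0, 10.0, 50.0, 50.0)
seal_box = (30.0, 30.0, 90.0, 90.0)
print(iou(character_box, seal_box))
print(covered_fraction(character_box, seal_box))
```

A character with a high covered fraction is largely hidden under a seal, which is the situation described above as hard for detectors.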

Challenges in Track 2: Document Binarization

Document binarization aims to improve the accuracy of downstream OCR systems. In this task, the objective is to remove seals while preserving, or even restoring, Kuzushiji characters as much as possible. This process becomes particularly challenging when the Kuzushiji characters overlap with seals.
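The difficulty with overlapping regions can be illustrated with a deliberately naive color rule. The thresholds, function names, and pixel values below are hypothetical and chosen only for illustration; this is not the binarization method used to build the dataset.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def is_dark_ink(pixel: Pixel, max_value: int = 90) -> bool:
    """Hypothetical rule: ink is dark in every channel."""
    return max(pixel) <= max_value


def is_seal_red(pixel: Pixel, min_red: int = 150, max_other: int = 110) -> bool:
    """Hypothetical rule: strong red channel, weak green and blue."""
    r, g, b = pixel
    return r >= min_red and g <= max_other and b <= max_other


def classify(pixel: Pixel) -> str:
    """Label a pixel as 'ink', 'seal', or 'background'."""
    if is_dark_ink(pixel):
        return "ink"
    if is_seal_red(pixel):
        return "seal"
    return "background"


def naive_binarize(image: List[List[Pixel]]) -> List[List[int]]:
    """Return 0 for ink and 255 for everything else (seals are removed)."""
    return [[0 if classify(p) == "ink" else 255 for p in row] for row in image]


# One illustrative row: paper, ink, seal, and ink lying under a seal.
row = [(230, 220, 200), (40, 35, 30), (200, 60, 50), (120, 30, 30)]
print([classify(p) for p in row])
print(naive_binarize([row]))
```

With these assumed values, the last pixel is neither dark enough to count as ink nor red enough to count as seal, so the rule discards it as background and the stroke under the seal is lost. Recovering such strokes is what makes the overlapping case hard.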

Dataset Construction Workflow

The workflow for constructing the DKDS dataset consists of four stages: detection annotation, initial binarization ground-truth generation, verification, and manual correction. For detection annotation, bounding boxes for Kuzushiji characters were taken from the OCR annotations provided by CODH [国文学研究資料館], while bounding boxes for seals were recorded as the seals were randomly added to the Kuzushiji document images. The initial binarization ground truth was generated following the method of [Ju et al. 2024], and verification was performed by a trained Kuzushiji expert.
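The idea of recording a seal's bounding box at the moment it is added can be sketched as follows. This is a minimal illustration under stated assumptions: images are nested lists of RGB tuples, the seal is pasted by overwriting pixels through a boolean mask, and `add_seal` and all of its parameters are hypothetical names. The actual pipeline may blend seals differently.

```python
import random
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Mask = List[List[bool]]
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


def add_seal(document: Image, seal: Image, seal_mask: Mask,
             rng: random.Random) -> Tuple[Image, Box]:
    """Paste a seal at a random position and return the image and its box."""
    doc_h, doc_w = len(document), len(document[0])
    seal_h, seal_w = len(seal), len(seal[0])
    if seal_h > doc_h or seal_w > doc_w:
        raise ValueError("seal must fit inside the document")

    # Top-left corner chosen so the whole seal stays inside the page.
    x0 = rng.randint(0, doc_w - seal_w)
    y0 = rng.randint(0, doc_h - seal_h)

    result = [list(row) for row in document]  # leave the input untouched
    for dy in range(seal_h):
        for dx in range(seal_w):
            if seal_mask[dy][dx]:  # only stamp pixels that belong to the seal
                result[y0 + dy][x0 + dx] = seal[dy][dx]

    # The annotation is known exactly because the placement was chosen here.
    return result, (x0, y0, x0 + seal_w, y0 + seal_h)


paper, red = (230, 220, 200), (200, 60, 50)
document = [[paper] * 8 for _ in range(6)]
seal = [[red] * 3 for _ in range(2)]
seal_mask = [[True, False, True], [True, True, True]]
stamped, box = add_seal(document, seal, seal_mask, random.Random(0))
print(box)
```

Because the placement is generated rather than detected, the seal annotation needs no manual labeling. Only the character boxes have to come from an external source.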

Verification by a Trained Kuzushiji Expert

A trained Kuzushiji expert manually reviewed the initial outputs generated by the pre-trained binarization model. These preliminary results frequently contained errors, so the expert identified and corrected issues such as missing marginal annotations, missing voiced sound marks, missing punctuation symbols, and residual background stains. The corrected results form the final binarization ground truth.

References

[Clanuwat et al. 2021] Clanuwat, Tarin, and Asanobu Kitamoto. “miwo” AI Kuzushiji Recognition Application for Document Examination. Proceedings of IPSJ Humanities and Computer Symposium. 2021.

[国文学研究資料館] 国文学研究資料館 (National Institute of Japanese Literature). 日本古典籍くずし字データセット (Kuzushiji Dataset of Japanese classical books). https://doi.org/10.20676/00000340.

BibTeX

If you use our dataset, please cite the paper below:

@article{ju2025dkds,
  title={DKDS: A Benchmark Dataset of Degraded Kuzushiji Documents with Seals for Detection and Binarization},
  author={Ju, Rui-Yang and Yamashita, Kohei and Kameko, Hirotaka and Mori, Shinsuke},
  journal={arXiv preprint arXiv:2511.09117},
  year={2025}
}

The following is the citation of the original Kuzushiji dataset; please cite it when using our benchmark dataset:

『日本古典籍くずし字データセット』（国文研所蔵／CODH加工） doi:10.20676/00000340