Data Mining Historical Manuscripts and Culture Artifacts
Initiatives such as the Google Print Library Project and the Million Book Project have already archived more than fifteen million books in digital format, over 11% of all the books ever published, and within the next decade the majority of the world's books will be online. While this digitized collection will be an invaluable resource for researchers to browse and search, we believe that the additional step of mining these manuscripts will reveal new insights, knowledge and historical context. Although most of the data will naturally be text, there will also be tens of millions of pages of images, many in color. In this dissertation, we introduce a simple color measure which both addresses and exploits typical features of historical manuscripts. To enable the efficient mining of massive archives, we propose a tight lower bound to the measure. Beyond fast similarity search, we show how this lower bound allows us to build several higher-level data mining tools, including motif discovery and link analysis. We demonstrate our ideas in several data mining tasks on manuscripts dating back to the fifteenth century.
In contrast to well-preserved or already digitized historical manuscripts, another category of cultural heritage, rock art, requires urgent action if it is to be explored and archived for posterity. Rock art is an archaeological term for human-made markings on stone, including carvings into stone surfaces (petroglyphs) and paintings on stone (pictographs). It is believed that there are millions of petroglyphs in North America alone, and the study of this valued cultural resource has implications even beyond anthropology and history. Surprisingly, although image processing, information retrieval and data mining have had large impacts on many human endeavors, they have had essentially zero impact on the study of rock art. In this dissertation, we consider, for the first time, the problem of data mining large collections of rock art. We propose a robust distance measure for unconstrained and complex shapes, a cheap-to-compute tight lower bound, and algorithms based on these two ideas which enable very fast query-by-content and make otherwise intractable data mining tasks in large collections of rock art feasible.
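To illustrate why a cheap, tight lower bound matters for search in large archives, the following is a minimal sketch of lower-bound pruning in nearest-neighbor search. It does not reproduce the color or shape measures proposed in this dissertation: as a stand-in, it assumes plain Euclidean distance between feature vectors and uses the difference of vector norms as the lower bound (valid by the reverse triangle inequality). The function names `euclidean`, `norm`, `lower_bound` and `nearest_neighbor` are hypothetical and chosen only for this example.

```python
import math


def euclidean(a, b):
    """Full (expensive) distance; a stand-in for the real measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def norm(a):
    """Length of a feature vector; can be precomputed per archive item."""
    return math.sqrt(sum(x * x for x in a))


def lower_bound(q_norm, c_norm):
    """Cheap bound: | ||q|| - ||c|| | never exceeds euclidean(q, c)."""
    return abs(q_norm - c_norm)


def nearest_neighbor(query, candidates):
    """Exact 1-NN search that skips candidates the bound rules out.

    Returns (index of best match, its distance, number of full
    distance computations actually performed).
    """
    q_norm = norm(query)
    c_norms = [norm(c) for c in candidates]  # assumed precomputed offline
    best_idx, best_dist = -1, math.inf
    full_computations = 0
    for i, c in enumerate(candidates):
        # If even the optimistic bound cannot beat the best so far,
        # the true distance cannot either, so skip the expensive call.
        if lower_bound(q_norm, c_norms[i]) >= best_dist:
            continue
        full_computations += 1
        d = euclidean(query, c)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist, full_computations


if __name__ == "__main__":
    archive = [[1.0, 0.0], [0.0, 1.1], [10.0, 10.0], [20.0, 0.0]]
    idx, dist, n_full = nearest_neighbor([1.0, 0.1], archive)
    print(idx, round(dist, 3), n_full)
```

Because the bound never overestimates the true distance, pruning cannot discard the true nearest neighbor, so the result is identical to a brute-force scan. The tighter the bound, the more candidates are skipped, which is why tightness is emphasized alongside low computational cost.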