Towards understanding text-data connection in documents through the lens of data operations
As the world relies more heavily on data, text descriptions of data become increasingly prevalent. These descriptions synthesize data and highlight its key takeaways so that readers can understand them more easily. Text and data are clearly connected, since the text is a representation of information drawn from the data. However, despite the surge of AI and data management research, studies of this connection are still lacking. Understanding it would not only streamline work that involves both components, but would also introduce novel interaction techniques between them. This work therefore aims to develop a better understanding of the connection by focusing on how each phrase in a sentence is formed, given clues from the rest of the sentence and the associated data. We collected data-rich documents and investigated how people describe data in natural language. We found that the problem is complicated because it combines two difficult subproblems: a language problem and a data management problem. We also found that each phrase inference can be viewed as a series of data operations that can be traced back to the language. We therefore propose a taxonomy of Language-Inferred Data Operations (LIDOs) based on our collected dataset. In addition, we propose the Data-Language Inference Framework (DLIF), a conceptual framework that eases the phrase prediction process by decomposing this complex problem into five simpler steps. We show examples of DLIF applied to real datasets to illustrate how it works.