In this paper, we developed three ACT-R cognitive models to simulate the learning process of abacus gestures. Abacus gestures are mid-air gestures, each representing a number between 0 and 99. Our models learn the gestures and predict the response time of producing each one. We found that the accuracy of a model's predictions depends on the structure of its declarative memory. A model with 100 chunks cannot reproduce human response times, whereas models using fewer chunks can, because segmenting chunks increases both the frequency and recency of information retrieval. Furthermore, our findings suggest that the mind is more likely to represent abacus gestures by dividing attention between the two hands than by memorizing and outputting each full gesture directly. These insights have important implications for future research in cognitive science and human-computer interaction, particularly in extending the vision and motor modules of existing cognitive architectures to capture such mental states and in designing intuitive and efficient mid-air gesture interfaces.
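As a minimal sketch of the mechanism behind the chunk-structure finding, the standard ACT-R base-level learning and retrieval-latency equations (in their default forms, with the latency exponent fixed at 1; the symbols are the architecture's, not quantities introduced by this work) make the role of frequency and recency explicit:

\[
B_i = \ln\!\left(\sum_{j=1}^{n} t_j^{-d}\right), \qquad \mathit{RT}_i = F\, e^{-A_i},
\]

where $n$ is the number of prior retrievals of chunk $i$, $t_j$ is the time elapsed since the $j$-th retrieval, $d$ is the decay parameter (0.5 by default), $A_i$ is the chunk's total activation (base level plus spreading activation and noise), and $F$ is the latency factor. When gestures are segmented into a small set of per-hand chunks, each chunk accumulates many recent retrievals, so $B_i$ remains high and predicted retrieval times remain short; with 100 gesture-specific chunks, each chunk is retrieved only rarely, its activation decays, and the predicted response times diverge from human performance.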