Thermal comfort in buildings is typically assessed through occupant surveys, especially for short-term thermal comfort. For long-term thermal comfort, thermal comfort standards and recent research suggest that continuous physical monitoring of temperature is sufficient. However, a lack of formal rules for data representation in building automation systems and the high cost of developing analytical applications for buildings impede predicting long-term thermal comfort at scale. This paper demonstrates portable and reproducible application development techniques for evaluating long-term thermal comfort with the Brick metadata schema and the Mortar data testbed. Whereas previous research often performs analysis on limited datasets, we take advantage of the relatively large Mortar dataset, which contains over 25 buildings, to improve the generalizability of long-term thermal comfort evaluation. The design of Mortar enables running the same software applications across many heterogeneous buildings, which simplifies building analytics application development and acts as a vehicle for reproducible evaluations in building science. To assess the efficacy of this workflow, we identify six air temperature-based long-term thermal comfort evaluation metrics from the literature and implement them in software. The six indices are the temperature mean index, temperature variance index, degree hours index, range outlier index, daily range outlier index, and combined outlier index. During application development, we find that the calculation of the threshold in the daily range outlier index is arbitrary, and that it is unclear which months belong to the cooling and heating seasons, which have different comfortable temperature ranges. In addition, none of the long-term thermal comfort indices differentiates between too hot and too cold. To address this, we develop two new metrics that calculate overheating and overcooling separately. We evaluate our software across all the buildings available in the Mortar testbed.
The results show that 25 buildings with 1953 thermal zones have qualified air temperature sensor data during building occupancy. Based on this building dataset, we analyze the Pearson correlation among the long-term thermal comfort indices. Across this dataset, the range outlier index has a Pearson correlation coefficient of 0.19 with the daily range outlier index, compared with a coefficient of -0.35 at a randomly selected building in Mortar. The opposite sign of these two results indicates that a small building dataset is insufficient for developing long-term thermal comfort indices and can generate misleading results. With the help of the uniform Brick metadata schema, we also investigate disaggregating the results by building, floor, zone, and equipment, and we summarize these disaggregated results as a means of identifying problem areas and equipment.
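
As an illustration of the kind of calculation described above, the following is a minimal sketch of a range outlier index, separate overheating and overcooling degree hours, and a Pearson correlation between two index series. The function names, the 20 to 26 °C comfort range, and the hourly sampling interval are assumptions made for this example; they are not the definitions or thresholds used in the paper, and the sketch does not use the Brick or Mortar APIs.

```python
from statistics import mean


def range_outlier_index(temps, low, high):
    """Fraction of occupied-hour samples outside the range [low, high]."""
    outside = [t for t in temps if t < low or t > high]
    return len(outside) / len(temps)


def overheating_degree_hours(temps, high, hours_per_sample=1.0):
    """Sum of exceedances above the upper limit, weighted by sample duration."""
    return sum(max(t - high, 0.0) for t in temps) * hours_per_sample


def overcooling_degree_hours(temps, low, hours_per_sample=1.0):
    """Sum of shortfalls below the lower limit, weighted by sample duration."""
    return sum(max(low - t, 0.0) for t in temps) * hours_per_sample


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical hourly zone air temperatures (deg C) during occupancy.
zone_temps = [19.0, 21.0, 27.0, 28.0]
LOW, HIGH = 20.0, 26.0  # assumed comfort range, for illustration only

roi = range_outlier_index(zone_temps, LOW, HIGH)
hot = overheating_degree_hours(zone_temps, HIGH)
cold = overcooling_degree_hours(zone_temps, LOW)
print(roi, hot, cold)
```

Keeping overheating and overcooling as separate quantities preserves the direction of discomfort, which a single combined outlier fraction cannot express. In a multi-building study, `pearson` would be applied to per-zone index values pooled across buildings rather than to a single building's zones.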