Accurate lighting estimation is crucial for enhancing realism in virtual environments used in augmented reality, virtual reality, and film production, ensuring seamless integration of virtual objects into real-world scenes. While traditional far-field lighting representations, such as environment maps, struggle to capture near-field lighting nuances, recent advances have leveraged deep learning and inverse rendering methods to predict per-pixel environment maps, volumes, or emitters. These techniques, though effective for tasks like object insertion, often either lack editability for dynamic lighting adjustments or are hindered by high computational costs and ambiguities between reflection and emission. Here, we explore fast near-field lighting estimation from the perspective of point light position prediction. Specifically, we train a vision transformer as a regressor to predict the point light position from a single observed image. We propose two alternative ways to over-parameterize the target by representing point lights as rays corresponding to image patches, which are then jointly processed by a diffusion vision transformer, offering an editable and neural-network-friendly representation. Our approach is trained and evaluated on a custom dataset derived from OpenRooms, featuring 259 scenes with diverse lighting conditions, to comprehensively assess its effectiveness. Quantitative and qualitative results show that our representations outperform a naive end-to-end model that merely outputs 3D positions: the positions predicted by our models deviate from the ground truth by around 0.35 and 0.38 of the scene scale, whereas the naive position prediction method achieves around 0.60, with all models trained on the first 200 scenes of our dataset.
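
To make the setup concrete, below is a minimal sketch of the naive baseline referenced in the abstract: a vision transformer trained as a regressor that maps a single observed image directly to a 3D point light position. This is not the paper's implementation; the architecture choices (patch size, embedding width, depth, mean pooling, MSE loss) are illustrative assumptions, and the ray-based over-parameterization and diffusion vision transformer are not shown here.

```python
"""Illustrative sketch of a naive ViT regressor for point light position.

All hyperparameters are assumptions for demonstration only.
"""
import torch
import torch.nn as nn


class ViTLightRegressor(nn.Module):
    def __init__(self, image_size=256, patch_size=16, dim=384, depth=6, heads=6):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: split the image into non-overlapping patches and
        # project each patch to a token of width `dim`.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Regression head: pool the patch tokens and output an (x, y, z)
        # light position, e.g. normalized by the scene scale.
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 3))

    def forward(self, images):                      # images: (B, 3, H, W)
        tokens = self.patch_embed(images)           # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        tokens = tokens + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens.mean(dim=1))        # (B, 3) light position


if __name__ == "__main__":
    model = ViTLightRegressor()
    images = torch.randn(2, 3, 256, 256)            # dummy image batch
    target = torch.rand(2, 3)                       # dummy normalized positions
    loss = nn.functional.mse_loss(model(images), target)
    loss.backward()
    print(model(images).shape)                      # torch.Size([2, 3])
```

The proposed representations instead attach a ray toward the light to each image patch token, so the target is predicted per patch and aggregated, rather than regressed as a single 3D vector as in this baseline.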