MILC-Dslash is a benchmark derived from the MILC code, which simulates lattice gauge theory on a four-dimensional hypercubic lattice. This paper outlines a gradual progression in the granularity of parallelism in the MILC-Dslash kernel using the SYCL programming model, transitioning from a simple to a fully parallel implementation. We explore the impact of various parallel strategies on MILC-Dslash performance on an NVIDIA A100 GPU. The investigation encompasses the different work-item index orders, work-group sizes, and memory access patterns that arise from these strategies. Components intertwined with the parallel strategies include atomic memory operations, shared variables, divergent instructions, synchronization barriers, scenarios with and without dependencies between iterations, and versions with and without the SYCL complex library (SyclCPLX) and the SYCLomatic tool. The best parallel strategy is twice as fast as the simplest strategy and 10% faster than the QUDA baseline, thanks to enhanced parallelism and the use of work-group local memory. This result, along with other findings - such as optimizing GPU resource utilization even at the expense of concurrency, preferring work-item indexing methods that yield more localized memory access patterns, and maximizing both the number of active work-items per warp and the number of consecutive active work-items - could provide valuable guidance for researchers and developers seeking to optimize parallel computing applications.