Retrieval success during retrieval practice has been recognised as an important factor in enhancing long-term memory. The grain-size-of-testing hypothesis therefore proposes that several interim tests, each covering a smaller amount of information and interspersed throughout learning, should result in better retention than a single test at the end of learning. However, previous research using complex materials has found that although interim tests result in better practice performance than end tests, this advantage does not carry over to a final test. We evaluated the grain size hypothesis using lists of related (Experiment 1) and unrelated (Experiment 2) words and found that interim tests enhanced both practice and final test performance. However, the interim test group still showed considerable forgetting of successfully recalled items at the final test. A theoretical framework, explicitly tested in Experiment 3, suggests that desirable difficulty may be an important factor for long-term retention.