Nowadays, visual editing systems are widely used in daily life. Despite strong demand, professional design tools such as Photoshop or Premiere require specialized knowledge and complex operations, which makes it difficult for novices to get started. In contrast, language is the most natural means of communication. If a system can interpret given instructions and automatically perform the corresponding editing actions, it will significantly improve accessibility and meet this considerable need. This dissertation presents our research thrust in controllable visual editing via natural language, which connects text understanding with visual generation to benefit practical usage. While data-driven learning has proven effective, gathering large numbers of paired input and result images remains laborious, and obtaining the corresponding instructions is equally challenging. To overcome this data scarcity issue, we integrate counterfactual thinking and mimic human iterative editing through self-supervised reasoning. In addition, we study how to perceive style patterns from visual attributes and human emotions, making artistic style transfer more attainable. Unlike static images, videos are more challenging to process because of their dynamic motion and the need for smooth temporal coherence. We then investigate video editing, which should change only the target semantics while preserving the rest of the scene. We explore multi-level representations of videos to modify their visual properties or motions. We further develop a unified video completion system that, given arbitrary frames, follows instructions to generate the full video from any time point. Beyond images and videos, we take a step forward in natural visual manipulation. Specifically, we study two challenging tasks: 3D human generation and instruction-based editing for natural images. We propose an efficient fusion of textual descriptions and visual rendering to produce concrete 3D characters. We also leverage the latent visual knowledge of large language models to bridge the gap in instruction understanding for image editing. Our efforts shed light on generalizing visual editing to more diverse and practical scenarios. Finally, we summarize the contributions and implications of our work and discuss future directions for this research field.