Hybrid floating-point (FP) implementations improve software FP performance without incurring the area overhead of full hardware FP units. The proposed implementations are synthesized in 65-nm CMOS and integrated into small fixed-point processors with a RISC-like architecture. Unsigned, shift carry, and leading-zero detection (USL) support is added to a processor to augment an existing instruction set architecture and increase FP throughput with little area overhead. The hybrid implementations with USL support increase software FP throughput per core by 2.18× for addition/subtraction, 1.29× for multiplication, 3.07-4.05× for division, and 3.11-3.81× for square root, and use 90.7-94.6% less area than dedicated fused multiply-add (FMA) hardware. Hybrid implementations with custom FP-specific hardware increase throughput per core over a fixed-point software kernel by 3.69-7.28× for addition/subtraction, 1.22-2.03× for multiplication, 14.4× for division, and 31.9× for square root, and use 77.3-97.0% less area than dedicated FMA hardware. The circuit area and throughput are found for 38 multiply-add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square root designs. Thirty-three multiply-add implementations are presented, which improve throughput per core versus a fixed-point software implementation by 1.11-15.9× and use 38.2-95.3% less area than dedicated FMA hardware.
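
As a rough illustration of why leading-zero detection can help software FP on a fixed-point core, the sketch below contrasts normalizing a significand one bit at a time with normalizing it in a single shift once the leading-zero count is known. The 32-bit width, the function names, and the step counter are assumptions made for this example only; the abstract does not specify the semantics of the USL instructions, so this is not a description of the authors' design.

```python
WIDTH = 32  # assumed significand register width, for illustration only
MASK = (1 << WIDTH) - 1


def normalize_loop(sig, exp):
    """Normalize by shifting left one bit per iteration (no leading-zero support).

    Returns (significand, exponent, number_of_shift_iterations).
    """
    if sig == 0:
        return 0, 0, 0
    steps = 0
    # Shift until the most significant bit of the WIDTH-bit word is set.
    while not (sig >> (WIDTH - 1)) & 1:
        sig = (sig << 1) & MASK
        exp -= 1
        steps += 1
    return sig, exp, steps


def clz(sig):
    """Count leading zeros of a WIDTH-bit word.

    Stands in for a hypothetical single-instruction leading-zero detect.
    """
    return WIDTH - sig.bit_length()


def normalize_clz(sig, exp):
    """Normalize with one leading-zero count and one shift."""
    if sig == 0:
        return 0, 0
    n = clz(sig)
    return (sig << n) & MASK, exp - n
```

Both routines produce the same normalized significand and exponent; the loop version needs one iteration per leading zero, whereas the second version needs a single count and a single shift regardless of the input.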