Increasing design complexity and the diminishing marginal utility of monolithic processor designs have led to the integration of multiple loosely coupled processing cores on the same die. However, fundamental questions remain about the right form, implementation, and design methodology for multi-core processors; this thesis addresses these questions. A popular methodology for designing a multi-core architecture is to replicate an off-the-shelf core design several times and then connect the copies with an interconnect. This methodology is "multi-core oblivious": each subsystem is designed and optimized without regard to the chip-multiprocessor system it will become part of. This thesis demonstrates that the methodology is highly inefficient in area and power, and instead advocates a holistic approach in which the subsystems are designed from the ground up as components of a complete system.

In "multi-core oblivious" designs, inefficiency arises at several levels. First, replicating identical cores leaves the chip unable to adapt to the differing demands of its workloads, so processor resources end up either underutilized or overutilized. This thesis therefore proposes single-ISA (instruction-set architecture) heterogeneous multi-core architectures, in which the die hosts cores with different power/performance characteristics that all execute the same ISA. When applications are mapped to cores judiciously, such a processor can deliver significant power savings and performance improvements. The thesis also presents holistic design methodologies for these architectures.

A second source of inefficiency is the blind replication of over-provisioned hardware structures. To address it, the thesis proposes conjoined-core chip multiprocessing, in which adjacent cores of a multi-core architecture share some resources, and shows that this sharing yields significant area savings with little performance degradation.
The thesis also proposes novel optimizations that further reduce this already small degradation. A third source of inefficiency is the interconnect. This thesis shows that interconnect overheads can be substantial in a "multi-core oblivious" design, especially as the number of cores increases and pipelines get deeper. It demonstrates the need to co-design the cores, the memory, and the interconnect to avoid this inefficiency, and offers several recommendations for such co-design.
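As a rough illustration of what mapping applications to heterogeneous cores "judiciously" can mean, the following Python sketch assigns each application to a distinct core by exhaustive search, minimizing the summed energy-delay product. Everything here is a hypothetical assumption for illustration: the profile numbers, the core names `big` and `little`, the helper names `edp` and `best_mapping`, and the choice of energy-delay product as the objective. The abstract does not specify the thesis's actual mapping policy.

```python
from itertools import permutations

# Hypothetical profiles: for each application, the (instructions per second,
# watts) it would achieve on each core type. Placeholder numbers, not measurements.
PROFILES = {
    "compute_bound": {"big": (4.0e9, 20.0), "little": (1.0e9, 2.0)},
    "memory_bound": {"big": (1.2e9, 18.0), "little": (0.9e9, 2.0)},
}


def edp(ips, watts, instructions):
    """Energy-delay product (joule-seconds) of running `instructions` at the given rate and power."""
    seconds = instructions / ips
    return watts * seconds * seconds  # energy (watts * seconds) times delay (seconds)


def best_mapping(apps, cores, profiles, instructions=1e9):
    """Try every one-application-per-core assignment and keep the lowest total EDP.

    Returns (None, inf) if there are fewer cores than applications.
    """
    best, best_cost = None, float("inf")
    for chosen in permutations(cores, len(apps)):
        cost = sum(
            edp(*profiles[app][core], instructions)
            for app, core in zip(apps, chosen)
        )
        if cost < best_cost:
            best, best_cost = dict(zip(apps, chosen)), cost
    return best, best_cost


if __name__ == "__main__":
    mapping, cost = best_mapping(["compute_bound", "memory_bound"], ["big", "little"], PROFILES)
    print(mapping, round(cost, 3))
```

With these made-up numbers, the compute-bound application lands on the big core and the memory-bound one on the little core: the memory-bound application gains little speed from the big core but still pays its full power cost. Exhaustive search is practical only for a handful of cores and stands in here for whatever scheduling policy a real system would use.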