Finding the sparsest or minimum L0-norm representation of a signal given a (possibly) overcomplete dictionary of basis vectors is an important problem in many application domains, including neuroelectromagnetic source localization, compressed sensing, sparse component analysis, feature selection, image restoration/compression, and neural coding. Unfortunately, the required optimization is typically NP-hard, and so approximate procedures that succeed with high probability are sought. Nearly all current approaches to this problem, including orthogonal matching pursuit (OMP), basis pursuit (BP) (or the LASSO), and minimum Lp quasi-norm methods, can be viewed in Bayesian terms as performing standard MAP estimation using a fixed, sparsity-inducing prior. In contrast, we advocate empirical Bayesian approaches such as sparse Bayesian learning (SBL), which use a parameterized prior to encourage sparsity through a process called evidence maximization. We prove several results about the associated SBL cost function that elucidate its general behavior and provide solid theoretical justification for using it to find maximally sparse representations. Specifically, we show that the global minimum of the SBL cost function is always achieved at the maximally sparse solution, unlike the BP cost function, and that SBL often possesses a more limited constellation of local minima than comparable MAP methods that share this property. We also derive conditions, dependent on the distribution of the nonzero model weights embedded in the optimal representation, under which SBL has no local minima. Finally, we demonstrate that, out of a large class of latent-variable models, a generalized form of SBL uniquely satisfies two minimal performance criteria directly linked to sparsity. These results lead to a deeper understanding of the connections between various Bayesian-inspired strategies and suggest new sparse learning algorithms. Several extensions of SBL are also considered for handling sparse representations that arise in spatio-temporal settings and in the context of covariance component estimation. Here we assume that a small set of common features underlies the observed data collected over multiple instances. The theoretical properties of these SBL-based cost functions are examined and evaluated in the context of existing methods. The resulting algorithms display excellent performance on extremely large, ill-posed, and ill-conditioned problems in neuroimaging, suggesting a strong potential for impact in this field and others.
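For concreteness, the two objectives referenced above can be sketched in standard notation that the abstract itself does not define (dictionary $\Phi$, observed signal $y$, weight vector $w$, noise variance $\sigma^2$, and nonnegative hyperparameters $\gamma$); the SBL expression below is one common form of the evidence-maximization cost, not necessarily the exact variant analyzed here. The noiseless sparse recovery problem is

\[ \min_{w} \; \|w\|_0 \quad \text{subject to} \quad y = \Phi w , \]

while SBL assumes $w \sim \mathcal{N}(0, \Gamma)$ with $\Gamma = \operatorname{diag}(\gamma)$ and, after marginalizing over $w$, minimizes the negative log evidence

\[ \mathcal{L}(\gamma) \;=\; \log\bigl|\sigma^{2} I + \Phi \Gamma \Phi^{T}\bigr| \;+\; y^{T}\bigl(\sigma^{2} I + \Phi \Gamma \Phi^{T}\bigr)^{-1} y , \]

with the posterior mean of $w$ under the learned $\gamma$ serving as the sparse estimate; many components of $\gamma$ are driven to zero, pruning the corresponding dictionary columns.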