The general assumption in reinforcement learning (RL) that agents may explore freely in search of an optimal policy limits its applicability in real-world domains where safe exploration is required. In this paper, we study constrained RL in episodic MDPs to investigate efficient exploration in safe RL. We formally describe two constraint schemes frequently considered in empirical studies: soft constrained RL, which targets overall safety satisfaction, and hard constrained RL, which aims to guarantee safety throughout learning. While violations may occur under the former scheme, the latter enforces safety by extending the challenging knapsack problem from multi-armed bandits to MDPs. Accordingly, we propose two novel sample-efficient constrained Q-learning algorithms. By balancing exploration and exploitation with upper confidence bounds (UCB), our methods reduce the notoriously high sample complexity of constrained model-free settings while achieving asymptotically optimal solutions. Our theoretical analyses establish promising regret bounds for both algorithms.
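To make the UCB-based constrained Q-learning idea concrete, the following is a minimal illustrative sketch of a tabular episodic agent that maintains optimistic reward Q-values with a UCB-style exploration bonus alongside empirical cost Q-values, and restricts action selection to actions whose estimated cost-to-go stays within a per-episode budget. This is not the paper's algorithm; the class name, step size, bonus form, and all parameters (`cost_budget`, `bonus_scale`, etc.) are assumptions made purely for illustration.

```python
import numpy as np


class ConstrainedUCBQAgent:
    """Illustrative sketch only: tabular Q-learning with a UCB bonus and a cost budget."""

    def __init__(self, n_states, n_actions, horizon, cost_budget, bonus_scale=1.0):
        self.nS, self.nA, self.H = n_states, n_actions, horizon
        self.cost_budget = cost_budget      # assumed per-episode cost threshold
        self.bonus_scale = bonus_scale      # scales the exploration bonus
        # Optimistic reward Q-values (initialized to H) and empirical cost Q-values,
        # one table per step of the horizon.
        self.Qr = np.full((horizon, n_states, n_actions), float(horizon))
        self.Qc = np.zeros((horizon, n_states, n_actions))
        self.counts = np.zeros((horizon, n_states, n_actions))

    def act(self, h, s):
        # Restrict to actions whose estimated cost-to-go fits the budget;
        # among those, pick the action with the highest optimistic reward value.
        feasible = np.where(self.Qc[h, s] <= self.cost_budget)[0]
        candidates = feasible if feasible.size > 0 else np.arange(self.nA)
        return candidates[np.argmax(self.Qr[h, s, candidates])]

    def update(self, h, s, a, r, c, s_next):
        self.counts[h, s, a] += 1
        t = self.counts[h, s, a]
        lr = (self.H + 1) / (self.H + t)                 # horizon-dependent step size
        bonus = self.bonus_scale * np.sqrt(self.H / t)   # simplified UCB exploration bonus
        vr_next = self.Qr[h + 1, s_next].max() if h + 1 < self.H else 0.0
        vc_next = self.Qc[h + 1, s_next].min() if h + 1 < self.H else 0.0
        self.Qr[h, s, a] += lr * (r + vr_next + bonus - self.Qr[h, s, a])
        self.Qc[h, s, a] += lr * (c + vc_next - self.Qc[h, s, a])


# Toy usage on a random episodic MDP with rewards and costs drawn uniformly from [0, 1].
rng = np.random.default_rng(0)
agent = ConstrainedUCBQAgent(n_states=5, n_actions=3, horizon=4, cost_budget=2.0)
for _ in range(1000):
    s = 0
    for h in range(agent.H):
        a = agent.act(h, s)
        s_next = int(rng.integers(agent.nS))
        r, c = rng.random(), rng.random()
        agent.update(h, s, a, r, c, s_next)
        s = s_next
```

The sketch corresponds most closely to the soft constrained setting, where per-episode violations may still occur during learning; a hard constrained variant would additionally have to certify feasibility before acting rather than relying on empirical cost estimates alone.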