We consider a setting in which the agent aims to maximize the expected cumulative reward, subject to the constraint that the entropic risk of the total utility exceeds a given threshold. Unlike in the risk-neutral case, standard primal-dual approaches do not directly yield regret and violation bounds, because value iteration with respect to a combined state-action value function is not applicable in the risk-sensitive setting. To address this, we adopt the Optimized Certainty Equivalent (OCE) representation of the entropic risk measure and reformulate the problem by augmenting the state space with a continuous budget variable. We then propose a primal-dual algorithm tailored to this augmented formulation. In contrast to the standard approach for risk-neutral CMDPs, our method incorporates a truncated dual update to account for the possible absence of strong duality. We show that the proposed algorithm achieves regret of $\tilde{\mathcal{O}}\big(V_{g,\max}K^{3/4} + \sqrt{H^4 S^2 A \log(1/\delta)}\,K^{3/4}\big)$ and constraint violation of $\tilde{\mathcal{O}}\big(V_{g,\max} \sqrt{H^3 S^2 A \log(1/\delta)}\,K^{3/4}\big)$ with probability at least $1-\delta$, where $S$ and $A$ denote the cardinalities of the state and action spaces, respectively, $H$ is the episode length, $K$ is the number of episodes, $\alpha < 0$ is the risk-aversion parameter, and $V_{g,\max} = \frac{1}{|\alpha|}(\exp(|\alpha|H) - 1)$. *To the best of our knowledge, this is the first result establishing sublinear regret and violation bounds for the risk-sensitive CMDP problem.*
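For readers less familiar with the OCE form, the identity underlying the budget-augmented reformulation is a standard fact about the entropic risk measure; we sketch it below in illustrative notation (the cumulative utility $U$, the exponential utility $u_\alpha$, and the budget variable $b$ are symbols introduced here for this sketch):

$$
\frac{1}{\alpha}\log\mathbb{E}\big[e^{\alpha U}\big]
\;=\;
\sup_{b \in \mathbb{R}}\Big\{\, b + \mathbb{E}\big[u_\alpha(U - b)\big] \Big\},
\qquad
u_\alpha(t) := \frac{1}{\alpha}\big(e^{\alpha t} - 1\big), \quad \alpha < 0,
$$

with the supremum attained at $b^\star = \frac{1}{\alpha}\log\mathbb{E}\big[e^{\alpha U}\big]$. The inner term $\mathbb{E}\big[u_\alpha(U - b)\big]$ is an ordinary expectation of a (nonlinear) utility of the return, which is what makes it possible, after carrying $b$ as a continuous component of the augmented state, to recover an expected-value structure to which primal-dual value-iteration arguments can be applied.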