We assess whether AI systems can credibly evaluate investment risk appetite, a task that must be thoroughly validated before it is automated. We analyze proprietary systems (GPT, Claude, Gemini) and open-weight models (LLaMA, DeepSeek, Mistral) using carefully curated user profiles that reflect real users with varying attributes such as country and gender. We find that the models' score distributions vary significantly when attributes that should not influence risk computation, such as country or gender, are changed. For example, GPT-4o assigns higher risk scores to Nigerian and Indonesian profiles. While some models align closely with expected scores in the low- and mid-risk ranges, none maintain consistent scores across regions and demographics, thereby violating AI and finance regulations.