2024-10-29 | | Total: 39
Human image datasets used to develop and evaluate technology should represent the diversity of human phenotypes, including skin tone. Datasets that include skin tone information frequently rely on manual skin tone ratings based on the Fitzpatrick Skin Type (FST) or the Monk Skin Tone (MST) scales in lieu of the actual measured skin tone of the image dataset subjects. However, perceived skin tone is subject to known biases and skin tone appearance in digital images can vary substantially depending on the capture camera and environment, confounding manual ratings. Surprisingly, the relationship between skin-tone ratings and measured skin tone has not been explored. To close this research gap, we measured the relationship between skin tone ratings from existing scales (FST, MST) and skin tone values measured by a calibrated colorimeter. We also propose and assess a novel Colorimetric Skin Tone (CST) scale developed based on prior colorimetric measurements. Using experiments requiring humans to rate their own skin tone and the skin tone of subjects in images, we show that the new CST scale is more sensitive, consistent, and colorimetrically accurate. While skin tone ratings appeared to correct for some color variation across images, they introduced biases related to race and other factors. These biases must be considered before using manual skin-tone ratings in technology evaluations or for engineering decisions.
Assessing employees' well-being has become central to fostering an environment where employees can thrive and contribute to companies' adaptability and competitiveness in the market. Traditional methods for assessing well-being often face significant challenges, with a major issue being the lack of trust and confidence employees may have in these processes. Employees may hesitate to provide honest feedback due to concerns not only about data integrity and confidentiality, but also about power imbalances among stakeholders. In this context, blockchain-based decentralised surveys, leveraging the immutability, transparency, and pseudo-anonymity of blockchain technology, offer significant improvements in aligning responsive actions with employees' feedback securely and transparently. Nevertheless, their implementation raises complex issues regarding the balance between trust and confidence. While blockchain can function as a confidence machine for data processing and management, it does not inherently address the equally important cultural element of trust. To effectively integrate blockchain technology into well-being assessments, decentralised well-being surveys must be supported by cultural practices that build and sustain trust. Drawing on blockchain technology management and relational cultural theory, we explain how trust-building can be achieved through the co-production of decentralised well-being surveys, which helps address power imbalances between the implementation team and stakeholders. Our goal is to provide a dual cultural-technological framework along with conceptual clarity on how the technological implementation of confidence can connect with the cultural development of trust, ensuring that blockchain-based decentralised well-being surveys are not only secure and reliable but also perceived as a trustworthy vector to improve workplace conditions.
A century ago, John Dewey observed that '[s]team and electricity have done more to alter the conditions under which men associate together than all the agencies which affected human relationships before our time'. In the last few decades, computing technologies have had a similar effect. Political philosophy's central task is to help us decide how to live together, by analysing our social relations, diagnosing their failings, and articulating ideals to guide their revision. But these profound social changes have left scarcely a dent in the model of social relations that (analytical) political philosophers assume. This essay aims to reverse that trend. It first builds a model of our novel social relations as they are now, and as they are likely to evolve, and then explores how those differences affect our theories of how to live together. I introduce the 'Algorithmic City', the network of algorithmically-mediated social relations, then characterise the intermediary power by which it is governed. I show how algorithmic governance raises new challenges for political philosophy concerning the justification of authority, the foundations of procedural legitimacy, and the possibility of justificatory neutrality.
Algorithmic intermediaries govern the digital public sphere through their architectures, amplification algorithms, and moderation practices. In doing so, they shape public communication and distribute attention in ways that were previously infeasible with such subtlety, speed and scale. From misinformation and affective polarisation to hate speech and radicalisation, the many pathologies of the digital public sphere attest that they could do so better. But what ideals should they aim at? Political philosophy should be able to help, but existing theories typically assume that a healthy public sphere will spontaneously emerge if only we get the boundaries of free expression right. They offer little guidance on how to intentionally constitute the digital public sphere. In addition to these theories focused on expression, we need a further theory of communicative justice, targeted specifically at the algorithmic intermediaries that shape communication and distribute attention. This lecture argues that political philosophy urgently owes an account of how to govern communication in the digital public sphere, and introduces and defends a democratic egalitarian theory of communicative justice.
India produces about nine hundred thousand (900K) engineers annually, and many seek computer science and related technology jobs. Given that the IT workforce in India is still young, new graduates get jobs only when the industry grows. A liberal estimate based on the data from MeitY (Ministry of Electronics and Information Technology) and NASSCOM puts the annual job growth to three hundred thousand (300K), less than one-third of the graduation rate. In other words, about half a million graduates don't get a job every year (even when we consider that some students don't opt for jobs or go for higher studies). This position paper demonstrates that given the current growth rate of the Indian economy, such a significant shortfall will continue to exist. It then proposes a way to address this shortfall. The paper proposes to develop micro-entrepreneurs at scale, enabling many graduates to start micro-enterprises focused on AI, Software, and Technology (MAST). These MAST enterprises offer technology products and services to meet the hyperlocal needs of the businesses and individuals in the local community (a retailer in the neighborhood, a high net-worth person, or a factory). Such an endeavor will require curricular, policy, and societal interventions. The paper presents an approach to enable MAST education across campuses, outlining the key curricular changes required and important policies that must be created and implemented. This supply-demand gap is an existential problem for engineering education in India, and this position paper aims to trigger debates and collaborations to devise solutions that will work at India scale.
This study uses neural networks to evaluate the 2024 judicial reform in Mexico, a proposal designed to overhaul the judicial system by increasing transparency and judicial autonomy and by introducing the popular election of judges. The neural network model analyzes both converging and diverging factors that influence the reform's viability and public acceptance. Key areas of convergence include enhanced transparency and judicial autonomy, which are seen as improvements to the system. However, major points of divergence, such as the high costs of implementation and concerns about the legitimacy of electing judges, pose significant challenges. By integrating variables like transparency, decision quality, judicial independence, and implementation costs, the model predicts levels of public and professional acceptance of the reform. The neural network's multilayered structure allows for the modeling of complex relationships, offering predictive insights into how the reform may impact the Mexican judicial system. Initial findings suggest that while the reform could strengthen judicial autonomy, the risks of politicizing the judiciary and the financial burden it entails may reduce its overall acceptance. This research highlights the importance of using advanced AI tools to simulate public policy outcomes, providing valuable data to guide lawmakers in refining their proposals.
Decentralized Metaverses, built on Web 3.0 and Web 4.0 technologies, have attracted significant attention across various fields. This innovation leverages blockchain, Decentralized Autonomous Organizations (DAOs), Extended Reality (XR) and advanced technologies to create immersive and interconnected digital environments that mirror the real world. This article delves into the Metaverse of Everything (MoE), a platform that fuses the Metaverse concept with the Internet of Everything (IoE), an advanced version of the Internet of Things (IoT) that connects not only physical devices but also people, data and processes within a networked environment. Thus, the MoE integrates generated data and virtual entities, creating an extensive network of interconnected components. This article seeks to advance current MoE, examining decentralization and the application of Opportunistic Edge Computing (OEC) for interactions with surrounding IoT devices and IoE entities. Moreover, it outlines the main challenges to guide researchers and businesses towards building a future cyber-resilient opportunistic MoE.
Infrastructure maintenance is inherently complex, especially for widely dispersed transport systems like roads and railroads. Maintaining this infrastructure involves multiple partners working together to ensure safe, efficient upkeep that meets technical and safety standards, with timely delivery of materials and adherence to budget. Traditionally, these requirements are managed on paper, with each contract step checked manually. Smart contracts, based on blockchain distributed ledger technology, offer a new approach. Distributed ledgers facilitate secure, transparent transactions, enabling decentralized agreements where contract terms automatically execute when conditions are met. Beyond financial transactions, blockchains can track complex agreements, recording each stage of contract fulfillment between multiple parties. A smart contract is a set of coded rules stored on the blockchain that automatically executes each term upon meeting specified conditions. In infrastructure maintenance, this enables end-to-end automation, from contractor assignment to maintenance completion. Using an immutable, decentralized record, contract terms and statuses are transparent to all parties, enhancing trust and efficiency. Creating smart contracts for infrastructure requires a comprehensive understanding of procedural workflows to foresee all requirements and liabilities. This workflow includes continuous infrastructure monitoring through a dynamic, data-driven maintenance model that triggers necessary actions. Modern process mining can develop a resilient Maintenance Process Model, helping Operations Management to define contract terms, including asset allocation, logistics, materials, and skill requirements. Automation and reliable data quality across the procedural chain are essential, supported by IoT sensors, big data analytics, predictive maintenance, intelligent logistics, and asset management.
The significance of open data in higher education stems from the changing tendencies towards open science, and open research in higher education encourages new ways of making scientific inquiry more transparent, collaborative and accessible. This study focuses on the critical role of open data stewards in this transition, who are essential for managing and disseminating research data effectively in universities. It also highlights the increasing demand for structured training and professional policies for data stewards in academic settings. Building upon this context, the paper investigates the essential skills and competences required for effective data stewardship in higher education institutions. A critical literature review, coupled with practical engagement in open data stewardship at universities, provided insights into the roles and responsibilities of data stewards. In response to these identified needs and to the gaps identified in the literature, the paper proposes a structured training framework and comprehensive curriculum for data stewardship. It addresses five key competence categories for open data stewards, aligning them with current trends and essential skills and knowledge in the field. By advocating for a structured approach to data stewardship education, this work sets the foundation for improved data management in universities and serves as a critical step towards professionalizing the role of data stewards in higher education. The emphasis on the role of open data stewards is expected to advance data accessibility and sharing practices, fostering increased transparency, collaboration, and innovation in academic research. This approach contributes to the evolution of universities into open ecosystems, where there is free flow of data for global education and research advancement.
Early identification of Autism Spectrum Disorder (ASD) is considered critical for effective intervention to mitigate emotional, financial and societal burdens. Although ASD belongs to a group of neurodevelopmental disabilities that are not curable, researchers agree that targeted interventions during childhood can drastically improve the overall well-being of individuals. However, conventional ASD detection methods, such as screening tests, are often costly and time-consuming. Because ground truth labels are difficult to obtain for this task, this study presents a novel semi-supervised approach for ASD detection using AutoEncoder-based Machine Learning (ML) methods. Our approach utilizes data collected manually through a serious game specifically designed for this purpose. Since the sensitive data collected by the gamified application are susceptible to privacy leakage, we developed a Federated Learning (FL) framework that can enhance user privacy without compromising the overall performance of the ML models. The framework is further enhanced with Fully Homomorphic Encryption (FHE) during model aggregation to minimize the possibility of inference attacks, and with client selection mechanisms and state-of-the-art aggregators to improve the model's predictive accuracy. Our results demonstrate that semi-supervised FL can effectively predict an ASD risk indicator for each case while simultaneously addressing privacy concerns.
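As an illustration of the model aggregation step mentioned in the abstract above, here is a minimal sketch of plain federated averaging (FedAvg), in which a server combines client parameter vectors weighted by each client's sample count. The abstract does not name its aggregators, so the choice of FedAvg, the function name `federated_average`, and the flat-list parameter format are all assumptions; the FHE layer and client selection described in the abstract are omitted.

```python
def federated_average(client_weights, client_sizes):
    """Weighted mean of client parameter vectors (plain FedAvg sketch).

    client_weights: list of equal-length lists of floats, one per client
                    (hypothetical flat parameter format).
    client_sizes:   number of local training samples held by each client.
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    # Each global parameter is the sample-weighted mean of the client values.
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two clients: the second holds three times as much data as the first.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

In a setting like the one the abstract describes, the weighted sum would presumably be computed over encrypted updates, so the server would never see individual client parameters in the clear.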
This report introduces the Grant Maturity Index (GMI), a novel evaluative framework designed to assess the maturity and operational effectiveness of Web3 grant programs. As Web3 continues to develop, the decentralized nature of these programs brings both opportunities and challenges, particularly when it comes to governance, transparency, and community engagement. Traditional funding models are often governed by standardized processes, but Web3 grants lack such consistency, making it difficult for grant operators to measure the long-term success of their programs. The Grant Maturity Index (GMI) was created through exploratory applied research to address this gap. Inspired by the World Bank's GovTech Maturity Index (GTMI), the GMI is tailored specifically for the decentralized Web3 ecosystem. The GMI evaluates key dimensions of grant programs (governance, transparency, operational efficiency, and community engagement), providing grant operators with a clear benchmark for assessing and improving their programs. The primary objectives of this research are, first, to identify the structural indicators that adequately describe Web3 grant programs and, second, to describe optimal outcomes for programs by evaluating their maturity across key operational areas. The GMI is applied to four major Ethereum Layer 2 grant programs, namely Arbitrum, Mantle, Taiko Labs, and Optimism. These case studies highlight areas where Web3 grant programs require improvement, particularly in standardizing processes, enhancing transparency, and increasing community participation.
As artificial intelligence (AI) becomes more integrated into educational environments, how can we ensure that these systems are both understandable and trustworthy? The growing demand for explainability in AI systems is a critical area of focus. This paper explores Human-Centric eXplainable AI (HCXAI) in the educational landscape, emphasizing its role in enhancing learning outcomes, fostering trust among users, and ensuring transparency in AI-driven tools, particularly through the innovative use of large language models (LLMs). What challenges arise in the implementation of explainable AI in educational contexts? This paper analyzes these challenges, addressing the complexities of AI models and the diverse needs of users. It outlines comprehensive frameworks for developing HCXAI systems that prioritize user understanding and engagement, ensuring that educators and students can effectively interact with these technologies. Furthermore, what steps can educators, developers, and policymakers take to create more effective, inclusive, and ethically responsible AI solutions in education? The paper provides targeted recommendations to address this question, highlighting the necessity of prioritizing explainability. By doing so, how can we leverage AI's transformative potential to foster equitable and engaging educational experiences that support diverse learners?
This whitepaper offers an overview of the ethical considerations surrounding research into or with large language models (LLMs). As LLMs become more integrated into widely used applications, their societal impact increases, bringing important ethical questions to the forefront. With a growing body of work examining the ethical development, deployment, and use of LLMs, this whitepaper provides a comprehensive and practical guide to best practices, designed to help those in research and in industry to uphold the highest ethical standards in their work.
ChatGPT, a large language model providing natural language responses, has become a powerful tool integrated into many people's daily routines. Despite its capabilities, the benefits it provides may not be equally distributed among individuals, a phenomenon referred to as the digital divide. Building upon prior literature, we propose two forms of digital divide in the generative AI adoption process: (i) the learning divide, capturing individuals' heterogeneous abilities to update their perceived utility of ChatGPT; and (ii) the utility divide, representing differences in individuals' actual utility gains per usage from ChatGPT. To evaluate these two divides, we develop a Bayesian learning model that incorporates demographic heterogeneities in both the utility and signal functions. Leveraging a six-month clickstream dataset, we estimate the model and find significant learning and utility divides across various demographic attributes. Surprisingly, lower-educated and non-white individuals derive higher utility gains from ChatGPT but learn about its utility at a slower rate. Furthermore, males, younger individuals, and those with an IT background not only derive higher utility per use from ChatGPT but also learn about its utility more rapidly. In addition, we document a phenomenon termed the belief trap, wherein users underestimate ChatGPT's utility, opt not to use the tool, and consequently lack new experiences to update their perceptions, leading to continued underutilization. We further demonstrate that the learning divide can significantly affect the probability of falling into the belief trap, another form of the digital divide in adoption outcomes (i.e., outcome divide); however, offering training programs can alleviate the belief trap and mitigate the divide.
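The belief trap described in the abstract above can be illustrated with a toy Bayesian learner. This is a minimal sketch and not the paper's model: it assumes a normal prior over utility, normally distributed signals, a fixed outside option of 0, and noise-free signal draws for determinism. The names `update_belief`, `simulate`, and `noise_var` are hypothetical. A larger `noise_var` makes each use less informative, which is one way to picture a learning divide.

```python
def update_belief(mean, var, signal, noise_var):
    """Normal-normal conjugate update of a belief about utility."""
    precision = 1.0 / var + 1.0 / noise_var
    new_mean = (mean / var + signal / noise_var) / precision
    return new_mean, 1.0 / precision


def simulate(prior_mean, prior_var, true_utility, noise_var, periods,
             outside_option=0.0):
    """Use the tool only when believed utility beats the outside option.

    Without use there is no signal, so a pessimistic belief never changes.
    Signals equal the true utility here (an assumption made so the
    example is deterministic).
    """
    mean, var, uses = prior_mean, prior_var, 0
    for _ in range(periods):
        if mean <= outside_option:
            continue  # belief trap: no use, no new experience, no update
        mean, var = update_belief(mean, var, true_utility, noise_var)
        uses += 1
    return mean, uses


# A pessimistic user never tries the tool; an optimistic one learns its value.
print(simulate(-0.5, 1.0, 1.0, 1.0, periods=10))
print(simulate(0.2, 1.0, 1.0, 1.0, periods=10))
```

The first user's belief stays at its prior even though the true utility is positive, while the second user's belief moves toward the true utility with each use.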
Chatbots like ChatGPT are used for diverse purposes, ranging from resume writing to entertainment. These real-world applications are different from the institutional uses, such as resume screening or credit scoring, which have been the focus of much of AI research on fairness. Ensuring equitable treatment for all users in these first-person contexts is critical. In this work, we study "first-person fairness," which means fairness toward the chatbot user. This includes providing high-quality responses to all users regardless of their identity or background and avoiding harmful stereotypes. We propose a scalable, privacy-preserving method for evaluating one aspect of first-person fairness across a large, heterogeneous corpus of real-world chatbot interactions. Specifically, we assess potential bias linked to users' names, which can serve as proxies for demographic attributes like gender or race, in chatbot systems such as ChatGPT, which provide mechanisms for storing and using user names. Our method leverages a second language model to privately analyze name-sensitivity in the chatbot's responses. We verify the validity of these annotations through independent human evaluation. Further, we show that post-training interventions, including RL, significantly mitigate harmful stereotypes. Our approach also yields succinct descriptions of response differences across tasks. For instance, in the "writing a story" task, chatbot responses show a tendency to create protagonists whose gender matches the likely gender inferred from the user's name. Moreover, a pattern emerges where users with female-associated names receive responses with friendlier and simpler language slightly more often than users with male-associated names. Finally, we provide the system messages required for external researchers to further investigate ChatGPT's behavior with hypothetical user profiles.
Although LLMs are increasing the productivity of professional programmers, existing work shows that beginners struggle to prompt LLMs to solve text-to-code tasks. Why is this the case? This paper explores two competing hypotheses about the cause of student-LLM miscommunication: (1) students simply lack the technical vocabulary needed to write good prompts, and (2) students do not understand the extent of information that LLMs need to solve code generation tasks. We study (1) with a causal intervention experiment on technical vocabulary and (2) by analyzing graphs that abstract how students edit prompts and the different failures that they encounter. We find that substance beats style: a poor grasp of technical vocabulary is merely correlated with prompt failure; that the information content of prompts predicts success; that students get stuck making trivial edits; and more. Our findings have implications for the use of LLMs in programming education, and for efforts to make computing more accessible with LLMs.
The rise of online dating apps has transformed how individuals connect and seek companionship, with an increase in usage among older adults. While these platforms offer opportunities for emotional support and social connection, they also present significant challenges, including a concerning trend of online dating scams targeting this demographic. To address these issues, we conducted a semi-structured interview focused on the online dating experiences of older adults (65+). Initially, we conducted a pre-screening survey, followed by focused semi-structured interviews with 11 of the selected older adults. Through this study, we investigate older adults' security and privacy concerns, the significance of design elements and accessibility, and identify areas needing improvement. Our findings reveal challenges such as deceptive practices, including catfishing and fraud, concerns over disclosing sensitive information, non-inclusive app design features, and the need for more informative visualization of match requests. We offer recommendations for enhanced identity verification, inclusive privacy controls by app developers, and increased digital literacy efforts to enable older adults to navigate these platforms safely and confidently.
Supervised Learning is a way of developing Artificial Intelligence systems in which a computer algorithm is trained on labeled data inputs. The effectiveness of a Supervised Learning algorithm is determined by its performance on a given dataset for a particular problem. In Supervised Learning problems, Stacking Ensembles usually perform better than individual classifiers due to their generalization ability. Stacking Ensembles combine predictions from multiple Machine Learning algorithms to make final predictions. In spite of Stacking Ensembles' superior performance, their overhead, such as high cost, resources, time, and lack of explainability, creates challenges in real-life applications. This paper shows how we can strike a balance between performance, time, and resource constraints. Another goal of this research is to make Ensembles more explainable and intelligible using the Human-Centered approach. To achieve the aforementioned goals, we propose a Human-Centered Behavior-inspired algorithm that streamlines the Ensemble Learning process while also reducing time, cost, and resource overhead, resulting in the superior performance of Supervised Learning in real-world applications. To demonstrate the effectiveness of our method, we perform our experiments on nine real-world datasets. Experimental results reveal that the proposed method satisfies our goals and outperforms the existing methods.
Large Language Models (LLMs), such as GPT-4 and BERT, have rapidly gained traction in natural language processing (NLP) and are now integral to financial decision-making. However, their deployment introduces critical challenges, particularly in perpetuating gender biases that can distort decision-making outcomes in high-stakes economic environments. This paper investigates gender bias in LLMs through both mathematical proofs and empirical experiments using the Word Embedding Association Test (WEAT), demonstrating that LLMs inherently reinforce gender stereotypes even without explicit gender markers. By comparing the decision-making processes of humans and LLMs, we reveal fundamental differences: while humans can override biases through ethical reasoning and individualized understanding, LLMs maintain bias as a rational outcome of their mathematical optimization on biased data. Our analysis proves that bias in LLMs is not an unintended flaw but a systematic result of their rational processing, which tends to preserve and amplify existing societal biases encoded in training data. Drawing on existentialist theory, we argue that LLM-generated bias reflects entrenched societal structures and highlights the limitations of purely technical debiasing methods. This research underscores the need for new theoretical frameworks and interdisciplinary methodologies that address the ethical implications of integrating LLMs into economic and financial decision-making. We advocate for a reconceptualization of how LLMs influence economic decisions, emphasizing the importance of incorporating human-like ethical considerations into AI governance to ensure fairness and equity in AI-driven financial systems.
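For readers unfamiliar with the Word Embedding Association Test named in the abstract above, the following is a minimal sketch of its effect-size statistic as commonly defined: the difference in mean association of two target word sets X and Y with two attribute sets A and B, divided by the standard deviation of the associations over all target words. The two-dimensional toy vectors are invented for illustration and are not data from the paper. The use of the sample standard deviation is an assumption, since implementations differ on this point.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, A, B):
    """s(w, A, B): mean similarity to attribute set A minus that to B."""
    return (sum(cosine(w, a) for a in A) / len(A)
            - sum(cosine(w, b) for b in B) / len(B))


def weat_effect_size(X, Y, A, B):
    """Difference of mean associations of X and Y, in std-dev units."""
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    pooled = sx + sy
    mean = sum(pooled) / len(pooled)
    # Sample standard deviation over all target words (an assumption).
    std = math.sqrt(sum((s - mean) ** 2 for s in pooled) / (len(pooled) - 1))
    return (sum(sx) / len(sx) - sum(sy) / len(sy)) / std


# Invented toy embeddings: X leans toward attribute A, Y toward attribute B.
X = [[1.0, 0.0], [0.9, 0.1]]
Y = [[0.0, 1.0], [0.1, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]
print(weat_effect_size(X, Y, A, B))
```

A positive value indicates that the X words sit closer to attribute set A than the Y words do; swapping X and Y flips the sign.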
Power outages have become increasingly frequent, intense, and prolonged in the US due to climate change, aging electrical grids, and rising energy demand. However, largely due to the absence of granular spatiotemporal outage data, we lack data-driven evidence and analytics-based metrics to quantify power system vulnerability. This limitation has hindered the ability to effectively evaluate and address vulnerability to power outages in US communities. Here, we collected ~179 million power outage records at 15-minute intervals across 3022 US contiguous counties (96.15% of the area) from 2014 to 2023. We developed a power system vulnerability assessment framework based on three dimensions (intensity, frequency, and duration) and applied interpretable machine learning models (XGBoost and SHAP) to compute Power System Vulnerability Index (PSVI) at the county level. Our analysis reveals a consistent increase in power system vulnerability over the past decade. We identified 318 counties across 45 states as hotspots for high power system vulnerability, particularly in the West Coast (California and Washington), the East Coast (Florida and the Northeast area), the Great Lakes megalopolis (Chicago-Detroit metropolitan areas), and the Gulf of Mexico (Texas). Heterogeneity analysis indicates that urban counties, counties with interconnected grids, and states with high solar generation exhibit significantly higher vulnerability. Our results highlight the significance of the proposed PSVI for evaluating the vulnerability of communities to power outages. The findings underscore the widespread and pervasive impact of power outages across the country and offer crucial insights to support infrastructure operators, policymakers, and emergency managers in formulating policies and programs aimed at enhancing the resilience of the US power infrastructure.
This work contributes to the field of Machine Ethics (ME) benchmarking, which develops tests to assess whether intelligent systems accurately represent human values and act accordingly. We identify three major issues with current ME benchmarks: limited ecological validity due to unrealistic ethical dilemmas, unstructured question generation without clear inclusion/exclusion criteria, and a lack of scalability due to reliance on human annotations. Moreover, benchmarks often fail to include sufficient syntactic variations, reducing the robustness of findings. To address these gaps, we introduce two new ME benchmarks: the Triage Benchmark and the Medical Law (MedLaw) Benchmark, both featuring real-world ethical dilemmas from the medical domain. The MedLaw Benchmark, fully AI-generated, offers a scalable alternative. We also introduce context perturbations in our benchmarks to assess models' worst-case performance. Our findings reveal that ethics prompting does not always improve decision-making. Furthermore, context perturbations not only significantly reduce model performance but can also reverse error patterns and shift relative performance rankings. Lastly, our comparison of worst-case performance suggests that general model capability does not always predict strong ethical decision-making. We argue that ME benchmarks must approximate real-world scenarios and worst-case performance to ensure robust evaluation.
The evaluation of learning effectiveness requires the integration of objective test results and analysis of uncertain subjective evaluations. Fuzzy theory methods are suitable for handling fuzzy information and uncertainty to obtain comprehensive and accurate evaluation results. In this paper, we develop a Swing-based multi-attribute group decision-making (MAGDM) method under interval-valued q-rung orthopair fuzzy sets (IVq-ROFSs). Firstly, an extended interval-valued q-rung orthopair fuzzy Weber ordered weighted average (IVq-ROFWOWA) operator is introduced. Then a method for deriving attribute weights is designed using the optimized Swing algorithm. Furthermore, we develop a MAGDM method for evaluating students' learning effectiveness using the IVq-ROFWOWA operator and the Swing algorithm. Finally, a case of evaluating students' learning effectiveness is illustrated using the proposed MAGDM method. The implementation results demonstrate that the proposed MAGDM method is feasible and effective, and that the Swing algorithm provides better differentiation in ranking alternatives compared to other methods.
This paper leverages insights from Alignment Theory (AT) research, which primarily focuses on the potential pitfalls of technical alignment in Artificial Intelligence, to critically examine the European Union's Artificial Intelligence Act (EU AI Act). In the context of AT research, several key failure modes, such as proxy gaming, goal drift, reward hacking, or specification gaming, have been identified. These can arise when AI systems are not properly aligned with their intended objectives. The central question of this report is: what can we learn if we treat regulatory efforts in the same way as we treat advanced AI systems? As we systematically apply these concepts to the EU AI Act, we uncover potential vulnerabilities and areas for improvement in the regulation.
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.
As language models (LMs) become integral to fields like healthcare, law, and journalism, their ability to differentiate between fact, belief, and knowledge is essential for reliable decision-making. Failure to grasp these distinctions can lead to significant consequences in areas such as medical diagnosis, legal judgments, and dissemination of fake news. Despite this, current literature has largely focused on more complex issues such as theory of mind, overlooking more fundamental epistemic challenges. This study systematically evaluates the epistemic reasoning capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE, consisting of 13,000 questions across 13 tasks. Our results reveal key limitations. First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios, particularly in belief-related tasks. Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data, which raises concerns for applications in healthcare and counseling, where engaging with a person's beliefs is critical. Third, we identify a salient bias in how LMs process first-person versus third-person beliefs, performing better on third-person tasks (80.7%) compared to first-person tasks (54.4%). Fourth, LMs lack a robust understanding of the factive nature of knowledge, namely, that knowledge inherently requires truth. Fifth, LMs rely on linguistic cues for fact-checking and sometimes bypass the deeper reasoning. These findings highlight significant concerns about current LMs' ability to reason about truth, belief, and knowledge while emphasizing the need for advancements in these areas before broad deployment in critical sectors.