Gender bias in political discourse is a significant problem on today's social media. Previous studies found that the gender of politicians indeed influences the content directed towards them by the general public. However, these works are particularly focused on the global north, which represents individualistic culture. Furthermore, they did not address whether there is gender bias even within the interaction between popular journalists and politicians in the global south. These understudied journalist-politician interactions are important (more so in collectivistic cultures like the global south) as they can significantly affect public sentiment and help set gender-biased social norms. In this work, using large-scale data from Indian Twitter we address this research gap. We curated a gender-balanced set of 100 most-followed Indian journalists on Twitter and 100 most-followed politicians. Then we collected 21,188 unique tweets posted by these journalists that mentioned these politicians. Our analysis revealed that there is a significant gender bias -- the frequency with which journalists mention male politicians vs. how frequently they mention female politicians is statistically significantly different (p<<0.05). In fact, median tweets from female journalists mentioning female politicians received ten times fewer likes than median tweets from female journalists mentioning male politicians. However, when we analyzed tweet content, our emotion score analysis and topic modeling analysis did not reveal any significant gender-based difference within the journalists' tweets towards politicians. Finally, we found a potential reason for the significant gender bias: the number of popular male Indian politicians is almost twice as large as the number of popular female Indian politicians, which might have resulted in the observed bias. We conclude by discussing the implications of this work.
Stopwords are fundamental in Natural Language Processing (NLP) techniques for information retrieval. One of the common tasks in preprocessing of text data is the removal of stopwords. Currently, while high-resource languages like English benefit from the availability of several stopword lists, low-resource languages, such as those found in the African continent, have none that are standardized and available for use in NLP packages. Stopwords in the context of African languages are understudied and can reveal information about the crossover between languages. The \textit{African Stopwords} project aims to study and curate stopwords for African languages. In this paper, we present our current progress on ten African languages as well as future plans for the project.
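As an illustration of the preprocessing step described above, the following is a minimal sketch of stopword removal. The tokenizer (lowercasing and whitespace splitting), the function name `remove_stopwords`, and the toy English list are all assumptions made for the example; they are not taken from the African Stopwords project, whose curated lists would take the place of `TOY_STOPWORDS`.

```python
def remove_stopwords(text, stopwords):
    """Lowercase the text, split on whitespace, and drop tokens in `stopwords`."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stopwords]


# Hypothetical placeholder list; a curated per-language list would be used instead.
TOY_STOPWORDS = {"the", "on", "a", "of"}

filtered = remove_stopwords("The cat sat on the mat", TOY_STOPWORDS)
print(filtered)
```

Keeping the list as a plain set makes membership checks constant-time and lets the same function be reused for any language once a list exists for it.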
A catalogue of African Doctorates in Mathematics was compiled and published in 2007 by the late Professor Paulus Gerdes. In this paper, we revise and update the list of mathematicians from Burkina Faso. Starting from a short description of Burkina Faso, we briefly describe the education system and the mathematics programme in Burkina Faso from the pre-school level to university level. A list of mathematicians native to Burkina Faso is given.
This paper studies the Multi-period Travelling Politician Problem whose objective is to maximise the net benefit accrued by a party leader during a fixed campaign period. The problem is also characterised by flexible depots since the daily tours realised by the party leader may not start and end at the same city. A hybrid multi-start Iterated Local Search method complemented with a Variable Neighbourhood Descent is developed to solve the problem heuristically. Two constructive procedures are devised to generate initial feasible solutions. The proposed method is tested on 45 problem instances involving 81 cities and 12 towns in Turkey. Computational results show that the hybrid metaheuristic approach outperforms a recently proposed two-phase matheuristic by producing 7 optimal solutions and 17 new best solutions. In addition, interesting practical insights are provided using scenario analysis that could assist campaign planners in their strategic decisions.
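To make the search strategy named above more concrete, here is a rough sketch of a multi-start Iterated Local Search whose local search step is a Variable Neighbourhood Descent. It is deliberately simplified and is not the authors' method: it minimises the length of a single closed tour over points in the plane instead of maximising net benefit over several campaign days with flexible depots. The two neighbourhoods (swap and 2-opt), the segment-shuffle perturbation, and all function names and parameter defaults are assumptions chosen for illustration.

```python
import math
import random


def tour_length(tour, pts):
    """Length of the closed tour visiting pts in the given order."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))


def swap_neighbours(tour):
    """All tours obtained by exchanging two positions."""
    for i in range(len(tour) - 1):
        for j in range(i + 1, len(tour)):
            cand = tour[:]
            cand[i], cand[j] = cand[j], cand[i]
            yield cand


def two_opt_neighbours(tour):
    """All tours obtained by reversing one contiguous segment."""
    for i in range(len(tour) - 1):
        for j in range(i + 2, len(tour) + 1):
            yield tour[:i] + tour[i:j][::-1] + tour[j:]


def vnd(tour, pts):
    """Variable Neighbourhood Descent: return to the first neighbourhood after
    every improvement, move to the next one when none is found."""
    neighbourhoods = [swap_neighbours, two_opt_neighbours]
    best_len = tour_length(tour, pts)
    k = 0
    while k < len(neighbourhoods):
        improved = False
        for cand in neighbourhoods[k](tour):
            cand_len = tour_length(cand, pts)
            if cand_len < best_len - 1e-12:
                tour, best_len, improved = cand, cand_len, True
                break
        k = 0 if improved else k + 1
    return tour, best_len


def perturb(tour, rng):
    """Shuffle a random segment to escape the current local optimum."""
    i, j = sorted(rng.sample(range(len(tour)), 2))
    segment = tour[i:j + 1]
    rng.shuffle(segment)
    return tour[:i] + segment + tour[j + 1:]


def iterated_local_search(pts, starts=3, iters=30, seed=0):
    """Multi-start ILS: random start, VND, then perturb-and-descend rounds."""
    rng = random.Random(seed)
    best, best_len = None, math.inf
    for _ in range(starts):
        tour = list(range(len(pts)))
        rng.shuffle(tour)
        tour, cur_len = vnd(tour, pts)
        for _ in range(iters):
            cand, cand_len = vnd(perturb(tour, rng), pts)
            if cand_len < cur_len:  # accept only improving solutions
                tour, cur_len = cand, cand_len
        if cur_len < best_len:
            best, best_len = tour, cur_len
    return best, best_len
```

Calling `iterated_local_search(points)` with a list of `(x, y)` tuples returns a visiting order and its length. The acceptance rule here only keeps improvements; other acceptance criteria are possible and the paper may use a different one.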
Science journalism reports current scientific discoveries to non-specialists, aiming to enable public comprehension of the state of the art. This task is challenging as the audience often lacks specific knowledge about the presented research. We propose a JRE-L framework that integrates three LLMs mimicking the writing-reading-feedback-revision loop. In JRE-L, one LLM acts as the journalist, another LLM as the general public reader, and the third LLM as an editor. The journalist's writing is iteratively refined by feedback from the reader and suggestions from the editor. Our experiments demonstrate that by leveraging the collaboration of two 7B and one 1.8B open-source LLMs, we can generate articles that are more accessible than those generated by existing methods, including prompting single advanced models such as GPT-4 and other LLM-collaboration strategies. Our code is publicly available at github.com/Zzoay/JRE-L.
Slating a product for release often involves pitching journalists to run stories on your press release. Good media coverage often ensures greater product reach and drives audience engagement for those products. Hence, ensuring that those releases are pitched to the right journalists with relevant interests is crucial, since they receive several pitches daily. Keeping up with journalist beats and curating a media contacts list is often a huge and time-consuming task. This study proposes a model to automate and expedite the process by recommending suitable journalists to run media coverage on the press releases provided by the user.
While far-field multi-talker mixtures are recorded, each speaker can wear a close-talk microphone so that close-talk mixtures can be recorded at the same time. Although each close-talk mixture has a high signal-to-noise ratio (SNR) of the wearer, it has a very limited range of applications, as it also contains significant cross-talk speech by other speakers and is not clean enough. In this context, we propose a novel task named cross-talk reduction (CTR) which aims at reducing cross-talk speech, and a novel solution named CTRnet which is based on unsupervised or weakly-supervised neural speech separation. In unsupervised CTRnet, close-talk and far-field mixtures are stacked as input for a DNN to estimate the close-talk speech of each speaker. It is trained in an unsupervised, discriminative way such that the DNN estimate for each speaker can be linearly filtered to cancel out the speaker's cross-talk speech captured at other microphones. In weakly-supervised CTRnet, we assume the availability of each speaker's activity timestamps during training, and leverage them to improve the training of unsupervised CTRnet. Evaluation results on a simulated two-speaker CTR task and on a real-recorded conversational speech separation and recognition task show the effectiveness and potential of CTRnet.
African languages are numerous, complex and low-resourced. The datasets required for machine translation are difficult to discover, and existing research is hard to reproduce. Minimal attention has been given to machine translation for African languages so there is scant research regarding the problems that arise when using machine translation techniques. To begin addressing these problems, we trained models to translate English to five of the official South African languages (Afrikaans, isiZulu, Northern Sotho, Setswana, Xitsonga), making use of modern neural machine translation techniques. The results obtained show the promise of using neural machine translation techniques for African languages. By providing reproducible publicly-available data, code and results, this research aims to provide a starting point for other researchers in African machine translation to compare to and build upon.
As language and speech technologies become more advanced, the lack of fundamental digital resources for African languages, such as data, spell checkers and Part of Speech taggers, means that the digital divide between these languages and others keeps growing. This work details the organisation of the AI4D - African Language Dataset Challenge, an effort to incentivize the creation, organization and discovery of African language datasets through a competitive challenge. We particularly encouraged the submission of annotated datasets which can be used for training task-specific supervised machine learning models.
(ABRIDGED) We report the genome sequencing of 139 wild-derived strains of D. melanogaster, representing 22 population samples from the sub-Saharan ancestral range of this species, along with one European population. Most genomes were sequenced above 25X depth from haploid embryos. Results indicated a pervasive influence of non-African admixture in many African populations, motivating the development and application of a novel admixture detection method. Admixture proportions varied among populations, with greater admixture in urban locations. Admixture levels also varied across the genome, with localized peaks and valleys suggestive of a non-neutral introgression process. Genomes from the same location differed starkly in ancestry, suggesting that isolation mechanisms may exist within African populations. After removing putatively admixed genomic segments, the greatest genetic diversity was observed in southern Africa (e.g. Zambia), while diversity in other populations was largely consistent with a geographic expansion from this potentially ancestral region. The European population showed different levels of diversity reduction on each chromosome arm, and some African populations displayed chromosome arm-specific diversity reductions. Inversions in the European sample were associated with strong elevations in diversity across chromosome arms. Genomic scans were conducted to identify loci that may represent targets of positive selection. A disproportionate number of candidate selective sweep regions were located near genes with varied roles in gene regulation. Outliers for Europe-Africa FST were found to be enriched in genomic regions of locally elevated cosmopolitan admixture, possibly reflecting a role for some of these loci in driving the introgression of non-African alleles into African populations.
Useful conversational agents must accurately capture named entities to minimize error for downstream tasks, for example, asking a voice assistant to play a track from a certain artist, initiating navigation to a specific location, or documenting a laboratory result for a patient. However, where named entities such as ``Ukachukwu`` (Igbo), ``Lakicia`` (Swahili), or ``Ingabire`` (Rwandan) are spoken, automatic speech recognition (ASR) models' performance degrades significantly, propagating errors to downstream systems. We model this problem as a distribution shift and demonstrate that such model bias can be mitigated through multilingual pre-training, intelligent data augmentation strategies to increase the representation of African-named entities, and fine-tuning multilingual ASR models on multiple African accents. The resulting fine-tuned models show an 81.5\% relative WER improvement compared with the baseline on samples with African-named entities.
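For readers unfamiliar with the metric reported above, the following sketch shows how word error rate (WER) and a relative WER improvement are commonly computed. It assumes the usual definitions (word-level edit distance divided by reference length, and the baseline-normalised reduction); the abstract does not spell out its exact evaluation code, and the function names and example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1,      # deletion
                          curr[j - 1] + 1,  # insertion
                          substitution)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer, new_wer):
    """Fraction of the baseline WER removed by the new model."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical transcripts: the name is split into three wrong words.
wer = word_error_rate("call doctor ingabire now", "call doctor in gabby ray now")
print(wer)
```

A relative improvement is measured against the baseline error, so it can be large even when the absolute WER difference is modest.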
Advances in speech and language technologies enable tools such as voice-search, text-to-speech, speech recognition and machine translation. These are however only available for high resource languages like English, French or Chinese. Without foundational digital resources for African languages, which are considered low-resource in the digital context, these advanced tools remain out of reach. This work details the AI4D - African Language Program, a 3-part project that 1) incentivised the crowd-sourcing, collection and curation of language datasets through an online quantitative and qualitative challenge, 2) supported research fellows for a period of 3-4 months to create datasets annotated for NLP tasks, and 3) hosted competitive Machine Learning challenges on the basis of these datasets. Key outcomes of the work so far include 1) the creation of 9+ open source, African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets through hosting of competitive ML challenges.
The advent of international wideband communication by optical fibre has produced a revolution in communications and the use of the internet. Many African countries are now connected to undersea fibre linking them to other African countries and to other continents. Previously international communication was by microwave links through geostationary satellites. These are becoming redundant in some countries as optical fibre takes over, as this provides 1000 times the bandwidth of the satellite links. In the 1970s and 1980s some two dozen large (30 m diameter class) antennas were built in various African countries to provide the satellite links. Twenty-six are currently known in 19 countries. As these antennas become redundant, the possibility exists to convert them for radio astronomy at a cost of roughly one tenth that of a new antenna of similar size. HartRAO, SKA Africa and the South African Department of Science and Technology (DST) have started exploring this possibility with some of the African countries.
Pretrained language models (PLMs) for African languages are continually improving, but the reasons behind these advances remain unclear. This paper presents the first systematic investigation into probing PLMs for linguistic knowledge about African languages. We train layer-wise probes for six typologically diverse African languages to analyse how linguistic features are distributed. We also design control tasks, a way to interpret probe performance, for the MasakhaPOS dataset. We find PLMs adapted for African languages to encode more linguistic information about target languages than massively multilingual PLMs. Our results reaffirm previous findings that token-level syntactic information concentrates in middle-to-last layers, while sentence-level semantic information is distributed across all layers. Through control tasks and probing baselines, we confirm that performance reflects the internal knowledge of PLMs rather than probe memorisation. Our study applies established interpretability techniques to African-language PLMs. In doing so, we highlight the internal mechanisms underlying the success of strategies like active learning and multilingual adaptation.
Sentiment analysis is a fundamental and valuable task in NLP. However, due to limitations in data and technological availability, research into sentiment analysis of African languages has been fragmented and lacking. With the recent release of the AfriSenti-SemEval Shared Task 12, hosted as a part of The 17th International Workshop on Semantic Evaluation, an annotated sentiment analysis dataset of 14 African languages was made available. We benchmarked and compared current state-of-the-art transformer models across 12 languages and compared the performance of training one-model-per-language versus single-model-all-languages. We also evaluated the performance of standard multilingual models and their ability to learn and transfer cross-lingual representation from non-African to African languages. Our results show that despite work in low resource modeling, more data still produces better models on a per-language basis. Models explicitly developed for African languages outperform other models on all tasks. Additionally, no one-model-fits-all solution exists for a per-language evaluation of the models evaluated. Moreover, for some languages with a smaller sample size, a larger multilingual model may perform better than a dedicated per-language model for sentiment classification.
The newsroom in the online ecosystem is difficult to untangle. With the prevalence of social media, interactions between journalists and individuals have become visible, but a lack of understanding of the inner workings of the information feedback loop in the public sphere leaves most journalists baffled. Can we provide an organized view that characterizes journalist behavior at the individual level, to better understand the ecosystem? To this end, I propose the Poisson Factorization Machine (PFM), a Bayesian analogue of matrix factorization that assumes a Poisson distribution for the generative process. The model generalizes recent studies on Poisson Matrix Factorization to account for temporal interaction, which involves a tensor-like structure, and for label information. Two inference procedures are designed: one based on batch variational EM, and another based on stochastic variational inference that scales efficiently with data size. An important novelty in this note is that I show how to stack layers of PFM to introduce a deep architecture. This work discusses some potential results of applying the model and explains how such latent factors may be useful for analyzing latent behaviors in data exploration.
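The Poisson assumption mentioned above can be illustrated with the observation model of plain Poisson matrix factorization, where each count y_ij is modelled as Poisson with rate equal to the inner product of a row factor and a column factor. The sketch below only evaluates that log-likelihood for given non-negative factors. It does not implement the PFM itself (no priors, temporal tensor structure, label information, stacked layers, or variational inference), and the names `poisson_rates` and `poisson_log_likelihood` are assumptions made for the example.

```python
import math


def poisson_rates(theta, beta):
    """Rate matrix with entries sum_k theta[i][k] * beta[j][k]."""
    return [[sum(t * b for t, b in zip(theta_i, beta_j)) for beta_j in beta]
            for theta_i in theta]


def poisson_log_likelihood(counts, theta, beta):
    """Sum over all cells of log Poisson(y_ij | rate_ij)."""
    rates = poisson_rates(theta, beta)
    total = 0.0
    for count_row, rate_row in zip(counts, rates):
        for y, lam in zip(count_row, rate_row):
            if lam <= 0:
                raise ValueError("rates must be strictly positive")
            # log Poisson pmf: y*log(lam) - lam - log(y!)
            total += y * math.log(lam) - lam - math.lgamma(y + 1)
    return total


# Hypothetical toy data: 2 journalists x 2 accounts, one latent factor.
counts = [[3, 0], [1, 2]]
theta = [[1.5], [0.8]]   # journalist factors
beta = [[2.0], [0.5]]    # account factors
print(poisson_log_likelihood(counts, theta, beta))
```

Because the rate is a sum of non-negative terms, each latent dimension contributes additively to the expected count, which is one reason Poisson factorization is often considered well suited to sparse interaction data.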
Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).
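Since the result above is stated as a Spearman rank correlation with human judgments, a short sketch of how such a correlation can be computed may help. It assumes the standard definition (Pearson correlation of average ranks, with ties sharing the mean rank); the metric scores and human ratings in the usage lines are made up for illustration and are not data from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical metric scores and human direct-assessment ratings.
metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60]
human_ratings = [78, 55, 90, 60, 62]
print(spearman(metric_scores, human_ratings))
```

Rank correlation only asks whether the metric orders translations the way annotators do, so it is insensitive to the scale on which either score is expressed.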
COVID-19 has aided the spread of racism, as well as national insecurity, distrust of immigrants, and general xenophobia, all of which may be linked to the rise in anti-Asian hate crimes during the pandemic. Coronavirus Disease 2019 (COVID-19) is thought to have originated in late December 2019 in Wuhan, China, and quickly spread across the world during the spring months of 2020. Asian Americans recorded an increase in racially based hate crimes, including physical abuse and intimidation, as COVID-19 spread throughout the United States. This research study was conducted by high school students in the Bay Area to compare the intention and characteristics of hate crimes against Asian Americans to hate crimes against African Americans. According to studies of both victim-related and most offender-related variables, hate crimes against Asian Americans have been rapidly growing in the United States and vary from those against African Americans. This leads to an investigation into the racial disparity between Asian American offenders and those of other races. The nature and characteristics of hate crimes against Asian Americans are compared to those of hate crimes against African Americans in our research. According to studies of all victim-related factors, hate crimes against Asian Americans are similar to those against African Americans. Hate crimes against Asian Americans, on the other hand, vary greatly from hate crimes against African Americans in terms of the offender's ethnicity and all incident-related variables.
Low-resource African languages pose unique challenges for natural language processing (NLP) tasks, including natural language generation (NLG). In this paper, we develop Cheetah, a massively multilingual NLG language model for African languages. Cheetah supports 517 African languages and language varieties, allowing us to address the scarcity of NLG resources and provide a solution to foster linguistic diversity. We demonstrate the effectiveness of Cheetah through comprehensive evaluations across six generation downstream tasks. In five of the six tasks, Cheetah significantly outperforms other models, showcasing its remarkable performance for generating coherent and contextually appropriate text in a wide range of African languages. We additionally conduct a detailed human evaluation to delve deeper into the linguistic capabilities of Cheetah. The introduction of Cheetah has far-reaching benefits for linguistic diversity. By leveraging pretrained models and adapting them to specific languages, our approach facilitates the development of practical NLG applications for African communities. The findings of this study contribute to advancing NLP research in low-resource settings, enabling greater accessibility and inclusion for African languages in a rapidly expanding digital landscape. We publicly release our models for research.
In this paper, we focus on the task of multilingual machine translation for African languages and describe our contribution in the 2021 WMT Shared Task: Large-Scale Multilingual Machine Translation. We introduce MMTAfrica, the first many-to-many multilingual translation system for six African languages: Fon (fon), Igbo (ibo), Kinyarwanda (kin), Swahili/Kiswahili (swa), Xhosa (xho), and Yoruba (yor) and two non-African languages: English (eng) and French (fra). For multilingual translation concerning African languages, we introduce a novel backtranslation and reconstruction objective, BT\&REC, inspired by the random online back translation and T5 modeling framework respectively, to effectively leverage monolingual data. Additionally, we report improvements from MMTAfrica over the FLORES 101 benchmarks (spBLEU gains ranging from +0.58 in Swahili to French to +19.46 in French to Xhosa). We release our dataset and source code at https://github.com/edaiofficial/mmtafrica.
This paper maps Africa's distinctive AI risk profile, from deepfake-fuelled electoral interference and data colonial dependency to compute scarcity, labour disruption and disproportionate exposure to climate-driven environmental costs. While major benefits are promised to accrue, the availability, development and adoption of AI also mean that African people and countries face particular AI safety risks, from large-scale labour market disruptions to the nefarious use of AI to manipulate public opinion. To date, African perspectives have not been meaningfully integrated into global debates and processes regarding AI safety, leaving African stakeholders with limited influence over the emerging global AI safety governance agenda. While there are Computer Incident Response Teams on the continent, none hosts a dedicated AI Safety Institute or office. We propose a five-point action plan centred on (i) a policy approach that foregrounds the protection of the human rights of those most vulnerable to experiencing the harmful socio-economic effects of AI; (ii) the establishment of an African AI Safety Institute; (iii) the promotion of public AI literacy and awareness; (iv) the development of an early warning system with inclusive benchmark suites for 25+ African languages; and (v) an annual AU-level AI Safety & Security Forum.
The advancement of speech technologies has been remarkable, yet its integration with African languages remains limited due to the scarcity of African speech corpora. To address this issue, we present AfroDigits, a minimalist, community-driven dataset of spoken digits for African languages, currently covering 38 African languages. As a demonstration of the practical applications of AfroDigits, we conduct audio digit classification experiments on six African languages [Igbo (ibo), Yoruba (yor), Rundi (run), Oshiwambo (kua), Shona (sna), and Oromo (gax)] using the Wav2Vec2.0-Large and XLS-R models. Our experiments reveal a useful insight on the effect of mixing African speech corpora during finetuning. AfroDigits is the first published audio digit dataset for African languages and we believe it will, among other things, pave the way for Afro-centric speech applications such as the recognition of telephone numbers and street numbers. We release the dataset and platform publicly at https://huggingface.co/datasets/chrisjay/crowd-speech-africa and https://huggingface.co/spaces/chrisjay/afro-speech respectively.
Modern speech synthesis techniques can produce natural-sounding speech given sufficient high-quality data and compute resources. However, such data is not readily available for many languages. This paper focuses on speech synthesis for low-resourced African languages, from corpus creation to sharing and deploying the Text-to-Speech (TTS) systems. We first create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources and subject-matter expertise. Next, we create new datasets and curate datasets from "found" data (existing recordings) through a participatory approach while considering accessibility, quality, and breadth. We demonstrate that we can develop synthesizers that generate intelligible speech with 25 minutes of created speech, even when recorded in suboptimal environments. Finally, we release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
Unlike major Western languages, most African languages are very low-resourced. Furthermore, the resources that do exist are often scattered and difficult to obtain and discover. As a result, the data and code for existing research have rarely been shared. This has led to a struggle to reproduce reported results, and few publicly available benchmarks for African machine translation models exist. To start to address these problems, we trained neural machine translation models for 5 Southern African languages on publicly-available datasets. Code is provided for training the models and evaluating them on a newly released evaluation set, with the aim of spurring future research in the field for Southern African languages.
The World Health Organization reports that African trypanosomiasis mostly affects poor populations living in remote rural areas of Africa and can be fatal if not properly treated. This paper presents the use of Dempster-Shafer theory for the detection of African trypanosomiasis. Sustainable elimination of African trypanosomiasis as a public-health problem is feasible and requires continuous efforts and innovative approaches. In this research, we implement Dempster-Shafer theory for detecting African trypanosomiasis and displaying the result of the detection process. We describe eleven major symptoms: fever, red urine, skin rash, paralysis, headache, bleeding around the bite, joint pain, swollen lymph nodes, sleep disturbances, meningitis and arthritis. To quantify the degree of belief, our approach uses Dempster-Shafer theory to combine beliefs under conditions of uncertainty and ignorance, which allows quantitative measurement of the belief and plausibility of our identification result.
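The belief combination described above can be sketched with Dempster's rule of combination. The example below is a generic illustration, not the authors' implementation: the two-hypothesis frame (with malaria as an invented alternative diagnosis), the mass values attached to each symptom, and the function names are all assumptions made for the example.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule: multiply masses of intersecting focal sets and
    renormalise by the mass that did not fall on the empty set."""
    combined = {}
    conflict = 0.0
    for (set_a, mass_a), (set_b, mass_b) in product(m1.items(), m2.items()):
        common = set_a & set_b
        if common:
            combined[common] = combined.get(common, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {focal: mass / (1.0 - conflict) for focal, mass in combined.items()}


def belief(masses, hypothesis):
    """Total mass committed to subsets of the hypothesis."""
    return sum(mass for focal, mass in masses.items() if focal <= hypothesis)


def plausibility(masses, hypothesis):
    """Total mass of focal sets that do not contradict the hypothesis."""
    return sum(mass for focal, mass in masses.items() if focal & hypothesis)


# Hypothetical frame of discernment and symptom evidence (illustrative numbers).
TRYP = frozenset({"trypanosomiasis"})
MALARIA = frozenset({"malaria"})
FRAME = TRYP | MALARIA

fever = {TRYP: 0.6, FRAME: 0.4}                       # mass on FRAME = ignorance
sleep_disturbance = {TRYP: 0.5, MALARIA: 0.2, FRAME: 0.3}

result = combine(fever, sleep_disturbance)
print(belief(result, TRYP), plausibility(result, TRYP))
```

Mass assigned to the whole frame represents ignorance, which is why belief and plausibility can differ: the gap between them is the evidence that neither supports nor contradicts the hypothesis.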