We present a deep neural framework that allows users to create surfaces from a stream of sparse 3D sketch strokes. Our network consists of a global surface estimation module followed by a local surface refinement module, which facilitates incremental prediction of surfaces. Thus, our proposed method works with 3D sketch strokes and estimates a surface interactively in real time. We compare the proposed method with various state-of-the-art methods and show its efficacy for surface fitting. Further, we integrate our method into an existing Blender-based 3D content creation pipeline to show its usefulness in 3D modelling.
We explore the transformer neural network architecture for modeling music, specifically Irish and Swedish traditional dance music. Given the repetitive structures of these kinds of music, the transformer should match the hitherto most successful model, a vanilla long short-term memory network, with fewer parameters and less complexity. We find that achieving good performance with the transformer is not straightforward: careful consideration is needed for the sampling strategy, for evaluating intermediate outputs in relation to engineering choices, and finally for analyzing what the model learns. We discuss these points with several illustrations, providing reusable insights for engineering other music generation systems. We also report the high performance of our final transformer model in a competition of music generation systems focused on a type of Swedish dance.
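As a small aside on the sampling point above, here is a hedged, generic sketch of temperature plus top-k sampling from an autoregressive model's next-token logits. The `next_token_logits` function is a hypothetical stand-in for any trained transformer's forward pass; the paper's actual sampling code is not reproduced here.

```python
# Minimal temperature + top-k sampling sketch for an autoregressive
# music model. `next_token_logits` is a hypothetical stand-in for a
# trained transformer's forward pass.
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """Sample one token id from raw next-token logits."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    if top_k is not None:
        kth = np.partition(logits, -top_k)[-top_k]   # k-th largest logit
        logits = np.where(logits >= kth, logits, -np.inf)
    probs = np.exp(logits - logits.max())            # stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate(next_token_logits, context, length, **kwargs):
    """Autoregressively extend `context` by `length` sampled tokens."""
    tokens = list(context)
    for _ in range(length):
        tokens.append(sample_next(next_token_logits(tokens), **kwargs))
    return tokens

# Toy stand-in for a trained model: uniform logits over a 5-token vocab.
toy_model = lambda tokens: np.zeros(5)
print(generate(toy_model, [0], 8, temperature=0.9, top_k=3))
```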
The synthesizer is a type of electronic musical instrument now widely used in modern music production and sound design. Each parameter configuration of a synthesizer produces a unique timbre and can be viewed as a unique instrument. Estimating the parameter configuration that best restores a given timbre is an important yet complicated problem, i.e., the synthesizer parameter estimation problem. We propose Sound2Synth, a multi-modal deep-learning-based pipeline, together with a network structure called Prime-Dilated Convolution (PDC), specially designed to solve this problem. Our method achieves not only state-of-the-art but also the first real-world-applicable results on Dexed, a popular FM synthesizer.
Deep generative models, such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), have been employed widely in computational creativity research. However, such models discourage out-of-distribution generation to avoid spurious sample generation, thereby limiting their creativity. Thus, incorporating research on human creativity into generative deep learning techniques presents an opportunity to make their outputs more compelling and human-like. As generative models directed toward creativity research emerge, machine learning-based surrogate metrics for characterizing their creative output become imperative. We propose group-based subset scanning to identify, quantify, and characterize creative processes by detecting a subset of anomalous node activations in the hidden layers of generative models. Our experiments on standard image benchmarks and their ``creatively generated'' variants reveal that the distribution of the proposed subset scores is more useful for detecting novelty in creative processes in the activation space than in the pixel space. Further, we find that creative samples generate larger subsets of anomalies than normal or non-creative samples across datasets. The node activations highlighted during the creative decoding process differ from those responsible for normal sample generation. Lastly, we assess whether the images from the subsets selected by our method were also found creative by human evaluators, presenting a link between creativity perception in humans and node activations within deep neural nets.
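For readers unfamiliar with subset scanning, the following is a simplified, hedged sketch of non-parametric scanning over hidden activations: per-node empirical p-values against a background of normal activations, scored with a Berk-Jones-style statistic. The authors' group-based formulation and optimization are more elaborate and are not reproduced here.

```python
# Simplified non-parametric subset scan over hidden activations.
import numpy as np

def empirical_pvalues(background, observed):
    """Per-node p-value: fraction of background activations >= observed."""
    # background: (n_samples, n_nodes); observed: (n_nodes,)
    return (background >= observed).mean(axis=0)

def berk_jones(n_alpha, n, alpha):
    """Berk-Jones statistic: KL divergence between the observed and the
    expected fraction of p-values below threshold alpha."""
    f = n_alpha / n
    if f <= alpha:
        return 0.0                       # no excess of small p-values
    if f >= 1.0:
        return n * np.log(1.0 / alpha)
    return n * (f * np.log(f / alpha) + (1 - f) * np.log((1 - f) / (1 - alpha)))

def scan(background, observed, alphas=np.linspace(0.01, 0.5, 50)):
    """Threshold scan: return the best-scoring subset of nodes."""
    p = empirical_pvalues(background, observed)
    best_score, best_subset = 0.0, np.array([], dtype=int)
    for alpha in alphas:
        subset = np.where(p <= alpha)[0]  # nodes anomalous at this threshold
        score = berk_jones(len(subset), len(p), alpha)
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score

rng = np.random.default_rng(0)
background = rng.normal(size=(1000, 128))   # activations on normal samples
observed = rng.normal(size=128)             # activations on a test sample
observed[:10] += 3.0                        # plant a small anomalous subset
subset, score = scan(background, observed)
print(sorted(subset.tolist()), round(score, 2))
```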
Creating art is often viewed as a uniquely human endeavor. In this paper, we introduce a multi-conditional Generative Adversarial Network (GAN) approach trained on large amounts of human paintings to synthesize realistic-looking paintings that emulate human art. Our approach is based on the StyleGAN neural network architecture but incorporates a custom multi-conditional control mechanism that provides fine-grained control over characteristics of the generated paintings, e.g., with regard to the perceived emotion evoked in a spectator. We also investigate several evaluation techniques tailored to multi-conditional generation.
To best assist human designers with different styles, Machine Learning (ML) systems need to be able to adapt to them. However, there has been relatively little prior work on how and when to best adapt an ML system to a co-designer. In this paper, we present threshold designer adaptation: a novel method for adapting a creative ML model to an individual designer. We evaluate our approach with a human subject study using a co-creative rhythm game design tool. We find that designers prefer our proposed method and produce higher-quality content with it than with an existing baseline.
Story sifting (or story recognition) allows for the exploration of events, stories, and patterns that emerge from simulated storyworlds. The goal of this work is to reduce the authoring burden of creating sifting queries. In this paper, we use the event traces of simulated storyworlds to create Dynamic Character Networks that track the changing relationship scores between characters in a simulation. These networks allow the fortunes of any two characters to be plotted against time as a story arc. Similarity scores between story arcs from the simulation and a user's query arc can be calculated using the Dynamic Time Warping algorithm. Events corresponding to the story arc that best matches the query arc can then be returned to the user, thus providing an intuitive means for users to sift a variety of stories without coding a search query. These components are implemented in our experimental prototype ARC SIFT. The results of a user study support our expectation that ARC SIFT is an intuitive and accurate tool that allows human users to sift stories out of a larger chronicle of events emerging from a simulated storyworld.
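Since the matching step rests on a standard algorithm, a textbook dynamic time warping sketch is included below to illustrate how a query arc can be compared against simulated arcs; ARC SIFT's actual implementation may differ.

```python
# Textbook DTW distance between two story arcs (sequences of
# relationship scores), used here to rank arcs against a query arc.
import numpy as np

def dtw_distance(arc_a, arc_b):
    n, m = len(arc_a), len(arc_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(arc_a[i - 1] - arc_b[j - 1])
            # Extend the cheapest of the three admissible alignments.
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Rank simulated arcs by similarity to a hand-drawn query arc.
query = [0.0, 0.2, 0.8, 0.5, -0.3]
arcs = {"rivals_to_friends": [0.1, 0.3, 0.7, 0.6, -0.2],
        "steady_allies":     [0.5, 0.5, 0.6, 0.5, 0.5]}
print(min(arcs, key=lambda k: dtw_distance(query, arcs[k])))
```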
Universal video style transfer aims to migrate arbitrary styles to input videos. However, maintaining the temporal consistency of videos while achieving high-quality arbitrary style transfer remains a difficult problem. To resolve this dilemma, in this paper, we propose CSBNet, which involves three key modules: 1) the Crystallization (Cr) Module, which generates several orthogonal crystal nuclei, representing hierarchical stability-aware content and style components, from raw VGG features; 2) the Separation (Sp) Module, which separates these crystal nuclei to generate stability-enhanced content and style features; 3) the Blending (Bd) Module, which cross-blends these stability-enhanced content and style features to produce more stable, higher-quality stylized videos. Moreover, we introduce a new pair of component enhancement losses to improve network performance. Extensive qualitative and quantitative experiments demonstrate the effectiveness and superiority of our CSBNet. Compared with state-of-the-art models, it not only produces temporally more consistent and stable results for arbitrary videos but also achieves higher-quality stylizations for arbitrary images.
Generating images that fit a given text description using machine learning has improved greatly with the release of technologies such as the CLIP image-text encoder; however, current methods lack artistic control over the style of the generated image. We present an approach for generating styled drawings for a given text description, where a user can specify a desired drawing style using a sample image. Inspired by a theory in art that style and content are generally inseparable during the creative process, we propose a coupled approach, known here as StyleCLIPDraw, whereby the drawing is generated by optimizing for style and content simultaneously throughout the process, as opposed to applying style transfer after the content is created. Based on human evaluation, the styles of images generated by StyleCLIPDraw are strongly preferred to those of the sequential approach. Although the quality of content generation degrades for certain styles, StyleCLIPDraw is far preferred overall when both content and style are considered, indicating the importance of the style, look, and feel of machine-generated images to people, and suggesting that style is coupled with content in the drawing process itself. Our code, a demonstration, and style evaluation data are publicly available.
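The coupled objective can be illustrated schematically: a single optimization loop updates drawing parameters against style and content losses at once. Both losses below are toy placeholders standing in for StyleCLIPDraw's CLIP text-image loss and style-image loss; this is not the released code.

```python
# Schematic of coupled style+content optimization (vs. sequential).
import torch

# Toy placeholder losses (assumptions for illustration only).
def content_loss(params):
    return ((params - 1.0) ** 2).mean()   # pulls strokes toward "content"

def style_loss(params):
    return ((params + 1.0) ** 2).mean()   # pulls strokes toward "style"

params = torch.zeros(16, requires_grad=True)   # e.g. stroke parameters
opt = torch.optim.Adam([params], lr=0.1)
for step in range(200):
    opt.zero_grad()
    # Coupled objective: style and content shape the drawing together at
    # every step, rather than content first and style transfer afterwards.
    loss = content_loss(params) + style_loss(params)
    loss.backward()
    opt.step()
print(params.mean().item())   # settles between the two attractors (~0.0)
```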
Character recognition models rely substantially on image datasets that maintain a balance of class samples. However, achieving class balance is particularly challenging in ancient manuscript contexts, as character instances may be significantly limited. In this paper, we present findings from a study that assesses the efficacy of using synthetically generated character instances to augment an existing dataset of ancient Greek character images for use in machine learning models. We complement our model exploration by engaging professional papyrologists to better understand the practical opportunities afforded by synthetic instances. Our results suggest that synthetic instances improve model performance for limited character classes, and may have unexplored effects on character classes more generally. We also find that trained papyrologists are unable to distinguish between synthetic and non-synthetic images and regard synthetic instances as valuable assets for professional and educational contexts. We conclude by discussing the practical implications of our research.
Gram-based and patch-based approaches are two important research lines in style transfer. Recent diversified Gram-based methods can produce multiple and diverse stylized outputs for the same content and style images. However, diversity in patch-based methods, another widespread research interest, remains challenging due to the stereotyped style-swapping process based on nearest-patch matching. To resolve this dilemma, in this paper, we dive into the crux of existing patch-based methods and propose a universal and efficient module, termed DivSwapper, for diversified patch-based arbitrary style transfer. The key insight is that neural patches with higher activation values can contribute more to diversity. DivSwapper is plug-and-play and can be easily integrated into existing patch-based and Gram-based methods to generate diverse results for arbitrary styles. We conduct theoretical analyses and extensive experiments to demonstrate the effectiveness of our method; compared with state-of-the-art algorithms, it shows superiority in diversity, quality, and efficiency.
Dance choreography for a piece of music is a challenging task: the choreographer must creatively present distinctive stylistic dance elements while taking into account the musical theme and rhythm. It has been tackled by approaches such as similarity retrieval, sequence-to-sequence modeling, and generative adversarial networks, but the generated dance sequences often fall short in motion realism, diversity, and music consistency. In this paper, we propose a Music-to-Dance with Optimal Transport Network (MDOT-Net) for learning to generate 3D dance choreographies from music. We introduce an optimal transport distance for evaluating the authenticity of the generated dance distribution and a Gromov-Wasserstein distance to measure the correspondence between the dance distribution and the input music. This gives a well-defined and non-divergent training objective that mitigates the limitations of standard GAN training, which is frequently plagued by instability and divergent generator losses. Extensive experiments demonstrate that MDOT-Net can synthesize realistic and diverse dances that achieve an organic unity with the input music, reflecting the shared intentionality and matching the rhythmic articulation. Sample results are found at https://www.youtube.com/watch?v=dErfBkrlUO8.
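As background for the optimal transport objective, here is a hedged sketch of an entropy-regularized OT distance computed with Sinkhorn iterations; MDOT-Net's full objective, including the Gromov-Wasserstein term, is more involved and not reproduced here.

```python
# Entropy-regularized optimal transport via Sinkhorn iterations.
import numpy as np

def sinkhorn_distance(x, y, eps=0.1, n_iters=200):
    """OT cost between two empirical samples x, y of shape (n, d)."""
    a = np.full(len(x), 1.0 / len(x))              # uniform source weights
    b = np.full(len(y), 1.0 / len(y))              # uniform target weights
    cost = np.linalg.norm(x[:, None] - y[None, :], axis=-1) ** 2
    cost = cost / cost.max()                       # normalize for stability
    K = np.exp(-cost / eps)                        # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):                       # alternating projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]             # transport plan
    return float((plan * cost).sum())

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, (64, 8))   # e.g. real dance features
fake_feats = rng.normal(0.5, 1.0, (64, 8))   # e.g. generated dance features
print(sinkhorn_distance(real_feats, fake_feats))
```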
In this paper, we study the graphic layout generation problem of producing high-quality visual-textual presentation designs for given images. We note that image compositions, which contain not only global semantics but also spatial information, largely affect layout results. Hence, we propose a deep generative model, dubbed composition-aware graphic layout GAN (CGL-GAN), to synthesize layouts based on the global and spatial visual contents of input images. To obtain training images from images that already contain manually designed graphic layouts, previous work suggests masking design elements (e.g., texts and embellishments) as model inputs, which inevitably leaves hints of the ground truth. We study the misalignment between the training inputs (with hint masks) and test inputs (without masks), and design a novel domain alignment module (DAM) to narrow this gap. For training, we build a large-scale layout dataset consisting of 60,548 advertising posters with annotated layout information. To evaluate the generated layouts, we propose three novel metrics based on aesthetic intuitions. Through both quantitative and qualitative evaluations, we demonstrate that the proposed model can synthesize high-quality graphic layouts according to image compositions. The data and code will be available at https://github.com/minzhouGithub/CGL-GAN.
Artistic style transfer is the task of synthesizing content images with learned artistic styles. Recent studies have shown the potential of Generative Adversarial Networks (GANs) for producing artistically rich stylizations. Despite the promising results, they usually cannot control the style degree of the generated images, which is inflexible and limits their applicability in practice. To address this issue, we propose a novel method that, for the first time, allows adjusting the style degree for existing GAN-based artistic style transfer frameworks in real time after training. Our method introduces two novel modules into existing GAN-based artistic style transfer frameworks: a Style Scaling Injection (SSI) module and a Style Degree Interpretation (SDI) module. The SSI module accepts the value of a Style Degree Factor (SDF) as input and outputs parameters that scale the feature activations in existing models, offering control signals to alter the style degree of the stylizations. The SDI module interprets the output probabilities of a multi-scale content-style binary classifier as style degrees, providing a mechanism to parameterize the style degree of the stylizations. Moreover, we show that after training, our method enables existing GAN-based frameworks to produce over-stylizations. The proposed method can be applied to many existing GAN-based artistic style transfer frameworks with marginal extra training overhead and modification. Extensive qualitative evaluations on two typical GAN-based style transfer models demonstrate its effectiveness in providing style degree control.
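A minimal PyTorch sketch of the idea behind the SSI module follows: a scalar SDF is mapped to per-channel scales that modulate feature activations. The MLP design and layer sizes here are assumptions, not the paper's architecture.

```python
# Sketch: scalar style-degree factor -> per-channel feature scales.
import torch
import torch.nn as nn

class StyleScalingInjection(nn.Module):
    def __init__(self, channels, hidden=64):
        super().__init__()
        # Tiny MLP mapping the SDF to positive per-channel scales.
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, channels), nn.Softplus())

    def forward(self, features, sdf):
        # features: (B, C, H, W); sdf: (B, 1), >1 pushes toward over-stylization
        scale = self.mlp(sdf).unsqueeze(-1).unsqueeze(-1)
        return features * scale

feats = torch.randn(2, 256, 32, 32)
sdf = torch.tensor([[0.3], [1.5]])     # second sample: stronger stylization
out = StyleScalingInjection(256)(feats, sdf)
print(out.shape)  # torch.Size([2, 256, 32, 32])
```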
Air pollution monitoring and management is one of the key challenges for urban sectors, especially in developing countries. Measuring pollution levels requires significant investment in reliable and durable instrumentation and subsequent maintenance. On the other hand, researchers have made many attempts to develop image-based pollution measurement models, which have shown significant results and established the feasibility of the idea. However, taking image-level models to a city-level system presents new challenges, including scarcity of high-quality annotated data and a high amount of label noise. In this paper, we present a low-cost, end-to-end system for learning pollution maps using images captured through a mobile phone. We demonstrate our system for parts of New Delhi and Ghaziabad. We use transfer learning to overcome the problem of data scarcity. We investigate the effects of label noise in detail and introduce the metric of in-interval accuracy to evaluate our models in the presence of noise. We use distributed averaging to learn pollution maps and mitigate the effects of noise to some extent. We also develop haze-based interpretable models with performance comparable to mainstream models. With only 382 images from Delhi and Ghaziabad and a single-scene dataset from Beijing and Shanghai, we achieve a mean absolute error of 44 µg/m³ in PM2.5 concentration on a test set of 267 images and an in-interval accuracy of 67%. Going further, we learn pollution maps with a mean absolute error as low as 35 µg/m³ and an in-interval accuracy as high as 74%, significantly mitigating the image models' error. We also show that the noise in pollution labels arising from unreliable sensing instrumentation forms a significant barrier to realizing an ideal air pollution monitoring system. Our codebase can be found at https://github.com/ankitbha/pollution_with_images.
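The abstract does not spell out the definition of in-interval accuracy; one plausible reading, assumed in the sketch below, is the fraction of predictions falling within a fixed tolerance of the noisy label.

```python
# Assumed reading of "in-interval accuracy": a prediction counts as
# correct if it lies within +/- tol µg/m³ of the (noisy) PM2.5 label.
import numpy as np

def in_interval_accuracy(y_true, y_pred, tol=50.0):
    """Fraction of predictions within the tolerance interval."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((np.abs(y_pred - y_true) <= tol).mean())

labels = np.array([120.0, 310.0, 85.0, 240.0])   # noisy sensor readings
preds  = np.array([100.0, 280.0, 160.0, 235.0])
print(in_interval_accuracy(labels, preds))        # 0.75
```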
Air pollution poses a serious threat to sustainable environmental conditions in the 21st century. Its importance in determining health and living standards in urban settings is only expected to increase with time. Various factors, ranging from artificial emissions to natural phenomena, are known to be primary causal agents or influencers behind rising air pollution levels. However, the lack of large-scale data on the major artificial and natural factors has hindered research on the causes and relations governing the variability of the different air pollutants. Through this work, we introduce a large-scale city-wise dataset for exploring the relationships among these agents over a long period of time. We also introduce cosSquareFormer, a transformer-based model for pollutant-level estimation and forecasting. Our model outperforms most benchmark models on this task. We also analyze and explore the dataset through our model and other methodologies to draw important inferences that enable a deeper understanding of the dynamics of the causal agents. Through this paper, we seek to provide a strong set of foundations for further research in this domain, which will demand critical attention in the near future.
Chronic disease management involves frequent invasive lab procedures so that clinicians can determine the best course of treatment for these patients. However, patients are often put off by these invasive procedures and do not follow their appointment schedules, resulting in poor management of their chronic conditions and unnecessary disease complications. An AI system that can personalize the prediction of individual patients' lab test responses would enable clinicians to titrate medications to achieve the desired therapeutic outcome. Accurate prediction of lab test response is challenging because these patients typically have co-morbidities, and their treatments might influence the target lab test response. To address this, we model the complex interactions among different medications, diseases, lab test responses, and fine-grained dosage information to learn a strong patient representation. Together with information from similar patients and external knowledge such as drug-lab and diagnosis-lab interactions, we design a system called KALP to perform personalized prediction of a patient's response for a target lab test and to identify the top factors influencing the prediction. Experimental results on real-world datasets demonstrate the effectiveness of KALP in reducing prediction errors by a significant margin. Case studies show that the identified factors are consistent with clinicians' understanding.
Movement and pose assessment of newborns lets experienced pediatricians predict neurodevelopmental disorders, allowing early intervention for related diseases. However, most recent AI approaches to human pose estimation focus on adults, and public benchmarks for infant pose estimation are lacking. In this paper, we fill this gap by proposing an infant pose dataset and AggPose, a Deep Aggregation Vision Transformer for human pose estimation, which introduces a fast-to-train, fully transformer-based framework without using convolution operations to extract features in the early stages. It generalizes Transformer + MLP to high-resolution deep-layer aggregation within feature maps, thus enabling information fusion between different vision levels. We pre-train AggPose on the COCO pose dataset and apply it to our newly released large-scale infant pose estimation dataset. The results show that AggPose effectively learns multi-scale features among different resolutions and significantly improves the performance of infant pose estimation. AggPose outperforms the hybrid models HRFormer and TokenPose on the infant pose estimation dataset. Moreover, AggPose outperforms HRFormer by 0.8 AP on average on COCO val pose estimation. Our code is available at github.com/SZAR-LAB/AggPose.
Kidney exchanges allow patients with end-stage renal disease to find a lifesaving living donor by way of an organized market. However, not all patients are equally easy to match, nor are all donor organs of equal quality---some patients are matched within weeks, while others may wait for years with no match offers at all. We propose the first decision-support tool for kidney exchange that takes as input the biological features of a patient-donor pair and returns (i) the probability of being matched prior to expiry and, conditioned on a match outcome, (ii) the waiting time for, and (iii) the organ quality of, the matched transplant. This information may be used to inform medical and insurance decisions. We predict all quantities (i, ii, iii) exclusively from match records that are readily available in any kidney exchange, using a quantile random forest approach. To evaluate our approach, we developed two state-of-the-art realistic simulators based on data from the United Network for Organ Sharing that sample from the training and test distributions for these learning tasks---in our application these distributions are distinct. We analyze the distributional shift through a theoretical lens, and show that the two distributions converge as the kidney exchange nears steady state. We then show that our approach produces clinically promising estimates using simulated data. Finally, we show how our approach, in conjunction with tools from the model-explainability literature, can be used to calibrate and detect bias in matching policies.
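As a tooling illustration, the sketch below fits a quantile random forest on synthetic match records using the open-source quantile-forest package (an assumed implementation choice; the paper does not name its library) and predicts a median waiting time with an 80% predictive interval.

```python
# Quantile random forest sketch on synthetic match-record stand-ins.
import numpy as np
from quantile_forest import RandomForestQuantileRegressor

rng = np.random.default_rng(0)
# Synthetic stand-ins for biological pair features and observed waiting
# times (days) drawn from a skewed distribution, as in real waitlists.
X = rng.normal(size=(500, 6))
wait_days = np.exp(1.0 + X[:, 0] + 0.5 * rng.normal(size=500)) * 30

qrf = RandomForestQuantileRegressor(n_estimators=200, random_state=0)
qrf.fit(X[:400], wait_days[:400])

# Median waiting time plus an 80% predictive interval per new pair.
q10, q50, q90 = qrf.predict(X[400:], quantiles=[0.1, 0.5, 0.9]).T
print(q50[:3])                                   # point estimates
print(np.stack([q10[:3], q90[:3]], axis=1))      # uncertainty bands
```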
In this paper, we analyze the responses of three major US cable news networks to four seminal policing-related events in the US spanning a thirteen-month period: the murder of George Floyd by police officer Derek Chauvin, the Capitol riot, Chauvin's conviction, and his sentencing. We cast the problem of aggregate stance mining as a natural language inference task and construct an active learning pipeline for robust textual entailment prediction. Via a substantial corpus of 34,710 news transcripts, our analyses reveal that the partisan divide in the viewership of these three outlets is reflected in the networks' coverage of these momentous events. In addition, we release a sentence-level, domain-specific text entailment dataset on policing consisting of 2,276 annotated instances.
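The entailment inference step can be illustrated with an off-the-shelf MNLI model via the Hugging Face zero-shot pipeline, as below; the paper fine-tunes its own entailment model with active learning, so this is only a stand-in for the inference interface.

```python
# Entailment-based stance scoring with an off-the-shelf MNLI model.
from transformers import pipeline

# The zero-shot pipeline scores each candidate label by entailment of
# the hypothesis "This example is {label}." against the transcript text.
nli = pipeline("zero-shot-classification", model="roberta-large-mnli")

sentence = ("The conviction was described on air as a long-overdue "
            "step toward police accountability.")
labels = ["supportive of police accountability",
          "critical of police accountability"]

result = nli(sentence, candidate_labels=labels)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{score:.2f}  {label}")
```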
Monitoring vegetation productivity at extremely fine resolutions is valuable for real-world agricultural applications, such as detecting crop stress and providing early warning of food insecurity. Solar-Induced Chlorophyll Fluorescence (SIF) provides a promising way to directly measure plant productivity from space. However, satellite SIF observations are only available at a coarse spatial resolution, making it impossible to monitor how individual crop types or farms are doing. This poses a challenging coarsely-supervised regression (or downscaling) task; at training time, we only have SIF labels at a coarse resolution (3km), but we want to predict SIF at much finer spatial resolutions (e.g. 30m, a 100x increase). We also have additional fine-resolution input features, but the relationship between these features and SIF is unknown. To address this, we propose Coarsely-Supervised Smooth U-Net (CS-SUNet), a novel method for this coarse supervision setting. CS-SUNet combines the expressive power of deep convolutional networks with novel regularization methods based on prior knowledge (such as a smoothness loss) that are crucial for preventing overfitting. Experiments show that CS-SUNet resolves fine-grained variations in SIF more accurately than existing methods.
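A hedged sketch of the coarse-supervision idea follows: fine-resolution predictions are average-pooled to the coarse grid and matched to the coarse SIF label, with a total-variation-style smoothness penalty on neighboring fine pixels. The loss weights and exact regularizers of CS-SUNet are assumptions here.

```python
# Coarse-supervision loss with a smoothness regularizer (PyTorch).
import torch
import torch.nn.functional as F

def coarse_supervision_loss(fine_pred, coarse_label, factor, smooth_weight=0.1):
    """Match coarse-grid averages of fine predictions to the coarse label,
    plus a smoothness penalty on adjacent fine pixels."""
    # fine_pred: (B, 1, H, W); coarse_label: (B, 1, H//factor, W//factor)
    coarse_pred = F.avg_pool2d(fine_pred, kernel_size=factor)
    fidelity = F.mse_loss(coarse_pred, coarse_label)
    tv = (fine_pred[..., 1:, :] - fine_pred[..., :-1, :]).abs().mean() \
       + (fine_pred[..., :, 1:] - fine_pred[..., :, :-1]).abs().mean()
    return fidelity + smooth_weight * tv

fine = torch.rand(2, 1, 100, 100, requires_grad=True)  # 100x100 fine pixels
coarse = torch.rand(2, 1, 1, 1)                        # one coarse SIF cell
loss = coarse_supervision_loss(fine, coarse, factor=100)
loss.backward()
print(loss.item())
```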
The World Health Organization (WHO) has emphasized the importance of significantly accelerating suicide prevention efforts to meet the United Nations' Sustainable Development Goal (SDG) target for 2030. In this paper, we present an end-to-end multitask system to address a novel task: detecting two interpersonal risk factors of suicide, Perceived Burdensomeness (PB) and Thwarted Belongingness (TB), from suicide notes. We also introduce a manually translated code-mixed suicide notes corpus, CoMCEASE-v2.0, based on the benchmark CEASE-v2.0 dataset, annotated with temporal orientation, PB, and TB labels. We exploit the temporal orientation and emotion information in the suicide notes to boost overall performance. For a comprehensive evaluation of our proposed method, we compare it to several state-of-the-art approaches on the existing CEASE-v2.0 dataset and the newly introduced CoMCEASE-v2.0 dataset. Empirical evaluation suggests that temporal and emotional information can substantially improve the detection of PB and TB.
AI recommendation techniques provide users with personalized services, feeding them information they may be interested in. Increasing personalization has raised hypotheses about the "filter bubble" and "echo chamber" effects. To investigate these hypotheses, in this paper, we inspect the impact of recommendation algorithms on the formation of two types of ideological isolation, individual isolation and topological isolation, corresponding to the filter bubble and echo chamber effects, respectively. Simulation results show that AI recommendation strategies severely facilitate the evolution of the filter bubble effect, leading users to become ideologically isolated at an individual level. At a topological level, however, recommendation algorithms prove capable of connecting individuals with dissimilar users or recommending diverse topics, exposing users to more diverse viewpoints. This research sheds light on the ability of AI recommendation strategies to temper ideological isolation at the topological level.
In this paper, we study an explainable COVID-19 misinformation detection problem, where the goal is to accurately identify misleading COVID-19 posts on social media and explain them with natural language explanations (NLEs). Our problem is motivated by the limitations of current explainable misinformation detection approaches, which cannot provide NLEs for COVID-19 posts due to the lack of sufficient professional COVID-19 knowledge for supervision. To address this limitation, we develop CEA-COVID, a crowd-expert-AI framework that jointly exploits the common logical reasoning ability of online crowd workers and the professional knowledge of COVID-19 experts to generate NLEs for detecting and explaining COVID-19 misinformation. We evaluate CEA-COVID using two public COVID-19 misinformation datasets on social media. Results demonstrate that CEA-COVID outperforms existing explainable misinformation detection models in terms of both explainability and detection accuracy.
Evaluating the factors that most strongly affect citizens' happiness is beneficial to a wide range of economic and political policy-making in most countries. Benefiting from the efficiency of regression models, previous efforts by sociology scholars have analyzed the effects of happiness factors with high interpretability. However, restricted by their research concerns, these studies consider only subsets of factors modeled as linear functions. Recently, deep learning has shown promising prediction accuracy but poses challenges for interpretability. To this end, we introduce the Shapley value, which rests on solid theory for interpreting factor contributions, to work with deep learning models while taking into account interactions between multiple factors. The proposed solution computes the Shapley value of a factor, i.e., its average contribution to the prediction across different coalitions, based on coalitional game theory. To evaluate the interpretability of our solution, we conduct experiments on a Chinese General Social Survey (CGSS) questionnaire dataset. Compared against systematic reviews, the Shapley value results are highly consistent with academic studies in social science, implying that our solution for citizens' happiness prediction has both theoretical and practical implications.
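The underlying computation can be sketched with Monte Carlo permutation sampling of marginal contributions, as below, using a mean-imputation baseline; the paper's exact estimator or tooling is not specified in the abstract, so this shows only the general idea.

```python
# Monte-Carlo Shapley value estimation by permutation sampling.
import numpy as np

def shapley_values(predict, x, background, n_perms=200, rng=None):
    """Average marginal contribution of each factor to predict(x),
    relative to a mean-imputation baseline."""
    rng = rng or np.random.default_rng(0)
    d = len(x)
    baseline = background.mean(axis=0)      # values representing "absent"
    phi = np.zeros(d)
    for _ in range(n_perms):
        order = rng.permutation(d)
        z = baseline.copy()
        prev = predict(z[None, :])[0]
        for j in order:                      # reveal factors one at a time
            z[j] = x[j]
            cur = predict(z[None, :])[0]
            phi[j] += cur - prev             # marginal contribution of j
            prev = cur
    return phi / n_perms

# Toy model with an interaction between factors 0 and 1.
predict = lambda X: X[:, 0] * X[:, 1] + 2.0 * X[:, 2]
background = np.zeros((10, 3))
x = np.array([1.0, 2.0, 3.0])
print(shapley_values(predict, x, background))   # approx [1.0, 1.0, 6.0]
```

Note that the estimated values sum to predict(x) minus the baseline prediction (here 8.0), the efficiency property that makes Shapley attributions well suited to factor-contribution analysis.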