Social media develops rapidly because of its power to spread new information, but this same property allows rumors to circulate widely, and detecting rumors amid such massive amounts of information has become an arduous challenge. Deep learning methods such as the Recursive Neural Network (RvNN) have therefore been applied to discover rumors through the way they spread. However, these methods only take into account the patterns of deep propagation and ignore the structures of wide dispersion, even though propagation and dispersion are two crucial characteristics of rumors. In this paper, we propose a novel bi-directional graph model, named Bi-Directional Graph Convolutional Networks (Bi-GCN), to explore both characteristics by operating on both the top-down and bottom-up propagation of rumors. It leverages a GCN with a top-down directed graph of rumor spreading to learn the patterns of rumor propagation, and a GCN with the opposite directed graph of rumor diffusion to capture the structures of rumor dispersion. Moreover, the information from the source post is incorporated into each GCN layer to enhance the influence of the rumor's root. Encouraging empirical results on several benchmarks confirm the superiority of the proposed method over state-of-the-art approaches.
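To make the two directions concrete, here is a minimal sketch of one Bi-GCN-style layer (the adjacency normalization, layer sizes, and the `bi_gcn_layer` helper are illustrative assumptions, not the authors' implementation):

```python
# Minimal sketch of a bidirectional GCN layer over a rumor tree (illustrative).
import numpy as np

def gcn_layer(A_hat, X, W):
    """One graph-convolution step: aggregate neighbors, then transform."""
    return np.maximum(A_hat @ X @ W, 0.0)  # ReLU activation

def bi_gcn_layer(A_td, X, W_td, W_bu, x_root):
    """Top-down and bottom-up convolutions over the rumor tree.

    A_td   : (n, n) normalized top-down adjacency (propagation direction)
    A_td.T : the reverse (bottom-up) graph capturing dispersion
    x_root : source-post features, concatenated to every node to
             enhance the influence of the rumor's root.
    """
    n = X.shape[0]
    X_rooted = np.hstack([X, np.tile(x_root, (n, 1))])  # root enhancement
    H_td = gcn_layer(A_td, X_rooted, W_td)      # propagation patterns
    H_bu = gcn_layer(A_td.T, X_rooted, W_bu)    # dispersion structures
    return np.hstack([H_td, H_bu])              # fuse both directions

# Toy example: a 4-node rumor tree rooted at node 0.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]], float)
A_hat = A + np.eye(4)                        # add self-loops
A_hat /= A_hat.sum(axis=1, keepdims=True)    # row-normalize
X = rng.normal(size=(4, 8))
W_td = rng.normal(size=(16, 4)); W_bu = rng.normal(size=(16, 4))
H = bi_gcn_layer(A_hat, X, W_td, W_bu, X[0])
print(H.shape)  # (4, 8)
```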
Massive electronic health records (EHRs) have enabled the learning of accurate patient representations that support various predictive health applications. In contrast, doctor representations have not been well studied, despite the pivotal roles doctors play in healthcare. How do we construct the right doctor representations? How can doctor representations be used to solve important health analytic problems? In this work, we study the problem of clinical trial recruitment: identifying the right doctors to help conduct a trial, based on the trial description and the patient EHR data of those doctors. We propose Doctor2Vec, which simultaneously learns 1) doctor representations from EHR data and 2) trial representations from trial descriptions and categorical information. In particular, Doctor2Vec utilizes a dynamic memory network in which the doctor's experience with patients is stored in a memory bank, and the network dynamically assigns weights based on the trial representation via an attention mechanism. Validated on large real-world trial and EHR data including 2,609 trials, 25K doctors and 430K patients, Doctor2Vec demonstrated improved performance over the best baseline by up to 8.7% in PR-AUC. We also demonstrated that the Doctor2Vec embedding can be transferred to data-insufficient settings, improving PR-AUC by 13.7% for trial recruitment in a less populated, newly explored country and by 8.1% for rare diseases.
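A minimal sketch of the memory-attention read described above (the memory shapes and the `attend_memory` helper are illustrative assumptions, not the authors' code):

```python
# Illustrative memory-attention read: a trial query attends over a
# doctor's patient-experience memory bank to form a doctor vector.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend_memory(memory, trial_query):
    """memory      : (m, d) rows summarizing the doctor's patient experience
       trial_query : (d,)   embedding of the trial description/categories"""
    weights = softmax(memory @ trial_query)   # relevance of each memory slot
    return weights @ memory                   # weighted read = doctor vector

rng = np.random.default_rng(0)
memory = rng.normal(size=(25, 64))   # 25 patient-experience slots
query = rng.normal(size=64)          # trial representation
doctor_vec = attend_memory(memory, query)
print(doctor_vec.shape)  # (64,)
```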
Recent advances in computer-assisted learning systems and the availability of open educational resources promise a pathway to providing cost-efficient, high-quality education to large masses of learners. One of the most ambitious use cases of computer-assisted learning is to build a lifelong learning recommendation system. Unlike short-term courses, lifelong learning presents unique challenges: it requires sophisticated recommendation models that account for a wide range of factors, such as learners' background knowledge or the novelty of the material, while effectively maintaining the knowledge states of masses of learners over significantly long periods of time (ideally, a lifetime). This work presents the foundations of a dynamic, scalable and transparent recommendation system for education that models learners' knowledge from implicit data in the form of engagement with open educational resources. We i) use a text ontology based on Wikipedia to automatically extract the knowledge components of educational resources and ii) propose a set of online Bayesian strategies inspired by the well-known areas of item response theory and knowledge tracing. Our proposal, TrueLearn, recommends material for which the learner has enough background knowledge (so they are able to understand and learn from it) and which has enough novelty to improve the learner's knowledge of the subject and keep them engaged. We further construct a large open dataset of educational video lectures and test the performance of the proposed algorithms, which show clear promise towards building an effective educational recommendation system.
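As an illustration of the knowledge-tracing flavor of such online Bayesian strategies, here is a minimal classic BKT-style update of a mastery belief from engagement signals (the parameter names and values are illustrative assumptions, not TrueLearn itself):

```python
# Illustrative Bayesian knowledge-tracing update from engagement events.
def bkt_update(p_know, engaged, p_guess=0.2, p_slip=0.1, p_learn=0.15):
    """Update belief that a knowledge component is mastered after one event.

    p_know  : prior probability the component is mastered
    engaged : whether the learner engaged with (watched) the resource
    """
    if engaged:  # evidence consistent with mastery, up to guessing
        posterior = p_know * (1 - p_slip) / (
            p_know * (1 - p_slip) + (1 - p_know) * p_guess)
    else:        # evidence against mastery, up to slipping
        posterior = p_know * p_slip / (
            p_know * p_slip + (1 - p_know) * (1 - p_guess))
    return posterior + (1 - posterior) * p_learn  # chance of learning now

p = 0.3
for engaged in [True, True, False, True]:
    p = bkt_update(p, engaged)
    print(round(p, 3))
```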
With the proliferation of GPS-based data (e.g., routes and trajectories), it is of great importance to enable real-time route search and recommendation. We define and study a novel Continuous Route-Search-by-Location (C-RSL) problem that enables real-time route search by location for a large number of users over route data streams. Given a set of C-RSL queries, where each query q contains a set of places q.O to visit and a threshold q.θ, we continuously feed each query q with routes whose similarity to q.O is no less than q.θ. We also extend our proposal to the top-k C-RSL problem, where each query continuously maintains the k most similar routes. The C-RSL problem targets a variety of applications with real-time demand, including real-time route planning, ridesharing, and other location-based services. To enable efficient route matching for a large number of C-RSL queries, we develop novel parallel route matching algorithms with good time complexity. Extensive experiments on real data offer insight into the performance of our algorithms, indicating that our proposal achieves high efficiency and scalability.
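A minimal sketch of how a stream of routes might be matched against standing C-RSL queries (the set-overlap similarity is an illustrative stand-in; the paper's exact similarity measure may differ):

```python
# Illustrative continuous matching of streaming routes against C-RSL queries.
def similarity(route_places, query_places):
    """Fraction of the query's places covered by the route."""
    return len(set(route_places) & set(query_places)) / len(query_places)

def match_stream(routes, queries):
    """queries: list of (places_to_visit, threshold); yields matches."""
    for route in routes:                      # routes arrive as a stream
        for qid, (places, theta) in enumerate(queries):
            if similarity(route, places) >= theta:
                yield qid, route              # feed query qid with this route

routes = [["museum", "park", "cafe"], ["park", "mall"]]
queries = [(["park", "cafe"], 0.5), (["museum", "zoo"], 1.0)]
for qid, route in match_stream(routes, queries):
    print(qid, route)
```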
Pricing is essential in optimizing transportation resource allocation, and congestion pricing is widely used to reduce urban traffic congestion. We propose and investigate a novel Dynamic Pricing Strategy (DPS) to price travelers' trips in intelligent transportation platforms (e.g., DiDi, Lyft, Uber), charging trips according to their “congestion contributions” to the global urban traffic system. The dynamic pricing strategy computes a matching between n travelers' trips and their potential travel routes (each trip has k potential routes) that minimizes global traffic congestion. We believe that DPS holds the potential to benefit society and the environment by reducing traffic congestion and enabling smarter and greener transportation. The DPS problem is challenging due to its high computational complexity (there exist k^n possible matchings). We develop an efficient and effective approximate matching algorithm based on local search, as well as pruning techniques that further enhance matching efficiency. The accuracy and efficiency of the dynamic pricing strategy are verified by extensive experiments on real datasets.
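A minimal local-search sketch for the trip-to-route matching (the quadratic edge-load cost is an illustrative stand-in for the platform's congestion model):

```python
# Illustrative local search over route assignments to reduce congestion.
from collections import Counter

def congestion(assignment, routes):
    """Sum of squared edge loads: overlapping trips get penalized."""
    load = Counter()
    for trip, r in enumerate(assignment):
        for edge in routes[trip][r]:
            load[edge] += 1
    return sum(c * c for c in load.values())

def local_search(routes, iters=100):
    """routes[i][j] = list of edges of the j-th candidate route of trip i."""
    assign = [0] * len(routes)                  # start: everyone's 1st route
    best = congestion(assign, routes)
    for _ in range(iters):
        improved = False
        for trip in range(len(routes)):         # try rerouting one trip
            for r in range(len(routes[trip])):
                if r == assign[trip]:
                    continue
                cand = assign.copy(); cand[trip] = r
                cost = congestion(cand, routes)
                if cost < best:
                    assign, best, improved = cand, cost, True
        if not improved:
            break                                # local optimum reached
    return assign, best

routes = [[["e1", "e2"], ["e3"]], [["e1"], ["e4"]]]  # 2 trips, k=2 each
print(local_search(routes))
```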
We consider the adaptive influence maximization problem: given a network and a budget k, iteratively select k seeds in the network to maximize the expected number of adopters. In the full-adoption feedback model, after selecting each seed, the seed-picker observes all the resulting adoptions. In the myopic feedback model, the seed-picker only observes whether each neighbor of the chosen seed adopts. Motivated by the extreme success of greedy-based algorithms/heuristics for influence maximization, we propose the concept of the greedy adaptivity gap, which compares the performance of the adaptive greedy algorithm to that of its non-adaptive counterpart. Our first result shows that, for submodular influence maximization, the adaptive greedy algorithm can perform up to a (1-1/e)-fraction worse than the non-adaptive greedy algorithm, and that this ratio is tight. More specifically, on the one hand we provide examples where the performance of the adaptive greedy algorithm is only a (1-1/e) fraction of the performance of the non-adaptive greedy algorithm in four settings: for both feedback models and for both the independent cascade model and the linear threshold model. On the other hand, we prove that in any submodular cascade, the adaptive greedy algorithm always outputs a (1-1/e)-approximation to the expected number of adoptions achieved by the optimal non-adaptive seed choice. Our second result shows that, for the general submodular cascade model with full-adoption feedback, the adaptive greedy algorithm can outperform the non-adaptive greedy algorithm by an unbounded factor. Finally, we propose a risk-free variant of the adaptive greedy algorithm that always performs no worse than the non-adaptive greedy algorithm.
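For reference, here is a minimal non-adaptive greedy baseline under the independent cascade model, with Monte Carlo spread estimation (the toy graph, activation probability, and simulation budget are illustrative assumptions):

```python
# Illustrative non-adaptive greedy influence maximization (independent cascade).
import random

def simulate_ic(graph, seeds, p=0.2):
    """One independent-cascade run; returns the number of adopters."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and random.random() < p:
                    active.add(v); nxt.append(v)
        frontier = nxt
    return len(active)

def greedy(graph, k, runs=200):
    seeds = set()
    for _ in range(k):                         # add the best marginal seed
        best_u, best_gain = None, -1.0
        for u in graph:
            if u in seeds:
                continue
            gain = sum(simulate_ic(graph, seeds | {u}) for _ in range(runs)) / runs
            if gain > best_gain:
                best_u, best_gain = u, gain
        seeds.add(best_u)
    return seeds

graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
random.seed(0)
print(greedy(graph, k=2))
```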
We consider the problem of Named Entity Recognition (NER) on biomedical scientific literature, and more specifically genomic variant recognition. Significant success has been achieved for NER on canonical tasks in recent years, where large data sets are generally available. However, NER remains challenging in many domain-specific areas, especially domains where only small gold-standard annotations can be obtained. In addition, genomic variant entities exhibit diverse linguistic heterogeneity, differing markedly from the entities characterized in existing canonical NER tasks. State-of-the-art machine learning approaches rely heavily on arduous feature engineering to characterize these unique patterns. In this work, we present the first successful end-to-end deep learning approach to bridge the gap between generic NER algorithms and low-resource applications, using genomic variant recognition as the target task. Our proposed model achieves promising performance without any hand-crafted features or post-processing rules. Our extensive experiments and results may shed light on other similar low-resource NER applications.
Effective modeling of electronic health records (EHR) is rapidly becoming an important topic in both academia and industry. A recent study showed that using the graphical structure underlying EHR data (e.g., the relationship between diagnoses and treatments) improves the performance of prediction tasks such as heart failure prediction. However, EHR data do not always contain complete structure information; moreover, in claims data, structure information is completely unavailable to begin with. Under such circumstances, can we still do better than treating EHR data as a flat-structured bag of features? In this paper, we study the possibility of jointly learning the hidden structure of EHR data while performing supervised prediction tasks on it. Specifically, we argue that the Transformer is a suitable base model for learning the hidden EHR structure, and propose the Graph Convolutional Transformer, which uses data statistics to guide the structure learning process. The proposed model consistently outperformed previous approaches empirically, on both synthetic data and publicly available EHR data, for various prediction tasks such as graph reconstruction and readmission prediction, indicating that it can serve as an effective general-purpose representation learning algorithm for EHR data.
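A minimal sketch of how data statistics might guide self-attention, in the spirit of the model above (the mixing weight `lam` and the shapes are illustrative assumptions, not the paper's exact conditioning):

```python
# Illustrative structure-guided self-attention: co-occurrence statistics
# act as a prior over the learned attention pattern.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def guided_attention(X, Wq, Wk, Wv, prior, lam=0.5):
    """X: (n, d) feature embeddings (e.g., diagnoses and treatments);
    prior: (n, n) row-normalized conditional co-occurrence statistics."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    attn = lam * prior + (1 - lam) * scores    # statistics guide structure
    return attn @ V

rng = np.random.default_rng(0)
n, d = 5, 16
X = rng.normal(size=(n, d))
prior = softmax(rng.normal(size=(n, n)))       # stand-in for P(code_j | code_i)
W = [rng.normal(size=(d, d)) for _ in range(3)]
print(guided_attention(X, *W, prior).shape)    # (5, 16)
```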
Rare diseases affect hundreds of millions of people worldwide but are hard to detect, since they have extremely low prevalence rates (varying from 1 in 1,000 to 1 in 200,000 patients) and are massively underdiagnosed. How do we reliably detect rare diseases at such low prevalence rates? How can we further leverage patients with possibly uncertain diagnoses to improve detection? In this paper, we propose a Complementary pattern Augmentation (CONAN) framework for rare disease detection. CONAN combines ideas from adversarial training and max-margin classification. It first learns a self-attentive and hierarchical embedding to characterize patient patterns. Then, it uses a complementary generative adversarial network (GAN) model to generate candidate positive and negative samples from patients with uncertain diagnoses by encouraging a max-margin between classes. In addition, CONAN has a disease detector that serves as the discriminator during adversarial training and identifies rare diseases. We evaluated CONAN on two disease detection tasks. For detecting the low-prevalence inflammatory bowel disease (IBD), CONAN achieved 0.96 precision-recall area under the curve (PR-AUC), a 50.1% relative improvement over the best baseline. For detecting the rare disease idiopathic pulmonary fibrosis (IPF), CONAN achieved 0.22 PR-AUC, a 41.3% relative improvement over the best baseline.
Machine learning is increasingly being used to make decisions in a wide variety of applications, such as medical diagnosis and autonomous driving. Privacy concerns related to the training data, and unfair behavior of some decisions with regard to certain attributes (e.g., sex, race), are becoming more critical, so constructing a fair machine learning model that simultaneously provides privacy protection is a challenging problem. In this paper, we focus on the design of a classification model with both fairness and differential privacy guarantees by jointly combining the functional mechanism with decision boundary fairness. To enforce ϵ-differential privacy and fairness, we leverage the functional mechanism to add different amounts of Laplace noise, depending on the attribute, to the polynomial coefficients of the objective function while taking the fairness constraint into consideration. We further propose a utility-enhancement scheme, called the relaxed functional mechanism, that adds Gaussian noise instead of Laplace noise and hence achieves (ϵ, δ)-differential privacy; based on it, we design an (ϵ, δ)-differentially private and fair classification model. Our theoretical analysis and empirical results demonstrate that both approaches achieve fairness and differential privacy while preserving good utility, and that they outperform state-of-the-art algorithms.
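To illustrate the functional-mechanism idea, here is a minimal sketch that perturbs the polynomial coefficients of a least-squares objective with Laplace noise scaled to a given sensitivity, then minimizes the noisy objective (the model, sensitivity value, and `private_ridge` helper are illustrative assumptions; the paper additionally folds in the fairness constraint and a classification objective):

```python
# Illustrative functional mechanism: noise the objective's polynomial
# coefficients rather than the output or the gradients.
import numpy as np

def private_ridge(X, y, epsilon, sensitivity, lam=1e-2, rng=None):
    """Least-squares loss is a polynomial in w with coefficients
    A = X^T X and b = X^T y; we add Laplace noise to both."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    A = X.T @ X + rng.laplace(0.0, scale, size=(X.shape[1],) * 2)
    b = X.T @ y + rng.laplace(0.0, scale, size=X.shape[1])
    # Regularize so the perturbed quadratic stays well-conditioned.
    return np.linalg.solve(A + lam * np.eye(X.shape[1]), b)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)); w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)
print(private_ridge(X, y, epsilon=1.0, sensitivity=2.0, rng=rng))
```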
The Optimal Power Flow (OPF) problem is a fundamental building block for the optimization of electrical power systems. It is nonlinear and nonconvex, and computes the generator setpoints for power and voltage given a set of load demands. It is often solved repeatedly under varying conditions, either in real time or in large-scale studies, a need further exacerbated by the increasing stochasticity of power systems due to renewable energy sources in front of and behind the meter. To address these challenges, this paper presents a deep learning approach to the OPF. The learning model exploits the information available in similar states of the system (which is commonly available in practical applications), as well as a dual Lagrangian method to satisfy the physical and engineering constraints present in the OPF. The proposed model is evaluated on a large collection of realistic medium-sized power systems. The experimental results show that its predictions are highly accurate, with average errors as low as 0.2%. Additionally, the proposed approach improves the accuracy of the widely adopted linear DC approximation by at least two orders of magnitude.
Molecule optimization generates a molecule Y with more desirable properties from an input molecule X. State-of-the-art approaches partition molecules into a large set of substructures S and grow the new molecule by iteratively predicting which substructure from S to add. However, since the set of available substructures S is large, this iterative prediction task is often inaccurate, especially for substructures that are infrequent in the training data. To address this challenge, we propose a new generation strategy called “Copy & Refine” (CORE): at each step, the generator first decides whether to copy an existing substructure from the input X or to generate a new substructure, and the most promising substructure is then added to the new molecule. Combined with scaffolding tree generation and adversarial training, CORE significantly improves on several recent molecule optimization methods across various measures, including drug-likeness (QED), dopamine receptor activity (DRD2) and penalized LogP. We tested CORE and baselines on the ZINC database; CORE obtained up to 11% and 21% relative improvement in success rate over the baselines on the complete test set and on the subset with infrequent substructures, respectively.
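A minimal sketch of the copy-or-generate decision at one growth step (the scores, vocabulary, and `p_copy` gate are illustrative assumptions; CORE learns these jointly with the scaffolding-tree generator):

```python
# Illustrative copy-or-generate gate for one substructure-growth step.
import numpy as np

def softmax(z):
    z = z - z.max(); e = np.exp(z); return e / e.sum()

def next_substructure(input_subs, vocab, copy_scores, gen_scores, p_copy):
    """With probability p_copy pick the best substructure already present in
    the input molecule X; otherwise generate from the full vocabulary S."""
    if np.random.random() < p_copy:
        probs = softmax(copy_scores)           # scores over X's substructures
        return input_subs[int(np.argmax(probs))], "copied"
    probs = softmax(gen_scores)                # scores over all of S
    return vocab[int(np.argmax(probs))], "generated"

np.random.seed(0)
input_subs = ["benzene", "hydroxyl"]
vocab = ["benzene", "hydroxyl", "amine", "carboxyl"]
print(next_substructure(input_subs, vocab,
                        copy_scores=np.array([2.0, 0.5]),
                        gen_scores=np.array([0.1, 0.2, 1.5, 0.3]),
                        p_copy=0.7))
```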
The automatic style translation of Chinese characters (CH-Char) is a challenging problem. Unlike English or general artistic style transfer, Chinese characters comprise a large number of glyphs with complicated content and characteristic styles. Early methods for CH-Char synthesis are inefficient and require manual intervention. Recently, some GAN-based methods have been proposed for font generation; the supervised ones require numerous image pairs, which are difficult to obtain for many chirography styles, while unsupervised methods often produce blurred and incorrect strokes. Therefore, in this work, we propose a three-stage Generative Adversarial Network (GAN) architecture for multi-chirography image translation with unpaired training data, divided into skeleton extraction, skeleton transformation and stroke rendering. Specifically, we first propose a fast skeleton extraction method (ENet). Second, we use the extracted skeleton and the original image to train a GAN model, RNet (a stroke rendering network), to learn how to render the skeleton with stroke details in the target style. Finally, the pre-trained RNet is employed to assist another GAN model, TNet (a skeleton transformation network), in learning to transform the skeleton structure on an unlabeled skeleton set. We demonstrate the validity of our method on two chirography datasets that we established.
Modeling student knowledge is critical in adaptive learning environments. Predictive student modeling enables formative assessment of student knowledge and skills, and it drives personalized support to create learning experiences that are both effective and engaging. Traditional approaches to predictive student modeling utilize features extracted from students’ interaction trace data to predict student test performance, aggregating student test performance as a single output label. We reformulate predictive student modeling as a multi-task learning problem, modeling questions from student test data as distinct “tasks.” We demonstrate the effectiveness of this approach by utilizing student data from a series of laboratory-based and classroom-based studies conducted with a game-based learning environment for microbiology education, Crystal Island. Using sequential representations of student gameplay, results show that multi-task stacked LSTMs with residual connections significantly outperform baseline models that do not use the multi-task formulation. Additionally, the accuracy of predictive student models is improved as the number of tasks increases. These findings have significant implications for the design and development of predictive student models in adaptive learning environments.
Personalized trip recommendation aims to recommend a sequence of points of interest (POIs) to a user. Most existing studies select POIs only according to the popularity of the POIs themselves. In fact, the routes among POIs are also attractive to visitors, and some of these routes are highly popular; we term such a route an Attractive Route (AR), as it provides extra user experience. In this paper, we study attractive routes to improve personalized trip recommendation. To address the challenges of discovering and evaluating ARs, we propose a personalized Trip Recommender with POIs and Attractive Routes (TRAR). TRAR discovers attractive routes based on the popularity and the Gini coefficient of POIs, and then utilizes a gravity model in a category space to estimate the rating scores and preferences of the attractive routes. Based on these estimates, TRAR recommends a trip with ARs that maximizes user experience while balancing the tradeoff between time cost and user experience. Experimental results show the superiority of TRAR over other state-of-the-art methods.
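The Gini coefficient used for AR discovery is straightforward to compute; a minimal version over POI visit counts (with illustrative counts) is:

```python
# Illustrative Gini coefficient over POI visit counts.
import numpy as np

def gini(counts):
    """0 = perfectly even visits, ~1 = visits concentrated on few POIs."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

print(gini([100, 100, 100, 100]))  # 0.0   -> even popularity
print(gini([1, 1, 1, 997]))        # ~0.75 -> one dominant POI
```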
Failure to land a job can have serious social consequences for college students, such as drunkenness and suicide. In addition to academic performance, unconscious biases can become a key obstacle for graduating students hunting for jobs. It is therefore necessary to understand these unconscious biases so that we can help students at an early stage with more personalized intervention. In this paper, we develop a framework, MAYA (Multi-mAjor emploYment stAtus), to predict students' employment status while accounting for biases. The framework consists of four major components. First, we address the heterogeneity of student courses by embedding academic performance into a unified space. Second, we apply a generative adversarial network (GAN) to overcome the class imbalance problem. Third, we adopt a Long Short-Term Memory (LSTM) network with a novel dropout mechanism to comprehensively capture sequential information across semesters. Finally, we design a bias-based regularization to capture job market biases. We conduct extensive experiments on a large-scale educational dataset, and the results demonstrate the effectiveness of our prediction framework.
Many social and economic systems can be represented as attributed networks encoding relations between entities that are themselves described by node attributes. Finding anomalies in these systems is crucial for detecting abuses such as credit card fraud, web spam or network intrusions. Intuitively, anomalous nodes are nodes whose attributes differ starkly from those of a certain reference set of nodes, called the context of the anomaly. While some methods spot anomalies locally, globally or within a community context, the problem remains challenging due to the multi-scale composition of real networks and the heterogeneity of node metadata. Here, we propose a principled way to uncover outlier nodes simultaneously with the context with respect to which they are anomalous, at all relevant scales of the network. We characterize anomalous nodes in terms of the concentration retained for each node after smoothing specific signals localized on the vertices of the graph. In addition, we introduce a graph signal processing formulation of the Markov stability framework used in community detection in order to find the context of anomalies. The performance of our method is assessed on synthetic and real-world attributed networks and shows superior results compared with state-of-the-art algorithms. Finally, we show that our approach scales to large networks by employing Chebyshev polynomial approximations.
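A much-simplified sketch of the smoothing intuition: diffuse node attributes over the graph and score each node by how far its own attributes sit from its smoothed context at several scales (the random-walk smoother and the choice of scales are illustrative simplifications of the paper's signal-concentration machinery):

```python
# Illustrative multi-scale smoothing for attributed-network anomaly scoring.
import numpy as np

def anomaly_scores(A, X, scales=(1, 2, 4)):
    """A: (n, n) adjacency with self-loops; X: (n, d) node attributes."""
    P = A / A.sum(axis=1, keepdims=True)       # random-walk transition matrix
    scores, S, step = {}, X.copy(), 0
    for t in sorted(scales):
        while step < t:                         # smooth up to scale t
            S = P @ S
            step += 1
        scores[t] = np.linalg.norm(X - S, axis=1)  # deviation from context
    return scores

rng = np.random.default_rng(0)
n = 20
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 1)  # symmetric, self-loops
X = rng.normal(size=(n, 3)); X[7] += 5          # plant one attribute outlier
for t, s in anomaly_scores(A, X).items():
    print(t, int(np.argmax(s)))                 # node 7 should stand out
```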
Correcting arithmetical exercises has always been a labor-intensive and time-consuming task for primary school teachers. To reduce their burden, we propose the Arithmetical Exercise Checker (AEC), the first system that automatically evaluates all arithmetical expressions (AEs) on exercise images. The major challenge is that an AE mixes printed and handwritten text in particular arithmetical patterns (e.g., multi-line expressions, fractions). The handwritten parts of an AE usually produce zigzag boundaries and tangled rows; worse, an AE may be arithmetically incorrect, which makes contextual information less valuable for recognition. To tackle these problems, we introduce integrated detection, recognition and evaluation branches that leverage AEs' intrinsic features, namely 1) indistinct boundaries, 2) locally relevant patterns and 3) globally irrelevant symbols. Experimental results demonstrate that AEC yields 93.72% correction accuracy on 40 kinds of mainstream primary arithmetical exercises. The online service of AEC currently processes 75,000 exercises per day on average and has already reduced the burden of over 1,000,000 users. AEC shows the benefits of a vision-based system for relieving teachers of repetitive tasks.
Pairwise learning has received much attention recently, as it is well suited to modeling the relative relationships between pairs of samples. Many machine learning tasks, such as AUC maximization and metric learning, can be categorized as pairwise learning. Existing techniques for pairwise learning all fail to consider a critical issue in their design: the protection of sensitive information in the training set. Models learned by such algorithms can implicitly memorize the details of sensitive information, giving malicious parties an opportunity to infer it from the learned models. To address this challenging issue, we propose several differentially private pairwise learning algorithms for both online and offline settings. Specifically, for the online setting, we first introduce a differentially private algorithm (called OnPairStrC) for strongly convex loss functions, and then extend it to general convex loss functions with another differentially private algorithm (called OnPairC). For the offline setting, we present two differentially private algorithms (called OffPairStrC and OffPairC) for strongly convex and general convex loss functions, respectively. These algorithms not only learn models effectively from the data but also provide strong privacy protection guarantees for sensitive information in the training set. Extensive experiments on real-world datasets evaluate the proposed algorithms, and the experimental results support our theoretical analysis.
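As a generic illustration of private pairwise learning (not the OnPairStrC/OffPairC constructions themselves), here is a gradient-perturbation sketch over a pairwise logistic loss, with clipping as a stand-in for sensitivity control:

```python
# Illustrative gradient-perturbation SGD for a pairwise logistic loss.
import numpy as np

def pairwise_logistic_grad(w, xi, yi, xj, yj):
    """Gradient of log(1 + exp(-(yi - yj) * w.(xi - xj))) for yi != yj."""
    d = xi - xj
    margin = (yi - yj) * (w @ d)
    return -(yi - yj) * d / (1.0 + np.exp(margin))

def dp_pairwise_sgd(pairs, dim, sigma=1.0, clip=1.0, lr=0.1, rng=None):
    rng = rng or np.random.default_rng()
    w = np.zeros(dim)
    for (xi, yi), (xj, yj) in pairs:
        g = pairwise_logistic_grad(w, xi, yi, xj, yj)
        g *= min(1.0, clip / (np.linalg.norm(g) + 1e-12))  # clip sensitivity
        g += rng.normal(0.0, sigma * clip, size=dim)       # Gaussian noise
        w -= lr * g
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)); y = (X[:, 0] > 0).astype(float)
pairs = [((X[i], y[i]), (X[j], y[j]))
         for i in range(50) for j in (i + 50,) if y[i] != y[j]]
print(dp_pairwise_sgd(pairs, dim=4, rng=rng))
```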
Adverse drug-drug interactions (DDIs) remain a leading cause of morbidity and mortality, so identifying potential DDIs during the drug design process is critical for patients and society. Although several computational models have been proposed for DDI prediction, limitations remain: (1) specialized drug representations designed for DDI prediction are lacking; (2) predictions are based on limited labelled data and do not generalize well to unseen drugs or DDIs; and (3) models have large numbers of parameters and are thus hard to interpret. In this work, we develop a ChemicAl SubstrucTurE Representation (CASTER) framework that predicts DDIs given the chemical structures of drugs. CASTER mitigates these limitations via (1) a sequential pattern mining module, rooted in the DDI mechanism, that efficiently characterizes the functional sub-structures of drugs; (2) an auto-encoding module that leverages both labelled and unlabelled chemical structure data to improve predictive accuracy and generalizability; and (3) a dictionary learning module that explains each prediction via a small set of coefficients measuring the relevance of each input sub-structure to the DDI outcome. We evaluated CASTER on two real-world DDI datasets and showed that it performs better than state-of-the-art baselines while providing interpretable predictions.
This paper presents a deep reinforcement learning algorithm for online accompaniment generation, with potential for real-time interactive human-machine duet improvisation. Unlike offline music generation and harmonization, online music accompaniment requires the algorithm to respond to human input and generate the machine counterpart sequentially. We cast this as a reinforcement learning problem in which the generation agent learns a policy to generate a musical note (action) based on the previously generated context (state). The key to this algorithm is a well-functioning reward model; instead of defining it with music composition rules, we learn it from monophonic and polyphonic training data. The model considers the compatibility of the machine-generated note with both the machine-generated context and the human-generated context. Experiments show that the algorithm responds to the human part and generates a melodic, harmonic and diverse machine part. Subjective preference evaluations show that the proposed algorithm generates music of higher quality than the baseline method.
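A minimal sketch of the state/action/reward loop described above, with a linear softmax policy and a toy consonance reward standing in for the learned reward model (all names and parameters are illustrative assumptions):

```python
# Illustrative RL step for online accompaniment: sample a note from a
# policy conditioned on machine and human context, score it, update.
import numpy as np

rng = np.random.default_rng(0)
N_PITCHES, CTX = 12, 8                    # pitch classes, context length

def softmax(z):
    z = z - z.max(); e = np.exp(z); return e / e.sum()

def policy_step(theta, machine_ctx, human_ctx):
    state = np.concatenate([machine_ctx, human_ctx])   # (2*CTX,)
    probs = softmax(theta @ state)                     # distribution over notes
    action = rng.choice(N_PITCHES, p=probs)
    return action, probs

def toy_reward(action, human_ctx):
    """Stand-in reward: consonance with the human's last note."""
    interval = abs(action - int(human_ctx[-1])) % 12
    return 1.0 if interval in (0, 3, 4, 7) else -0.5

theta = rng.normal(size=(N_PITCHES, 2 * CTX)) * 0.1
machine_ctx = rng.integers(0, N_PITCHES, CTX).astype(float)
human_ctx = rng.integers(0, N_PITCHES, CTX).astype(float)
action, probs = policy_step(theta, machine_ctx, human_ctx)
# REINFORCE: push up log-probability of actions with high reward.
grad = np.outer(np.eye(N_PITCHES)[action] - probs,
                np.concatenate([machine_ctx, human_ctx]))
theta += 0.01 * toy_reward(action, human_ctx) * grad
print(action, toy_reward(action, human_ctx))
```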
Reconstructing the components of a genomic mixture from DNA sequencing data is a challenging problem encountered in a variety of applications, including single-individual haplotyping and studies of viral communities. High-throughput DNA sequencing platforms oversample mixture components to provide massive amounts of reads whose relative positions can be determined by mapping the reads to a known reference genome; assembling the components, however, requires discovering each read's origin, an NP-hard problem that existing methods struggle to solve with the required accuracy. In this paper, we present a learning framework based on a graph auto-encoder designed to exploit the structural properties of sequencing data. The algorithm is a neural network that essentially learns to ignore sequencing errors and infers the posterior probabilities of the origins of sequencing reads. Mixture components are then reconstructed by finding the consensus of the reads determined to originate from the same genomic component. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework reliably assembles haplotypes and reconstructs viral communities, often significantly outperforming state-of-the-art techniques. Source code, datasets and a supplementary document are available at https://github.com/WuLoli/GAEseq.
We propose an approach to generating realistic and high-fidelity stock market data based on generative adversarial networks (GANs). Our Stock-GAN model employs a conditional Wasserstein GAN to capture the history dependence of orders. The generator design includes specially crafted components that approximate the market's auction mechanism and augments the order history with order-book constructions to improve the generation task. We perform an ablation study to verify the usefulness of these aspects of our network structure, provide a mathematical characterization of the distribution learned by the generator, and propose statistics to measure the quality of generated orders. We test our approach on synthetic and actual market data, compare it with several baseline generative models, and find the generated data to be close to real data.
An open research problem in automatic signature verification is defending against skilled forgery attacks; however, skilled forgeries are very difficult to acquire for representation learning. To tackle this issue, this paper proposes to learn dynamic signature representations by ranking synthesized signatures. First, a neuromotor-inspired signature synthesis method is proposed to synthesize signatures with different distortion levels for any template signature. Then, given the templates, we construct a lightweight one-dimensional convolutional network that learns to rank the synthesized samples, directly optimizing the average precision of the ranking to exploit relative and fine-grained signature similarities. Finally, after training, fixed-length representations can be extracted from dynamic signatures of variable length for verification. A highlight of our method is that it requires neither skilled nor random forgeries for training, yet it surpasses the state of the art by a large margin on two public benchmarks.
Machine learning and data mining techniques are increasingly being applied to electronic health record (EHR) data to discover underlying patterns and make predictions for clinical use. For instance, these data may be used to predict clinical deterioration events such as cardiopulmonary arrest or escalation of care to the intensive care unit (ICU). In clinical practice, early warning systems with multiple time horizons could indicate different levels of urgency, allowing clinicians to make decisions regarding triage, testing, and interventions for patients at risk of poor outcomes. Alerts at these different horizons are related and have intrinsic dependencies, which calls for multi-task learning. In this paper, we investigate approaches to properly training deep multi-task models for predicting clinical deterioration events by generating multi-horizon alerts for hospitalized patients outside the ICU, with particular application to oncology patients. Prior knowledge is used as regularization to exploit the positive effects of task relatedness, and we propose task-specific loss balancing to reduce the negative effects of optimizing the joint loss function of deep multi-task models. In addition, we demonstrate the effectiveness of the feature-generating techniques through interpretation of the prediction outcomes. To evaluate model performance for predicting multi-horizon deterioration alerts in a real-world scenario, we apply our approaches to EHR data from 20,700 hospitalizations of adult oncology patients. These patients' baseline high-risk status provides a unique opportunity: applying an accurate model to an enriched population could improve positive predictive value and reduce false-positive alerts. On our dataset, the model applying all proposed learning techniques achieves the best performance compared with common models previously developed for clinical deterioration warning.
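A minimal sketch of a task-balanced joint loss over several alert horizons (the weights and per-task losses are illustrative; the paper sets task-specific balancing and adds prior-knowledge regularization):

```python
# Illustrative task-weighted joint loss for multi-horizon alert prediction.
import numpy as np

def bce(y, p, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def joint_loss(preds, labels, weights):
    """preds/labels: dict horizon -> arrays; weights: horizon -> float."""
    total = sum(weights.values())
    return sum(weights[h] * bce(labels[h], preds[h]) for h in preds) / total

rng = np.random.default_rng(0)
horizons = ["6h", "12h", "24h"]
labels = {h: rng.integers(0, 2, 100).astype(float) for h in horizons}
preds = {h: rng.random(100) for h in horizons}
weights = {"6h": 2.0, "12h": 1.0, "24h": 1.0}  # weight urgent alerts more
print(round(joint_loss(preds, labels, weights), 3))
```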