Sarcasm Suite is a browser-based engine that deploys systems from five of our past papers on sarcasm detection and generation. The sarcasm detection modules use four kinds of incongruity: sentiment incongruity, semantic incongruity, historical context incongruity and conversational context incongruity. The sarcasm generation module is a chatbot that responds sarcastically to user input. With a visually appealing interface that indicates predictions using 'faces' of our co-authors from our past papers, Sarcasm Suite is our first demonstration of our work in computational sarcasm.
In this paper, we present a demo, AniDraw, which helps users practice the coordination between their hands, mouth and eyes by combining elements of music, painting and dance. Users sketch a cartoon character on a multi-touch screen and then hum a song, which drives the cartoon character to dance and creates a lively animation. Technically, AniDraw applies an acoustic-driving mechanism: it extracts time-domain acoustic features and maps them to the intensity of the dance, frequency-domain features to the style of the dance, and high-level features including onsets and tempo to the start, duration and speed of the dance moves. AniDraw can not only stimulate users' enthusiasm for artistic creation, but also enhance their aesthetic sense of harmony.
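As a rough illustration of this acoustic-driving idea, the sketch below extracts time-domain, frequency-domain and onset/tempo features from a hummed recording and maps them to dance parameters. It is a minimal sketch under our own assumptions: the use of librosa, the DanceParams fields and the mapping thresholds are illustrative, not AniDraw's actual implementation.

```python
# Minimal sketch (our assumption, not AniDraw's code) of mapping hummed audio
# to dance parameters with librosa.
from dataclasses import dataclass

import librosa
import numpy as np


@dataclass
class DanceParams:
    intensity: float          # driven by time-domain energy
    style: str                # driven by frequency-domain features
    onset_times: np.ndarray   # when individual dance moves start (seconds)
    tempo: float              # controls the speed of the moves (BPM)


def audio_to_dance(path: str) -> DanceParams:
    y, sr = librosa.load(path, mono=True)

    # Time-domain feature: RMS energy -> intensity of the dance
    rms = librosa.feature.rms(y=y)[0]
    intensity = float(np.clip(rms.mean() / (rms.max() + 1e-9), 0.0, 1.0))

    # Frequency-domain feature: spectral centroid -> a coarse style label
    centroid = float(librosa.feature.spectral_centroid(y=y, sr=sr).mean())
    style = "energetic" if centroid > 2000.0 else "graceful"

    # High-level features: onsets and tempo -> start times and speed of moves
    onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    tempo = float(np.atleast_1d(tempo)[0])

    return DanceParams(intensity, style, onset_times, tempo)
```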
Advances in deep reinforcement learning have allowed autonomous agents to perform well on Atari games, often outperforming humans, using only raw pixels to make their decisions. However, most of these games take place in 2D environments that are fully observable to the agent. In this paper, we present Arnold, a completely autonomous agent that plays first-person shooter games using only screen pixel data, and demonstrate its effectiveness on Doom, a classic first-person shooter game. Arnold is trained with deep reinforcement learning using a recent Action-Navigation architecture, which uses separate deep neural networks for exploring the map and fighting enemies. Furthermore, it employs several techniques, such as augmenting the input with high-level game features, reward shaping and sequential updates, for efficient training and effective performance. Arnold outperforms average human players as well as in-built game bots on different variations of the deathmatch scenario. It also obtained the highest kill-to-death ratio in both tracks of the Visual Doom AI Competition and placed second in terms of the number of frags.
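To make the Action-Navigation idea concrete, here is a minimal sketch of the two-network switch: one Q-network handles combat when an enemy is detected in the high-level game features, the other handles exploration otherwise. The network sizes, action counts and the enemy_visible flag are our own illustrative assumptions, not the released Arnold code.

```python
# Minimal sketch (our assumptions) of a two-network action/navigation agent.
import torch
import torch.nn as nn


class SmallDQN(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(n_actions)  # Q-value per action

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(frames))


class ActionNavigationAgent:
    def __init__(self, n_combat_actions: int = 8, n_nav_actions: int = 4):
        self.action_net = SmallDQN(n_combat_actions)   # shooting, aiming, ...
        self.navigation_net = SmallDQN(n_nav_actions)  # moving, turning, ...

    def act(self, frame: torch.Tensor, enemy_visible: bool) -> int:
        # Switch between the two networks based on a high-level game feature
        net = self.action_net if enemy_visible else self.navigation_net
        with torch.no_grad():
            q_values = net(frame.unsqueeze(0))
        return int(q_values.argmax(dim=1).item())


# Usage with a dummy 3x60x108 screen buffer
agent = ActionNavigationAgent()
action = agent.act(torch.rand(3, 60, 108), enemy_visible=True)
print(action)
```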
In this demo, we develop a mobile running application, SenseRun, that incorporates landscape experiences into route recommendation. We first define landscape experiences, i.e. the perceived enjoyment of the landscape that motivates running, in terms of public natural areas and traffic density. Based on landscape experiences, we categorize locations into three types (natural, leisure and traffic spaces) and assign each a different base weight. Real-time context factors (weather, season and hour of the day) are used to adjust the weights. We propose a multi-attribute method that recommends weighted routes based on an MVT (Marginal Value Theorem) k-shortest-paths algorithm. We also use a landscape-aware sound algorithm to supplement the landscape experience. Experimental results show that SenseRun can enhance the running experience and helps promote regular physical activity.
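The sketch below illustrates the weighted route-recommendation step: edge costs combine distance, a landscape-type base weight and a real-time context adjustment, and the k best routes are enumerated with a k-shortest-paths search. The networkx graph, the BASE_WEIGHT values and the context rules are our own illustrative assumptions rather than SenseRun's MVT-based implementation.

```python
# Minimal sketch (our assumptions, not SenseRun's code) of context-weighted
# k-shortest-path route recommendation over a small waypoint graph.
import itertools

import networkx as nx

# Lower cost = more attractive running environment
BASE_WEIGHT = {"natural": 0.2, "leisure": 0.5, "traffic": 1.0}


def context_factor(space: str, weather: str, hour: int) -> float:
    # Real-time context adjustment (illustrative rules only)
    factor = 1.0
    if space == "natural" and weather == "rain":
        factor *= 1.5          # muddy parks are less attractive in rain
    if space == "natural" and (hour < 6 or hour > 21):
        factor *= 1.3          # unlit green areas at night
    return factor


def recommend_routes(G: nx.Graph, start, goal, weather="sunny", hour=18, k=3):
    for _, _, data in G.edges(data=True):
        # Edge cost = distance scaled by landscape type and current context
        data["cost"] = (data["length"] * BASE_WEIGHT[data["space"]]
                        * context_factor(data["space"], weather, hour))
    paths = nx.shortest_simple_paths(G, start, goal, weight="cost")
    return list(itertools.islice(paths, k))   # the k best routes


# Toy map: nodes are waypoints, edges carry length (metres) and space type
G = nx.Graph()
G.add_edge("home", "park", length=800, space="natural")
G.add_edge("park", "track", length=600, space="leisure")
G.add_edge("home", "road", length=500, space="traffic")
G.add_edge("road", "track", length=400, space="traffic")
print(recommend_routes(G, "home", "track", weather="rain", hour=7))
```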
Besides fashion, personalization is another important factor in dressing. How to balance fashion trends and personal preference for a more satisfying wearing experience is a non-trivial task. In previous work we developed a demo, Magic Mirror, to recommend clothing collocations based on fashion trends. However, people's aesthetic tastes are highly diverse. To meet these different demands, Magic Mirror is upgraded in this paper: it can now give recommendations that consider both fashion trends and personal preference, and work as a private clothing consultant. For more suitable recommendations, the virtual consultant learns users' tastes and preferences from their behavior using a genetic algorithm. Users can get collocation or matched top/bottom recommendations after choosing an occasion and style. They can also get a report on their fashion state and aesthetic standpoint based on their recent wearing.
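As a hedged illustration of how a genetic algorithm could learn such preferences from behavior, the sketch below evolves a weight vector over hypothetical outfit attributes so that it agrees with a user's observed likes. The attribute names, fitness definition and GA settings are assumptions, not Magic Mirror's implementation.

```python
# Minimal sketch (our assumptions) of learning preference weights with a GA.
import numpy as np

rng = np.random.default_rng(0)
ATTRS = ["formal", "colorful", "sporty", "vintage"]  # hypothetical attributes

# Observed behavior: outfit attribute vectors and whether the user liked them
outfits = rng.random((50, len(ATTRS)))
liked = (outfits @ np.array([0.8, 0.1, 0.6, 0.2]) > 0.9).astype(float)  # toy data


def fitness(weights: np.ndarray) -> float:
    scores = outfits @ weights
    preds = (scores > np.median(scores)).astype(float)
    return float((preds == liked).mean())   # agreement with the user's likes


def evolve(pop_size=40, generations=60, mutation=0.1):
    pop = rng.random((pop_size, len(ATTRS)))
    for _ in range(generations):
        fits = np.array([fitness(w) for w in pop])
        parents = pop[np.argsort(fits)[-pop_size // 2:]]             # selection
        cut = len(ATTRS) // 2
        kids = np.concatenate([parents[:, :cut],
                               rng.permutation(parents)[:, cut:]], axis=1)  # crossover
        kids += rng.normal(0, mutation, kids.shape)                  # mutation
        pop = np.concatenate([parents, kids])
    best = pop[np.argmax([fitness(w) for w in pop])]
    return dict(zip(ATTRS, np.round(best, 2)))


print(evolve())   # learned per-attribute preference weights
```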
The boom of mobile devices and cloud services has led to an explosion of personal photo and video data. However, because user-generated metadata such as titles or descriptions is missing, it usually takes a user many swipes to find a particular video on a phone. To solve this problem, we present an innovative idea called Visual Memory QA, which allows a user not only to search but also to ask questions about her daily life as captured in personal videos. The proposed system automatically analyzes the content of personal videos without user-generated metadata, and offers a conversational interface to accept and answer questions. To the best of our knowledge, it is the first system to answer personal questions from what is discovered in personal photos or videos. Example questions include "what was the last time we went hiking in the forest near San Francisco?", "did we have pizza last week?" and "with whom did I have dinner at AAAI 2015?".
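A minimal sketch of how such a question might be answered over automatically extracted video concepts is shown below; the index schema, keyword matching and example data are purely illustrative assumptions, not the actual Visual Memory QA pipeline.

```python
# Minimal sketch (our assumptions, not the Visual Memory QA system) of
# answering a "last time we ..." question over automatically detected concepts.
from datetime import date

# (recording date, concepts detected automatically in that personal video)
video_index = [
    (date(2015, 6, 2), {"hiking", "forest"}),
    (date(2015, 7, 19), {"pizza", "dinner"}),
    (date(2015, 9, 5), {"hiking", "forest", "san francisco"}),
]

# All concepts the (hypothetical) video analysis pipeline can detect
concept_vocab = set().union(*(concepts for _, concepts in video_index))


def last_time(question: str):
    words = {w.strip("?.,") for w in question.lower().split()}
    keywords = words & concept_vocab                  # map the question to concepts
    matches = [d for d, concepts in video_index if keywords <= concepts]
    return max(matches) if matches else None          # most recent matching video


print(last_time("What was the last time we went hiking in the forest?"))
# -> 2015-09-05
```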
In this work, we present a dynamic response spoken dialogue system (DRSDS). It is capable of understanding the verbal and nonverbal language of users and making instant, situation-aware responses. By incorporating two external systems, MultiSense and email summarization, we built an email reading agent on a mobile device to demonstrate the functionality of DRSDS.
Today's building operation relies either on simple dashboards that do not scale to thousands of sensor data streams or on rules that provide only very limited fault information. In either case, considerable manual effort is required to diagnose building operation problems related to energy usage or occupant comfort. We present a Cognitive Building demo that uses (i) semantic reasoning to model physical relationships of sensors and systems, (ii) machine learning to predict and detect anomalies in energy flow, occupancy and user comfort, and (iii) speech-enabled Augmented Reality interfaces for immersive interaction with thousands of devices. Our demo analyzes data from more than 3,300 sensors and shows how we can automatically diagnose building operation problems.
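As one illustrative assumption of what detecting anomalies in energy flow could look like, the sketch below flags deviations in an hourly energy-meter stream with a rolling z-score; the window and threshold are placeholders, not the demo's actual models.

```python
# Minimal sketch (our assumptions, not the demo's pipeline) of flagging
# anomalies in an energy-meter stream with a rolling z-score.
import numpy as np
import pandas as pd


def detect_anomalies(series: pd.Series, window: int = 24, z_thresh: float = 3.0) -> pd.Series:
    mean = series.rolling(window, min_periods=window).mean()
    std = series.rolling(window, min_periods=window).std()
    z = (series - mean) / std
    return z.abs() > z_thresh   # True where consumption deviates strongly


# Hourly energy readings (kWh) with an injected spike
rng = np.random.default_rng(1)
idx = pd.date_range("2017-01-01", periods=24 * 7, freq="h")
energy = pd.Series(5 + np.sin(np.arange(len(idx)) / 24 * 2 * np.pi)
                   + rng.normal(0, 0.2, len(idx)), index=idx)
energy.iloc[100] += 8   # simulated fault

flags = detect_anomalies(energy)
print(energy[flags])    # the spiked reading is reported
```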
What happened during the Boston Marathon in 2013? Nowadays, at any major event, many people take videos and share them on social media. To fully understand exactly what happened in these major events, researchers and analysts often have to examine thousands of these videos manually. To reduce this manual effort, we present an investigative system that automatically synchronizes these videos to a global timeline and localizes them on a map. In addition to alignment in time and space, our system combines various functions for analysis, including gunshot detection, crowd size estimation, 3D reconstruction and person tracking. To the best of our knowledge, this is the first time a unified framework has been built for comprehensive event reconstruction from social media videos.
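One common building block for placing such videos on a global timeline is audio-based alignment; the sketch below estimates the time offset between two clips of the same event by cross-correlating their audio tracks. This is a minimal illustration under our own assumptions, not necessarily the demo system's synchronization method.

```python
# Minimal sketch (our assumption, not the demo system) of audio-based video
# synchronization via cross-correlation.
import numpy as np
from scipy.signal import correlate


def estimate_offset(audio_a: np.ndarray, audio_b: np.ndarray, sr: int) -> float:
    """Return offset (s) such that audio_a[t] aligns with audio_b[t - offset]."""
    corr = correlate(audio_a, audio_b, mode="full")
    lag = np.argmax(corr) - (len(audio_b) - 1)
    return lag / sr


# Toy example: the same loud event captured by two cameras with a 2 s offset
sr = 8000
rng = np.random.default_rng(2)
base = rng.normal(0, 0.01, 10 * sr)     # background noise
base[3 * sr] = 1.0                      # shared impulsive event (e.g. a gunshot)
clip_a = base
clip_b = np.roll(base, 2 * sr)          # in clip_b the event occurs 2 s later

print(estimate_offset(clip_a, clip_b, sr))   # ~ -2.0: clip_a[t] matches clip_b[t + 2]
```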
Given any complicated or specialized video content search query, e.g. "Batkid (a kid in batman costume)" or "destroyed buildings", existing methods require manually labeled data to build detectors for searching. We present a demonstration of an artificial intelligence application, Webly-labeled Learning (WELL), that enables learning of ad-hoc concept detectors over unlimited Internet videos without any manual annotations. A considerable number of videos on the web are associated with rich but noisy contextual information, such as the title, which provides a type of weak annotation or label of the video content. To leverage this information, our system employs state-of-the-art webly-supervised learning (WELL) (Liang et al.). WELL considers multi-modal information, including deep visual, audio and speech features, to automatically learn accurate video detectors based on the user query. The detectors learned from a large number of web videos allow users to search for relevant videos in their personal video archives, requiring no textual metadata yet being as convenient as searching on YouTube.
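The sketch below is a rough, assumed illustration of this webly-supervised workflow (not the WELL implementation of Liang et al.): weak labels are derived from video titles, an initial detector is trained, and only the most confidently scored videos are kept for a second training round. Features are random stand-ins for the multi-modal features mentioned above.

```python
# Minimal sketch (our assumptions) of webly-supervised concept learning from
# noisy titles.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
titles = ["batkid saves the day", "cute cat compilation",
          "batkid in batman costume", "cooking pasta at home"] * 50

# Stand-ins for multi-modal (visual/audio/speech) features; one dimension is
# weakly correlated with the concept so the toy example has signal to learn.
concept_present = np.array(["batkid" in t for t in titles], dtype=float)
features = rng.normal(size=(len(titles), 64))
features[:, 0] += 2.0 * concept_present

# Step 1: weak labels from contextual metadata (the title); noisy in practice
weak_labels = np.array(["batkid" in t for t in titles], dtype=int)

# Step 2: train an initial detector on the weak labels
clf = LogisticRegression(max_iter=1000).fit(features, weak_labels)

# Step 3: keep the most confidently scored videos and retrain (curriculum-style)
conf = clf.predict_proba(features).max(axis=1)
keep = conf >= np.quantile(conf, 0.3)
clf = LogisticRegression(max_iter=1000).fit(features[keep], weak_labels[keep])

# The resulting detector ranks (unlabeled) personal videos for the query "batkid"
scores = clf.predict_proba(features)[:, 1]
print(titles[int(scores.argmax())])
```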
We demonstrate an integrated system for building and learning models and structures in both real and virtual environments. The system combines natural language understanding, planning, and methods for composing basic concepts into more complicated ones. The user and the system interact via natural language to jointly plan and execute tasks involving building structures, with clarifications and demonstrations to teach the system along the way. We use the same architecture for building and simulating models of biology, demonstrating the general-purpose nature of the system: domain-specific knowledge is concentrated in sub-modules while the basic interaction remains domain-independent. These capabilities are supported by our work on semantic parsing, which generates knowledge structures to be grounded in a physical representation and composed with existing knowledge to create a dynamic plan for completing goals. Prior work on learning from natural language demonstrations enables models to be learned from very few demonstrations, with features extracted from definitions given in natural language. We believe this architecture for interaction opens up a wide range of possibilities for human-computer interaction and knowledge transfer through natural language.
Automatic identification of clinical concepts in electronic medical records (EMR) is useful not only in forming a complete longitudinal health record of patients, but also in recovering missing codes for billing, reducing costs, finding more accurate clinical cohorts for clinical trials, and enabling better clinical decision support. Existing systems for clinical concept extraction are mostly knowledge-driven, relying on exact-match retrieval from original or lemmatized reports, and very few of them scale to handle large volumes of complex, diverse data. In this demonstration we showcase a new system for real-time detection of clinical concepts in EMR. The system features a large vocabulary of over 5.6 million concepts. It achieves high precision and recall, with good tolerance to typos, through the use of a novel prefix indexing and subsequence matching algorithm, along with a recursive negation detector based on efficient, deep parsing. Our system has been tested on over 12.9 million reports of more than 200 different types, collected from 800,000+ patients. A comparison with the state of the art shows that it outperforms previous systems in addition to being the first system to scale to such large collections.
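The sketch below illustrates the general idea of typo-tolerant dictionary lookup via a prefix index plus subsequence matching; the tiny vocabulary, prefix length and length threshold are illustrative assumptions, not the system's actual algorithm or its negation handling.

```python
# Minimal sketch (our assumptions) of typo-tolerant concept lookup: terms are
# indexed by their first characters, and a token matches a candidate term if
# the token's letters appear in order within it.
from collections import defaultdict

VOCAB = ["hypertension", "hyperlipidemia", "diabetes mellitus", "pneumonia"]
PREFIX_LEN = 3

prefix_index = defaultdict(list)
for term in VOCAB:
    prefix_index[term[:PREFIX_LEN]].append(term)    # prefix -> candidate concepts


def is_subsequence(short: str, long: str) -> bool:
    it = iter(long)
    return all(ch in it for ch in short)            # letters of `short` appear in order


def lookup(token: str):
    candidates = prefix_index.get(token[:PREFIX_LEN], [])
    # Tolerate dropped characters (e.g. 'hypertnsion') but require most of the term
    return [c for c in candidates
            if is_subsequence(token, c) and len(token) >= 0.8 * len(c)]


print(lookup("hypertnsion"))   # typo with a missing 'e' -> ['hypertension']
print(lookup("pneumonia"))     # exact match -> ['pneumonia']
```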
Computer dialogue systems are designed with the intention of supporting meaningful interactions with humans. Common modes of communication include speech, text, and physical gestures. In this work we explore a communication paradigm in which the input and output channels consist of music. Specifically, we examine the musical interaction scenario of call and response. We present a system that utilizes a deep autoencoder to learn semantic embeddings of musical input. The system learns to transform these embeddings in a manner such that reconstructing from the transformed vectors produces appropriate musical responses. To generate a response, the system employs a combination of generation and unit selection. Selection is based on a nearest-neighbor search within the embedding space, and for real-time use the search space is pruned using vector quantization. The live demo consists of a person playing a MIDI keyboard and the computer generating a response that is played through a loudspeaker.
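As a hedged sketch of the selection step described above, the code below transforms a call embedding, prunes the search space with vector quantization (here, k-means) and picks the nearest unit in the selected cluster. The embeddings, the learned transform and the cluster count are random stand-ins and assumptions, not the authors' model.

```python
# Minimal sketch (our assumptions) of nearest-neighbor unit selection with
# vector-quantization pruning over autoencoder-style embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
unit_library = rng.normal(size=(5000, 32))        # embeddings of candidate musical units
call_to_response = rng.normal(size=(32, 32)) / 6  # stand-in for the learned transform

# Offline: vector-quantize the library so only one cluster is searched live
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(unit_library)


def respond(call_embedding: np.ndarray) -> int:
    target = call_embedding @ call_to_response        # transformed call embedding
    cluster = kmeans.predict(target[None, :])[0]      # prune via vector quantization
    members = np.where(kmeans.labels_ == cluster)[0]
    dists = np.linalg.norm(unit_library[members] - target, axis=1)
    return int(members[dists.argmin()])               # index of the unit to play back


call = rng.normal(size=32)   # embedding of the live MIDI input
print("play unit", respond(call))
```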