School of Computing, Queen's University, 535 Goodwin Hall, Kingston, Ontario, Canada K7L 2N8
yuan dot tian at cs dot queensu dot ca
I'm an Assistant Professor in the School of Computing at Queen's University, Kingston, Ontario.
I am looking for master's and Ph.D. students who are excited about data mining and software engineering.
Before joining Queen's, I was a data scientist at the Living Analytics Research Centre (LARC), Singapore Management University (SMU). I received my Ph.D. degree in Information Systems from SMU in May 2017 and a bachelor's degree in Computer Science from Zhejiang University in 2012. I visited Carnegie Mellon University in 2015 and Inria Paris in 2013.
My research focuses on three domains: Data Mining, Software Engineering, and Social Science. My short-term research goal is to help people gain insights from messy data stored in all kinds of software repositories, and to propose context-aware, data-driven approaches that improve the efficiency and capabilities of the various stakeholders in the software development process.
To address the daunting task of finding information about APIs, we constructed a question answering (QA) bot called APIBot. We note that applying well-established general-purpose QA systems to API documentation poses three key challenges: (1) the API QA process needs to consider domain-specific patterns and software-specific terms, (2) much of the semantics of API documentation is hidden in its implicit structure, and (3) general-purpose QA bots require a large amount of manually created training data that is not necessarily available for API documentation. APIBot addresses these challenges by introducing novel components on top of a general-purpose QA system, SiriusQA. An empirical evaluation of APIBot on 92 API questions showed that APIBot can achieve a Hit@5 score of 0.706 (i.e., the correct answer is among the top five answers returned about 70% of the time). This paper appears in the ASE17 proceedings. Access it here.
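For readers unfamiliar with the Hit@5 metric mentioned above, here is a minimal sketch of how a Hit@k score can be computed. It is an illustration only, not APIBot's evaluation code: the function name `hit_at_k`, the data layout, and the toy questions and answers are all assumptions made for this example.

```python
from typing import Sequence


def hit_at_k(ranked_answers: Sequence[Sequence[str]],
             gold_answers: Sequence[str],
             k: int = 5) -> float:
    """Fraction of questions whose gold answer appears in the top-k results.

    ranked_answers[i] is the ranked answer list returned for question i
    (best first); gold_answers[i] is the correct answer for question i.
    Both names are hypothetical and chosen only for this sketch.
    """
    if len(ranked_answers) != len(gold_answers):
        raise ValueError("need exactly one gold answer per question")
    if not gold_answers:
        return 0.0
    hits = sum(
        1
        for answers, gold in zip(ranked_answers, gold_answers)
        if gold in answers[:k]
    )
    return hits / len(gold_answers)


# Toy data (made up): four questions, each with a ranked answer list.
toy_rankings = [
    ["a1", "a2", "a3", "a4", "a5", "a6"],  # gold "a2" is within the top 5
    ["b1", "b2", "b3", "b4", "b5", "b6"],  # gold "b6" is ranked 6th: a miss
    ["c1"],                                # gold "c1" is ranked first
    ["d1", "d2", "d3"],                    # gold "d3" is within the top 5
]
toy_gold = ["a2", "b6", "c1", "d3"]

toy_score = hit_at_k(toy_rankings, toy_gold, k=5)
print(f"Hit@5 on the toy data: {toy_score:.2f}")
```

In this toy run three of the four gold answers fall within the top five, so the score is 0.75. The reported 0.706 is read the same way, over the 92 evaluation questions.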
Software systems are often released with bugs due to system complexity and inadequate testing. To help developers effectively address and manage bugs, bug tracking systems such as Bugzilla and JIRA are adopted to manage the life cycle of a bug through bug reports. Most of the information related to bugs is stored in software repositories, e.g., bug tracking systems, version control repositories, and mailing list archives. These repositories contain a wealth of valuable information, which could be mined to automate the bug management process and thus save developers time and effort. In the past, we have shown how historical bug data can help automate the bug triage process, including duplicate bug report detection (refer to our CSMR12 paper), bug report prioritization (refer to our EMSE15 paper), and bug report assignment (refer to our ICPC16 paper).
Unlike users of traditional media, microblog users tend to focus on the recency and informality of content. Many tweets are more personal and opinionated than traditional news reports. Thus, by analyzing microblogs, one can get real-time information about what people are interested in or how they feel toward a particular topic. To support developers in collecting software engineering related content, we built a microblog observatory that aggregates more than 58,000 Twitter feeds, captures software-related tweets, and computes trends across topics and time points (refer to our MSR12 tool paper). We also applied a recent event detection algorithm to find hot topics related to software engineering on Twitter (refer to our ICSME15 paper). To extract software engineering related content from Twitter, we performed a preliminary study, reported in our MSR12 paper, to investigate the feasibility of automatically classifying microblogs into two categories: relevant and irrelevant to engineering software systems. Following this work, we proposed a novel approach named NIRMAL (refer to our SANER15 paper), which automatically identifies software-relevant tweets from a collection or stream of tweets based on language modelling. More recently, we proposed a new approach to find and rank URLs harvested from Twitter based on their informativeness and relevance to a domain of interest (refer to our SANER17 paper).
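To give a flavour of what identifying relevant tweets "based on language modelling" can look like, here is a minimal sketch. It is a generic illustration and not NIRMAL itself: the unigram model with add-one smoothing, the use of perplexity as a relevance score, the class name `UnigramModel`, and the tiny training corpus are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class UnigramModel:
    """Unigram language model with add-one (Laplace) smoothing."""

    def __init__(self, corpus: Iterable[str]) -> None:
        self.counts = Counter(tok for doc in corpus for tok in tokenize(doc))
        self.total = sum(self.counts.values())
        # Reserve one extra vocabulary slot for unseen tokens.
        self.vocab_size = len(self.counts) + 1

    def perplexity(self, text: str) -> float:
        """Per-token perplexity; lower means closer to the training text."""
        tokens = tokenize(text)
        if not tokens:
            return float("inf")
        log_prob = sum(
            math.log((self.counts[tok] + 1) / (self.total + self.vocab_size))
            for tok in tokens
        )
        return math.exp(-log_prob / len(tokens))


# Hypothetical software-related training text (made up for this sketch).
software_corpus = [
    "fix null pointer exception in java parser",
    "unit test fails after git merge",
]
model = UnigramModel(software_corpus)

tweets = ["lovely sunny beach today", "java unit test fails"]
# Rank tweets from most to least software-like (lowest perplexity first).
ranked_tweets = sorted(tweets, key=model.perplexity)
print(ranked_tweets)
```

The design idea is that a model trained on software-related text assigns higher probability, and therefore lower perplexity, to tweets that use similar vocabulary, so sorting by perplexity yields a relevance ranking without requiring manually labelled tweets.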