Yuan Tian


About Me

I'm Yuan Tian, an Assistant Professor in the School of Computing at Queen's University. Before joining Queen's, I was a data scientist at the Living Analytics Research Centre (LARC), Singapore Management University (SMU). I received my Ph.D. in Information Systems from SMU in May 2017 and a bachelor's degree in Computer Science from Chu Kochen Honors College of Zhejiang University in 2012. I visited Carnegie Mellon University in 2015 and Inria Paris in 2013.


My research focuses on three domains: Data Mining, Software Engineering, and Social Science. My long-term research goal is to help people gain insights from messy data stored in all kinds of software repositories, and to propose context-aware, data-driven approaches that improve the efficiency and capabilities of the various stakeholders in the software development process.

I am looking for self-motivated master's and Ph.D. students who are excited about data mining and software engineering.

Current Research Focus

Software is eating the world: almost every aspect of modern life depends on the reliable operation of high-quality software from companies such as Google, Uber, Amazon, and Facebook. Consequently, the number of software projects is growing rapidly, and the amount of available code has become massive: so-called “big code”, with billions of lines of code.


New intelligent methods are constantly sought to reduce the complexity of software, help engineers understand code, and construct high-quality software more efficiently.


Intelligent Code Reuse - code search, code summarization, data science code mining (95%)
Intelligent API Documentation - REST API documentation, Document2QAbot (85%)
Explainable and Robust Recommender Systems - SE bots (90%)
Sentiment and Emotion Analysis in Software Engineering - reactions, emojis (60%)
Evolution of Software Ecosystem - human factors, collaborations, developer onboarding process (50%)

Selected Research Topics

  • APIBot

    To address the daunting task of finding information about APIs, we constructed a question answering (QA) bot called APIBot on top of SiriusQA, a general-purpose QA system. An empirical evaluation of APIBot on 92 API questions showed that APIBot can achieve a Hit@5 score of 0.706 (i.e., the correct answer is among the top five answers returned about 70% of the time). Refer to our ASE17 paper for the details.

  • Automated Bug Triage

    Software systems are often released with bugs due to system complexity and inadequate testing. We have shown how historical bug data in bug tracking systems can help automate the bug triage process, including duplicate bug report detection (refer to our CSMR12 paper), bug report prioritization (refer to our EMSE15 paper), and bug report assignment (refer to our ICPC16 paper).

  • Mining Software Engineering Trends on Twitter

    By analyzing microblogs, one can get real-time information about what people are interested in or how they feel about a particular topic. To support developers in collecting software engineering related content, we built a microblog observatory that aggregates more than 58,000 Twitter feeds, captures software-related tweets, and computes trends across topics and time points (refer to our MSR12 tool paper). We also applied a recent event detection algorithm to find hot topics related to software engineering on Twitter (refer to our ICSME15 paper).

  • Identifying Software Experts on Twitter

    We proposed a recommendation system to identify specialized software gurus. We investigated the effectiveness of our approach in finding specialized software gurus for four different domains (JavaScript, Android, Python, and Linux) on a dataset of 86,824 Twitter users who generated 5,517,878 tweets over one month. Our approach can differentiate specialized software experts from other domain-related Twitter users with an F-measure of up to 0.820.

  • On the Unreliability of Bug Severity Data

    Looking at duplicate bug reports (i.e., reports referring to the same problem) from three open-source software systems (OpenOffice, Mozilla, and Eclipse), we found that around 51% of the duplicate bug reports have inconsistent human-assigned severity labels even though they refer to the same software problem. Although our results directly show only that duplicate bug reports have unreliable severity labels, we believe that they send warning signals about the reliability of the full bug severity data (refer to our EMSE17 paper).
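
As an illustration of the Hit@5 metric reported for APIBot above, the following is a minimal sketch of how a Hit@k score could be computed. It is not APIBot's evaluation code: the function name `hit_at_k`, the data layout (one ranked answer list and one correct answer per question), and the toy answer identifiers are all assumptions made for this example.

```python
from typing import Sequence


def hit_at_k(ranked_answers: Sequence[Sequence[str]],
             correct_answers: Sequence[str],
             k: int = 5) -> float:
    """Fraction of questions whose correct answer is among the top-k returned answers.

    Hypothetical helper for illustration; assumes exactly one correct
    answer per question and answers identified by plain strings.
    """
    if len(ranked_answers) != len(correct_answers):
        raise ValueError("need one ranked list per question")
    if not correct_answers:
        return 0.0
    hits = sum(
        1
        for answers, gold in zip(ranked_answers, correct_answers)
        if gold in answers[:k]  # only the first k answers count
    )
    return hits / len(correct_answers)


# Toy data (made up): three questions with ranked answer IDs.
ranked = [
    ["a1", "a2", "a3"],                    # correct answer at rank 2 -> hit
    ["b9", "b2"],                          # correct answer never returned -> miss
    ["c4", "c5", "c6", "c7", "c8", "c1"],  # correct answer at rank 6 -> miss for k=5
]
gold = ["a2", "b1", "c1"]

score = hit_at_k(ranked, gold, k=5)
print(f"Hit@5 = {score:.3f}")
```

Under this definition, a Hit@5 of 0.706 on 92 questions corresponds to roughly 65 questions with the correct answer in the top five.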

Latest News

Check out our latest news
  • Nov 2019 Paper Accepted!

    In this paper, we propose PatchNet, a hierarchical deep learning-based approach capable of automatically extracting features from commit messages and commit code and using them to identify stable patches. PatchNet contains a deep hierarchical structure that mirrors the hierarchical and sequential structure of commit code, which distinguishes it from existing deep learning models of source code. Experiments on 82,403 recent Linux patches confirm the superiority of PatchNet over various state-of-the-art baselines, including one recently adopted by Linux kernel maintainers.

  • April 2019 Grant Awarded!

    Awarded a Discovery grant ($33,000/year for 5 years) and a Discovery Launch Supplement grant of $12,500. Thanks to NSERC/CRSNG for supporting our research! The title of the accepted proposal is "Reliable and Explainable Recommender Systems for Efficient Software Development".

  • Queen’s, as one of the top universities in Canada, has a long history of discovery and innovation that has shaped our knowledge and helped to address some of the world’s deepest mysteries and most pressing questions. For more than 175 years, Queen’s has brought together and built synergies among leading researchers, scholars, and innovators, making a real and measured impact.

Teaching

Check out recent courses offered by me
  • Fall and Winter 2019 CISC 235

    CISC 235 Data Structures introduces the design and implementation of advanced data structures and related algorithms, including correctness and complexity analysis. Efficient implementation of lists, sets, dictionaries, priority queues, trees, graphs, and networks using arrays, hash tables, heaps, and hierarchical linked structures. String and graph problems, such as string matching and shortest path. External storage and input-output complexity.

  • Fall 2019 CISC 880

    CISC 880 Mining Software Repositories: Software engineering data (such as code bases, execution traces, historical code changes, mailing lists, and bug databases) contain a wealth of information about a project's status and history. This course introduces state-of-the-art data mining techniques (including deep learning) that can be applied to analyze large software data in order to understand software development practices and to use software data for intelligent software development.

  • Winter 2019 CISC 351

    CISC 351 Advanced Data Analysis introduces the design and implementation of complex analytics techniques; predictive algorithms at scale; deep learning; analytics on the Web; social network analysis; text analysis; recommender systems; and applications in specialized domains.

Contact Me

I am looking for self-motivated master's and Ph.D. students who are excited about data mining and software engineering.