The Use of Text Retrieval and Natural Language Processing in Software Engineering

During software evolution, many related artifacts are created or modified. Some of these are composed of structured data (e.g., analysis data), some contain semi-structured information (e.g., source code), and many include unstructured information (e.g., natural language text). Software artifacts written in natural language (e.g., requirements and bug reports), together with the comments and identifiers in the source code, encode to a large degree the domain of the software and the developers’ knowledge about the system; they also capture design decisions, developer information, etc. Retrieving and analyzing the textual information in software is therefore extremely important for supporting a variety of Software Engineering (SE) tasks.

Text Retrieval (TR) is a branch of Information Retrieval (IR) that leverages information stored primarily in the form of text. TR methods have proven suitable for retrieving and analyzing textual data embedded in software or present in other sources. TR techniques treat text as a bag of words, ignoring word order and grammar. Thus, they are often used in conjunction with Natural Language Processing (NLP) tools that analyze, for example, the structure of sentences and the meaning of words.
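To make the bag-of-words idea concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not one of the tools covered in the course: the function names, the tokenization rule, and the three sample texts are all assumptions made up for this example. Each text is reduced to term counts, and a bug report is compared with two code comments using cosine similarity.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on non-letters, and count each term."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical artifacts: a bug report and two source code comments.
bug_report = "Crash when saving a file with an empty name"
comment_save = "Save the file to disk; the name must not be empty"
comment_draw = "Draw the toolbar icons"

query = bag_of_words(bug_report)
scores = {
    "save": cosine_similarity(query, bag_of_words(comment_save)),
    "draw": cosine_similarity(query, bag_of_words(comment_draw)),
}
```

The comment about saving shares the terms "file", "empty", and "name" with the bug report, so it scores higher than the unrelated comment. Note that "saving" and "save" do not match here; this is the kind of gap that preprocessing steps such as stemming are meant to close.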

The course will start by introducing the background on TR and NLP techniques and tools. Next, we will review and discuss the application of TR and NLP to different SE tasks. The course focuses on research articles; no textbook is required.

Course Name: The Use of Text Retrieval and Natural Language Processing in Software Engineering
Course Number: Cpt S 580
Credits: 3
Semester: Spring 2016
Prerequisites: Graduate standing.
Course required/elective: elective.

Schedule: Tu Thu 2:50pm – 4:05pm
Location: Sloan Hall 7
Course webpage: http://www.veneraarnaoudova.ca/cpt-s-580-tr-and-nlp-in-se/
Professors/Coordinators: Venera Arnaoudova.
Office: EME 127.
Office hours: By e-mail appointment.

Resources:

 Recommended textbook(s):

[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
[2] Dan Jurafsky and James H. Martin, Speech and Language Processing, 2nd Ed., 2007.

Journal/conference papers:

[3] S. Keshav. 2007. How to read a paper. SIGCOMM Comput. Commun. Rev. 37, 3 (July 2007), 83-84.
[4] Mary Shaw. 2003. Writing good software engineering research papers: minitutorial. In Proceedings of the 25th International Conference on Software Engineering (ICSE ’03). IEEE Computer Society, 726-736.
[5] B.A. Kitchenham, S. Charters, Guidelines for Performing Systematic Literature Reviews in Software Engineering. Technical Report EBSE-2007-01, 2007.
[6] S. L. Abebe, S. Haiduc, P. Tonella, and A. Marcus. Lexicon bad smells in software. In Proceedings of the Working Conference on Reverse Engineering (WCRE), pages 95–99, 2009.
[7] V. Arnaoudova, M. Di Penta, and G. Antoniol. “Linguistic antipatterns: What they are and how developers perceive them,” Empirical Software Engineering (EMSE), pages 1–55, 2015.
[8] S. L. Abebe, P. Tonella, “Towards the Extraction of Domain Concepts from the Identifiers” in 18th Working Conference on Reverse Engineering (WCRE), 2011, pp. 77-86.
[9] Matthew J. Howard, Samir Gupta, Lori Pollock, and K. Vijay-Shanker. Automatically mining software-based, semantically-similar words from comment-code mappings. In Proceedings of the Working Conference on Mining Software Repositories (MSR), pages 377-386, 2013.
[10] Haiduc, S.; Aponte, J.; Moreno, L.; Marcus, A., “On the Use of Automated Text Summarization Techniques for Summarizing Source Code,” in Proceedings of the Working Conference on Reverse Engineering (WCRE), pp.35-44, 2010.
[11] L. Moreno, G. Bavota, M. Di Penta, R. Oliveto, A. Marcus, and G. Canfora. “Automatic generation of release notes”. In Proceedings of the ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), 2014, pp. 484–495.
[12] Linares-Vasquez, M.; Hossen, K.; Dang, H.; Kagdi, H.; Gethers, M.; Poshyvanyk, D. “Triaging Incoming Change Requests: Bug or Commit History, or Code Authorship?” in 28th IEEE International Conference on Software Maintenance (ICSM), 2012, pp. 451-460.
[13] Runeson, P., Alexandersson, M., and Nyholm, O., “Detection of Duplicate Defect Reports Using Natural Language Processing”, in Proceedings of the International Conference on Software Engineering (ICSE), 2007, pp. 499-510.
[14] Poshyvanyk, D., Gueheneuc, Y.-G., Marcus, A., Antoniol, G., and Rajlich, V., “Feature Location using Probabilistic Ranking of Methods based on Execution Scenarios and Information Retrieval”, IEEE Transactions on Software Engineering, vol. 33, no. 6, June 2007.
[15] Moreno, L.; Treadway, J. J.; Marcus, A. & Shen, W. “On the Use of Stack Traces to Improve Text Retrieval-Based Bug Localization” in IEEE International Conference on Software Maintenance and Evolution, IEEE, 2014, pp. 151-160.
[16] A. Marcus and J. Maletic, “Recovering documentation-to-source-code traceability links using latent semantic indexing,” in Proceedings of the 25th International Conference on Software Engineering, 2003. IEEE, May 2003, pp. 125–135.
[17] Panichella, A. and McMillan, C. and Moritz, E. and Palmieri, D. and Oliveto, R. and Poshyvanyk, D. and De Lucia, A., “When and How Using Structural Information to Improve IR-Based Traceability Recovery”, in Proceedings of the European Conference on Software Maintenance and Reengineering (CSMR), 2013, pp. 199-208.
[18] Tian, K., Revelle, M., Poshyvanyk, D., “Using Latent Dirichlet Allocation for Automatic Categorization of Software”, in Proceedings of the Working Conference on Mining Software Repositories (MSR), 2009, pp. 163-166.
[19] Alessandra Gorla, Ilaria Tavecchia, Florian Gross, and Andreas Zeller. Checking app behavior against app descriptions. In Proceedings of the International Conference on Software Engineering (ICSE), 2014, pp. 1025-1035.
[20] Antoniol, G., Hayes, J. H., Gueheneuc, Y.-G., and Di Penta, M., “Reuse or rewrite: Combining textual, static, and dynamic analyses to assess the cost of keeping a system up-to-date”, in Proceedings of the International Conference on Software Maintenance (ICSM), 2008, pp. 147-156.
[21] McMillan, C., Grechanik, M., Poshyvanyk, D., Fu, C., and Xie, Q., “Exemplar: A Source Code Search Engine For Finding Highly Relevant Applications”, IEEE Transactions on Software Engineering (TSE), 2012, 38, pp. 1069–1087.
[22] Poshyvanyk, D., Marcus, A., Ferenc, R., Gyimóthy, T. “Using Information Retrieval based Coupling Measures for Impact Analysis”, Empirical Software Engineering, Vol. 14, No. 1, February 2009, pp. 5-32.
[23] Gethers, M., Kagdi, H., Dit, B., and Poshyvanyk, D., “An Adaptive Approach to Impact Analysis from Change Requests to Source Code”, in Proceedings of the International Conference on Automated Software Engineering (ASE), 2011, pp. 540-543.

Course description: Application of text retrieval and natural language processing techniques and tools to solve software engineering tasks.

Overview and Course Goals: This course provides basic background on Text Retrieval (TR) and Natural Language Processing (NLP) techniques and tools. It then reviews the application of TR and NLP for different Software Engineering tasks.

Course topics:

– Text Retrieval
– Natural Language Processing
– Software engineering tasks, e.g.,

  • Refactoring
  • Reverse engineering
  • (Re)documentation
  • Concept Location
  • Traceability
  • Software Categorization

Learning outcomes and evaluation:

Students who successfully complete the course will:

  1. Be able to summarize, critique, and present papers applying TR and NLP techniques in Software Engineering.
  2. Be able to perform a systematic literature survey.
  3. Be able to apply TR and NLP techniques to solve a research problem in a different domain.
  4. Improve their ability to critique and present a research paper.
  5. Write a research paper with clear motivation, comparison with existing work, methodology, etc.

Week-by-week schedule[1]:

[1] All submissions are due on the specified date by midnight Pacific Time.

| Week | Date | Topics | Resources | Deadlines |
| 1 | 01/12 | Syllabus. How to write, read, and present a research paper. Literature survey. | [3], [4], [5] | |
| | 01/14 | Introduction to the use of TR and NLP in Software Engineering. | | |
| 2 | 01/19 | Background on TR methods. | Ch. 1, 6, 18 [1] | |
| | 01/21 | Common TR tools. | | |
| 3 | 01/26 | Background on NLP methods. | Ch. 4, 5, 12, 19 [2] | |
| | 01/28 | Common NLP tools. | | |
| 4 | 02/02 | Preprocessing. | Ch. 2 [1] | Written project proposal due by February 1st. |
| | 02/04 | Project proposal presentations. | | Students presenting next week must send details on the paper by February 4th. |
| 5 | 02/09 | Refactoring. Identifying poor quality identifiers and naming inconsistencies. | [6], [7] | Groups must be formed and a topic selected by February 9th. |
| | 02/11 | Student presentations. | TBA | Students presenting next week must send details on the paper by February 11th. |
| 6 | 02/16 | Reverse Engineering. Building software ontologies. Identifying semantic relations between words. | [8], [9] | |
| | 02/18 | Student presentations. | TBA | Students presenting next week must send details on the paper by February 18th. |
| 7 | 02/23 | (Re)documentation. Extracting a set of important keywords. Generating natural language sentences. | [10], [11] | |
| | 02/25 | Student presentations. | TBA | Students presenting next week must send details on the paper by February 25th. |
| 8 | 03/01 | Bug triage and bug report analysis. | [12], [13] | |
| | 03/03 | Student presentations. | TBA | |
| 9 | 03/08 | No class: Students will work on a systematic literature survey during this week. | [5] | |
| | 03/10 | | | |
| 10 | 03/15 | No class: WSU Spring vacation. | | |
| | 03/17 | | | Students presenting next week must send details on the paper by March 17th. |
| 11 | 03/22 | Concept location. | [14], [15] | Literature survey is due by March 21st. |
| | 03/24 | Student presentations. | TBA | Students presenting next week must send details on the paper by March 24th. |
| 12 | 03/29 | Traceability link recovery. | [16], [17] | |
| | 03/31 | Student presentations. | TBA | Students presenting next week must send details on the paper by March 31st. |
| 13 | 04/05 | Software categorization. | [18], [19] | |
| | 04/07 | Student presentations. | TBA | Students presenting next week must send details on the paper by April 7th. |
| 14 | 04/12 | Software reuse. | [20], [21] | |
| | 04/14 | Student presentations. | TBA | Students presenting next week must send details on the paper by April 14th. |
| 15 | 04/19 | Change impact analysis. | [22], [23] | |
| | 04/21 | Student presentations. | TBA | |
| 16 | 04/26 | Project presentations. | | |
| | 04/28 | Project presentations. | | |

 

Grading framework: Course grades are based on a research project totaling 60% of the final grade, paper presentations totaling 15% of the final grade, a literature survey totaling 15% of the final grade, and participation in class totaling 10% of the final grade.

The research project consists of applying a TR/NLP technique to solve a problem of your choice. This is an individual project that accounts for 60% of the final grade and will be evaluated as follows:
– 5% written project proposal. The submission must use the IEEE template for conference proceedings and must include at minimum an abstract, introduction, related work, and methodology sections.
– 5% proposal presentation (10 min including questions).
– 40% final project submission. The submission must include a complete paper (a continuation of the project proposal) of 10 pages including references, the source code of the implementation, the data used to evaluate the approach, documentation explaining the artifacts, and a user manual.
– 10% final project presentation.

Paper presentations account for 15% of the final grade.

The systematic literature survey accounts for 15% of the final grade. This is group work (3-5 students per team). Students will select a software engineering task that is not discussed in class and will perform a systematic literature survey on the application of TR and NLP techniques to the selected task. The submission must include a half-page summary of how the work was distributed among team members and the role of each team member.
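As an informal illustration of how the components above combine, the following Python sketch computes a weighted total. The weights come from this syllabus; the component names, the function name, and the convention of expressing each component as a fraction between 0 and 1 are assumptions made for the example. It does not include the attendance bonus.

```python
# Weights in percent of the final grade, as listed in the grading framework.
WEIGHTS = {
    "project_proposal": 5,
    "proposal_presentation": 5,
    "final_project_submission": 40,
    "final_project_presentation": 10,
    "paper_presentations": 15,
    "literature_survey": 15,
    "participation": 10,
}


def final_score(component_scores):
    """Return the weighted total on a 0-100 scale.

    component_scores maps every component name to the fraction of that
    component earned (0.0 to 1.0). This convention is an assumption.
    """
    missing = WEIGHTS.keys() - component_scores.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)
```

For example, a student who earns full marks on everything except the final project submission, where they earn half, would receive 100 - 20 = 80 under this sketch.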

Final grades will be awarded on the following scale:
Interval            Grade
[90,100]          A
[87,90)            A‐
[83,87)            B+
[80,83)            B
[77,80)            B‐
[73,77)            C+
[70,73)            C
[67,70)            C‐
[63,67)            D+
[60,63)            D
[0,60)             F
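
The intervals above are closed on the left and open on the right, except for the top interval, which includes 100. A minimal Python sketch of the lookup, with a hypothetical function name, makes the boundary cases explicit:

```python
# Lower bound of each interval, highest first, taken from the scale above.
GRADE_SCALE = [
    (90, "A"),
    (87, "A-"),
    (83, "B+"),
    (80, "B"),
    (77, "B-"),
    (73, "C+"),
    (70, "C"),
    (67, "C-"),
    (63, "D+"),
    (60, "D"),
]


def letter_grade(score):
    """Map a final score in [0, 100] to its letter grade."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, grade in GRADE_SCALE:
        if score >= lower_bound:
            return grade
    return "F"
```

Under this reading, a score of exactly 90 is an A, while 89.9 is an A-.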

Course rules:

Unless posted otherwise, assignment documents shall be submitted electronically.

The late penalty is a flat 10% deduction per day. Late assignments may be turned in up to one week after the original due date, and advance notice of the late submission must be given to the instructor. No homework will be accepted after its due date without advance notice or special permission from the instructor.
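
One possible reading of the late policy is sketched below in Python. It assumes that the 10% is deducted as percentage points on a 0-100 assignment scale, that days are counted as whole numbers, and that a submission without notice or permission, or more than seven days late, receives no credit. These are assumptions for illustration only; the instructor's interpretation governs.

```python
def late_adjusted_score(score, days_late, has_notice_or_permission):
    """Apply a flat 10-point deduction per day late (assumed reading).

    score is on a 0-100 scale; days_late is a whole number of days.
    """
    if days_late < 0:
        raise ValueError("days_late cannot be negative")
    if days_late == 0:
        return score
    # Late work needs advance notice or permission, and at most one week.
    if not has_notice_or_permission or days_late > 7:
        return 0.0
    return max(0.0, score - 10.0 * days_late)
```

For example, an assignment scored at 90 and submitted two days late with advance notice would count as 70 under these assumptions.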

Bonus points will be added to your total class score for attendance as follows: 0 absences = 5% of the final grade, 1 absence = 4%, 2 absences = 3%, and 3 or more absences = 0% bonus.

Reasonable Accommodation:

Reasonable accommodations are available for students with a documented disability. If you have a disability and need accommodations to fully participate in this class, please either visit or call the Access Center (Washington Building 217; 509-335-3417) to schedule an appointment with an Access Advisor. All accommodations MUST be approved through the Access Center.

Academic Integrity:

I encourage you to work with classmates on assignments. However, each student must turn in original work. No copying will be accepted. Students who violate WSU’s Standards of Conduct for Students will receive an F as a final grade in this course, will not have the option to withdraw from the course, and will be reported to the Office of Student Conduct. Cheating is defined in the Standards of Conduct for Students, WAC 504-26-010 (3). It is strongly suggested that you read and understand these definitions. (Read more: http://apps.leg.wa.gov/wac/default.aspx?cite=504-26-010)

Safety:

Washington State University is committed to maintaining a safe environment for its faculty, staff, and students. Safety is the responsibility of every member of the campus community and individuals should know the appropriate actions to take when an emergency arises. In support of our commitment to the safety of the campus community the University has developed a Campus Safety Plan, http://safetyplan.wsu.edu. It is highly recommended that you visit this web site as well as the University emergency management web site at http://oem.wsu.edu/ to become familiar with the information provided.