Course Title: Data Preprocessing

Part A: Course Overview

Credit Points: 12.00


Course Code:

Campus: City Campus

School: 171H School of Science

Learning Mode:

Teaching Period(s): Sem 1 2018, Sem 2 2018, Sem 1 2019

Course Coordinator: Dr. Anil Dolgun

Course Coordinator Phone: +61 3 9925 2526

Course Coordinator Email:

Course Coordinator Location: 8.9.23

Course Coordinator Availability: By appointment

Pre-requisite Courses and Assumed Knowledge and Capabilities

A working knowledge of basic mathematics and familiarity with computers.

Course Description

Real-world data are commonly incomplete, noisy, and inconsistent. This course covers a wide range of topics designed to equip you with the skills needed to prepare all forms of untidy data for statistical analysis. It covers the core concepts of data preprocessing, namely tidy data, data integration, data cleaning, data transformation, data standardisation, data discretisation, and data reduction. You will develop and apply your data preprocessing skills to complex, noisy, and inconsistent real-world data using leading open source software.
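As a rough illustration of two of the concepts named above, data standardisation and data discretisation, the following minimal sketch may help. It assumes Python and its standard library, which this guide does not prescribe; the data values and the function names `standardise` and `discretise` are invented for illustration, and the sketch assumes the values are not all identical.

```python
from statistics import mean, stdev

# Hypothetical measurements, invented for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]


def standardise(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    centre = mean(values)
    spread = stdev(values)
    return [(v - centre) / spread for v in values]


def discretise(values, n_bins):
    """Assign each value to one of n_bins equal-width bins, numbered from 0."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    # The maximum sits on the upper edge, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


z_scores = standardise(heights_cm)       # centred on 0 with unit spread
bins = discretise(heights_cm, n_bins=2)  # bin index per value
```

Standardisation puts variables measured on different scales onto a common one, while discretisation trades numeric detail for a small number of categories.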

Objectives/Learning Outcomes/Capability Development

This course contributes to the following Program Learning Outcomes for MC004 Master of Statistics and Operations Research and MC242 Master of Analytics:


Personal and professional awareness

  • the ability to contextualise outputs where data are drawn from diverse and evolving social, political and cultural dimensions
  • the ability to reflect on experience and improve your own future practice
  • the ability to apply the principles of lifelong learning to any new challenge.

Knowledge and technical competence

  • an understanding of appropriate and relevant, fundamental and applied mathematical and statistical knowledge, methodologies and modern computational tools
  • the ability to bring together and flexibly apply knowledge to characterise, analyse and solve a wide range of problems
  • an understanding of the balance between the complexity / accuracy of the mathematical / statistical models used and the timeliness of the delivery of the solution.

Information literacy

  • the ability to locate and use data and information and evaluate its quality with respect to its authority and relevance.

On completion of this course you should be able to:

  1. Critically reflect upon different data sources, types, formats and structures.
  2. Apply data integration techniques to import and combine different sources of data.
  3. Apply different data manipulation techniques to recode, filter, select, split, aggregate, and reshape the data into a format suitable for statistical analysis.
  4. Justify data by detecting and handling missing values, outliers, inconsistencies and errors.
  5. Demonstrate practical experience by having been exposed to real data problems.
  6. Effectively use leading open source software for reproducible, automated data preprocessing.
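
To make outcomes 3 and 4 more concrete, the sketch below reshapes a small wide-format table into long format, then flags missing values and outliers. It is only one possible approach under stated assumptions: Python's standard library is used here although the guide does not name the course software, the records are invented, the helper names `to_long` and `iqr_outliers` are hypothetical, and the 1.5 × IQR rule (Tukey's fences) is one common outlier convention among several.

```python
from statistics import quantiles

# Hypothetical wide-format records, invented for illustration: one row per
# student, one column per test. None marks a missing value.
wide = [
    {"student": "A", "test1": 62, "test2": 65},
    {"student": "B", "test1": 58, "test2": None},
    {"student": "C", "test1": 60, "test2": 64},
    {"student": "D", "test1": 61, "test2": 150},  # possibly a data-entry error
]


def to_long(rows, id_key):
    """Reshape wide rows into one (id, variable, value) record per observation."""
    return [
        {id_key: row[id_key], "variable": key, "value": value}
        for row in rows
        for key, value in row.items()
        if key != id_key
    ]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


long_rows = to_long(wide, "student")
missing = [r for r in long_rows if r["value"] is None]
observed = [r["value"] for r in long_rows if r["value"] is not None]
outliers = iqr_outliers(observed)
```

Flagging is only the detection step. Whether a flagged value is corrected, imputed, or excluded depends on the data and the analysis, which is why the sketch reports candidates rather than altering them.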

Overview of Learning Activities

Course learning activities take place both online and face-to-face. Online course notes and materials replace traditional lectures and labs. Face-to-face class time is mainly used for hands-on demonstrations of concepts and software, and for group work on module exercises and problems. You will develop your data preprocessing skills by completing regular module exercises and assignments that consolidate your learning and prepare you for the final exam.


Total study hours

You will undertake three hours per week of face-to-face learning in class. In addition to the weekly classes, you are expected to spend approximately six hours per week on activities related to this course. These activities include reading and working through the online course material, completing module exercises and assignments, and preparing for assessments.

Overview of Learning Resources

There are no prescribed texts for this course. All course content, notes, learning materials and data sets will be available through the course website and Canvas LMS. A list of recommended textbooks for this course will also be provided.


You are strongly encouraged to bring a portable computing device to class, preferably a laptop, with Wi-Fi access to the RMIT University network. You will also need the open source software used in the course installed on your personal computing device.

Overview of Assessment

This course has no hurdle requirements.


Assessment Task 1: Module Exercises

Mini exercises aligned to each course module

Weighting 10%

This assessment task supports CLOs 1, 2, 3, 4, and 5.

Assessment Task 2: Assignments

Assignments staggered throughout the semester

Weighting 40%

This assessment task supports CLOs 1, 2, 3, 4, 5 and 6.


Assessment Task 3: Final Examination

A two-hour final examination during the exam period

Weighting 50%

This assessment task supports CLOs 1, 2, 3, 4, and 5.