This is the first course in the Data Science Major. The course has two components. The first component builds the fundamentals a learner needs to work with data and become a data scientist. The learner is first introduced to the discipline of Data Science, which is the study of gaining insights into data through computation, statistics and visualization. Proper experimental design for a data science project and the concept of big data are also covered in this component, to establish firm ground on which the rest of the Major is built.
The second component of the course focuses in detail on the first two steps of the data science process once appropriate question(s) have been formed. These steps are data acquisition (finding or generating the data, and preparing it) and exploratory data analysis (exploring the data through statistics and visualizations). The data acquisition part first discusses the theoretical and practical aspects of obtaining data from heterogeneous sources such as APIs, databases, the web and data repositories, which may use different file formats. It then covers methods for cleaning, organizing, merging and managing data, which are required for effective downstream analysis. The exploratory data analysis part covers tools and techniques for summarizing data. These techniques are typically applied before more formal data analysis begins and can inform the development of more complex statistical analyses. Exploratory techniques are also important for eliminating or sharpening the hypotheses generated to answer the question(s) in a data science project. This part also discusses data visualization using graphs, along with common multivariate statistical techniques, such as clustering and dimensionality reduction, that can be used to visualize high-dimensional data.
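As a rough illustration of the acquisition, cleaning, merging and summarizing steps described above, the following is a minimal sketch using only the Python standard library. The two data sources, the column names (`id`, `height_cm`, `group`) and all values are hypothetical and exist only for this example; the course itself does not prescribe these tools.

```python
import csv
import io
import statistics

# Hypothetical raw data from two heterogeneous sources:
# a CSV export and an API-style list of records.
RAW_CSV = """id,height_cm
1,170
2,
3,182
4,abc
"""
API_RECORDS = [
    {"id": 1, "group": "a"},
    {"id": 2, "group": "b"},
    {"id": 3, "group": "a"},
]


def load_heights(text):
    """Parse the CSV and drop rows whose height is missing or not numeric."""
    cleaned = {}
    for row in csv.DictReader(io.StringIO(text)):
        try:
            cleaned[int(row["id"])] = float(row["height_cm"])
        except (TypeError, ValueError):
            continue  # cleaning step: skip unusable rows
    return cleaned


def merge(heights, records):
    """Inner-join the two sources on the shared id field."""
    return [
        {"id": r["id"], "group": r["group"], "height_cm": heights[r["id"]]}
        for r in records
        if r["id"] in heights
    ]


def summarize(rows):
    """Compute the count and mean height for each group."""
    by_group = {}
    for row in rows:
        by_group.setdefault(row["group"], []).append(row["height_cm"])
    return {g: {"n": len(v), "mean": statistics.mean(v)} for g, v in by_group.items()}


merged = merge(load_heights(RAW_CSV), API_RECORDS)
summary = summarize(merged)
print(summary)
```

Rows with a missing or non-numeric height are dropped before the join, so only records present and valid in both sources reach the summary.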
This is the second course in the Data Science Major. In this course the learner is introduced to statistical tools and techniques for analyzing data. These tools can be used to draw sound conclusions or inferences about populations or scientific truths from data, which in turn help answer the question(s) in a data science project. The tools and techniques covered in this course are probability, random variables and expected values, variability, distributions, limits, confidence intervals, hypothesis testing, p-values, power, bootstrapping and permutation tests.
As many of these fundamental techniques have already been covered in STA 101 Probability & Statistics in Software Engineering, the focus here is on applying them in a data science project setting. That is, the course covers not only how these techniques work but, more importantly, when, how and why each of them should be applied in a data science project.
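To make two of the listed techniques concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a mean and a two-sided permutation test for a difference in means. The sample values, the function names and the defaults (2000 resamples, a fixed seed) are assumptions chosen for illustration, not part of the course material.

```python
import random
import statistics


def bootstrap_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of sample."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[: len(a)]) - statistics.mean(pooled[len(a):])
        )
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.7]
group_b = [4.2, 4.8, 4.5, 4.1, 4.9, 4.4, 4.6, 4.3]

ci_low, ci_high = bootstrap_ci(group_a)
p_value = permutation_p_value(group_a, group_b)
print(f"95% bootstrap CI for mean of group_a: ({ci_low:.2f}, {ci_high:.2f})")
print(f"permutation test p-value: {p_value:.4f}")
```

Both procedures rely on resampling rather than a distributional formula, which is one reason they are useful when the assumptions behind a classical test are in doubt.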
This is the third course in the Data Science Major. The focus of this course is applying machine learning tools and techniques to predictive data analysis in order to help answer the question(s) in a data science project. The course assumes and builds on the basic components of building and applying prediction functions that are covered in SE 443 Machine Learning, including feature creation, algorithms, evaluation or model validation, training and test sets, overfitting, and error rates.
The algorithms and machine learning methods covered in this course are multivariate linear regression, logistic regression, k-nearest neighbors (KNN), support vector machines (SVM), decision trees, random forests, mean-shift clustering, density-based spatial clustering of applications with noise (DBSCAN), expectation–maximization (EM) clustering using Gaussian mixture models (GMM), agglomerative hierarchical clustering and non-linear dimensionality reduction. Emphasis is placed not only on how these algorithms and methods function but also on when, how and why each of them should be applied in a data science project.
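As one small example from the list above, the sketch below implements a k-nearest neighbors classifier and evaluates it with a training/test split and a test error rate. The synthetic two-class data, the 70/30 split and the helper names are assumptions made for this illustration only.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def make_blobs(n_per_class=50, seed=0):
    """Generate two hypothetical, well-separated Gaussian clusters in 2D."""
    rng = random.Random(seed)
    data = []
    for label, (cx, cy) in {"a": (0.0, 0.0), "b": (4.0, 4.0)}.items():
        for _ in range(n_per_class):
            point = (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            data.append((point, label))
    rng.shuffle(data)
    return data


data = make_blobs()

# Hold out 30% of the data so the error rate is measured on unseen points.
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

errors = sum(knn_predict(train, x) != y for x, y in test)
error_rate = errors / len(test)
print(f"test error rate: {error_rate:.2f}")
```

Measuring error on held-out data rather than on the training set is what guards against an overly optimistic estimate caused by overfitting.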
This is the fourth course in the Data Science Major. It has two components. The first component focuses on naive Bayes classifiers, deep learning based classifiers and ensemble methods that can be used for predictive data analysis in order to help answer the question(s) in a data science project. The deep learning based classifiers covered in the course are convolutional neural networks (CNN), recurrent neural networks (RNN) and long short-term memory (LSTM) networks. Emphasis is placed not only on how these algorithms and methods function but also on when, how and why each of them should be applied in a data science project. As with DS103 Machine Learning driven Data Analysis I, this course assumes and builds on the basic components of building and applying prediction functions that are covered in SE 443 Machine Learning, including feature creation, algorithms, evaluation or model validation, training and test sets, overfitting, and error rates.
The second component of this course focuses on communicating the insights gained from a successful data analysis phase to an audience. It covers tools and techniques for presenting the outcomes of a data science project effectively. For a general audience, these include storytelling, interactive visualizations and presentations. For engineers building data-driven software or hardware products, they include library support along with documentation.
The Data Science Major Capstone Project allows learners to apply the knowledge acquired in the four courses of the Major by completing a data science project that addresses a real-world problem, preferably in collaboration with an industry or government organization. The project serves as evidence of the learner's skills and knowledge in the data science domain for potential future employers. It substitutes for the course SE 422 Final Year Thesis/Project/Internship and is conducted over two semesters: L4 T2 and L4 T3. By the end of L4 T2, a learner is expected to select a suitable real-world problem with the guidance of supervisor(s), define appropriate questions that will steer the rest of the project, acquire relevant data, and carry out exploratory data analysis and, if needed, statistical data analysis. By the end of L4 T3, the learner is expected to carry out relevant machine learning driven data analysis, if needed, and finally to prepare an effective presentation, comprising a story and interactive visualizations and/or library support along with documentation, that communicates the outcomes of the project to the appropriate audience.
| Semester | Course Code | Course Name | Prerequisite | Theory Credit | Lab Credit | Total Credit |
|---|---|---|---|---|---|---|
| 9th (3-3) | DS 331 | Introduction To Data Science and Data Management & Analysis (DS Major) | STA 101, SE 121 | 2 | | 3 |
| | DS 332 | Introduction To Data Science and Data Management & Analysis Lab (DS Major) | STA 101, SE 121 | | 1 | |
| 10th (4-1) | DS 411 | Statistical Data Analysis (DS Major) | DS 331, DS 332 | 2 | | 3 |
| | DS 412 | Statistical Data Analysis Lab (DS Major) | DS 331, DS 332 | | 1 | |
| 11th (4-2) | DS 421 | Machine Learning Driven Data Analysis I (DS Major) | DS 411, DS 412, SE 544 | 2 | | 6 |
| | DS 422 | Machine Learning Driven Data Analysis Lab I (DS Major) | DS 411, DS 412, SE 544 | | 1 | |
| | DS 423 | Machine Learning Driven Data Analysis II and Communicating Data Insights (DS Major) | DS 411, DS 412, SE 544 | 2 | | |
| | DS 424 | Machine Learning Driven Data Analysis II and Communicating Data Insights Lab (DS Major) | DS 411, DS 412, SE 544 | | 1 | |
| 12th (4-3) | DS 431 | Data Science Major Capstone Project (DS Major) | All DS Major courses | 6 | | 6 |