XBUS-511 - Diagnostics for More Informed Machine Learning

Course Description

Even with a modest dataset, the hunt for the most effective machine learning model is hard. Finding the optimal combination of features, algorithm, and hyperparameters that produces the best model frequently requires significant experimentation and iteration. This leads many machine learning practitioners to stay inside their algorithmic comfort zones, to trail off on random walks, or to resort to automated processes like grid search. But whatever path we take, many of us are left in doubt about whether our final solution really is the optimal one. And as our datasets grow in size and dimension, so too does this ambiguity.

Open source Python libraries such as Seaborn, Pandas and Yellowbrick can help make machine learning more informed with visual diagnostic tools like histograms, correlation matrices, parallel coordinates, manifold embeddings, validation and learning curves, residuals plots, and classification heatmaps. These tools enable us to tune our models with visceral cues that allow us to be more strategic in our choices.

In this course we will explore principled strategies for steering model search (e.g. visualizing feature transformations, algorithmic behavior, cross-validation methods, and model performance) to help identify better models, faster, and at lower cost to our organizations.

Enrollment in this course is open to all students and applies credit toward the Advanced Data Science Certificate, in either the Visual Analytics track or the Data Science track.

Course Objectives

Upon successful completion of the course, students will:

Question, investigate, diagnose, and mitigate the influence of bias in the modeling process.
Challenge and extend traditional ways of thinking about data visualization in the context of the machine learning workflow.
Evaluate trade-offs in models (e.g. precision vs. recall, overfit vs. underfit, accuracy vs. training time).
Compare and contrast hypothesis-driven workflows and experimental results using visual and statistical techniques.
Aggregate large amounts of complex data using visual and statistical methods.
Explore visual techniques for feature exploration, selection, projection, and dimensionality reduction.
Visually select the best model, composed of features, algorithm, and hyperparameters.
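
As a rough illustration of the precision vs. recall trade-off named in the objectives above, the following minimal sketch sweeps a decision threshold over a handful of classifier scores. The labels, scores, and the helper name precision_recall are made up for this example and are not course material; in practice a library such as scikit-learn or Yellowbrick would compute and plot these values.

```python
def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and not p)
    # Conventions for empty denominators: no positive predictions -> precision 1.0,
    # no actual positives -> recall 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and classifier scores, for illustration only.
y_true = [0, 0, 1, 0, 0, 1, 1, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

# Raising the threshold trades recall for precision on this toy data.
for threshold in (0.2, 0.5, 0.65):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, a low threshold catches every positive at the cost of more false alarms, while a high threshold removes the false alarms but misses one positive.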

Notes

Enrollment in this course is open to all students and applies CEUs toward the Data Science or Visual Analytics track.

Course Prerequisites

This program is for data science practitioners and leaders who meet the following criteria:

Have completed data science and machine learning coursework, such as Georgetown’s Certificate in Data Science or college- or graduate-level coursework.
Are familiar with software programming in either Python or R.
Can bring a laptop with administrative privileges for courses and workshops.