Galvanize Data Science Immersive Program

Data Science Immersive Program (January 2017)

The following are the standards of Galvanize Inc.’s 3-month full-time immersive program. They reflect the body of knowledge covered, though more time and work were needed for proficiency in application.

What is a Standard?

Standards are the core competencies of data scientists: the knowledge, skills, and habits every Galvanize graduate should possess. These were carefully crafted in a joint effort by your lead instructors, and represent the knowledge, skills, and habits we believe you need to get your foot in the door and be successful in industry.

Standards by Topic

Python
1. Explain the difference between mutable and immutable types and their relationship to dictionaries.
2. Compare the strengths and weaknesses of lists vs. dictionaries.
3. Choose the appropriate collection (dict, Counter, defaultdict) to simplify a problem.
4. Compare the strengths and weaknesses of lists vs. generators.
5. Write Pythonic code.
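As a rough illustration of standards 1 to 4, the sketch below counts and groups a small made-up word list with a plain dict, a Counter, and a defaultdict. It then shows why a mutable list cannot be a dictionary key and how a generator differs from a list. The sample data and variable names are assumptions chosen for the example.

```python
from collections import Counter, defaultdict

words = ["spam", "eggs", "spam", "ham", "eggs", "spam"]

# Plain dict: we must handle the missing-key case ourselves.
counts_dict = {}
for w in words:
    counts_dict[w] = counts_dict.get(w, 0) + 1

# Counter: counting is built in.
counts_counter = Counter(words)

# defaultdict: missing keys get a default value (here an empty list),
# which is handy for grouping.
by_length = defaultdict(list)
for w in words:
    by_length[len(w)].append(w)

# Dictionary keys must be hashable, so an immutable tuple works as a key
# while a mutable list raises TypeError.
coords = {(0, 1): "ok"}
try:
    coords[[0, 1]] = "fails"
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

# A list holds every element in memory; a generator produces them lazily
# and can only be consumed once.
squares_list = [n * n for n in range(5)]
squares_gen = (n * n for n in range(5))
first_pass = sum(squares_gen)   # consumes the generator
second_pass = sum(squares_gen)  # nothing left to consume
```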
Version Control / Git
Explain the basic function and purpose of version control.
Use a basic Git workflow to track project changes over time, share code, and write useful commit messages.
OOP
Given the code for a Python class, instantiate a Python object, call its methods, and list its attributes.
Write the Python code for a simple class.
Match key “magic” methods to their functionality.
Design a program or algorithm in object oriented fashion.
Compare and contrast functional and object oriented programming.
SQL
Connect to a SQL database via the command line (e.g., Postgres).
Connect to a database from within a Python program.
State the function of basic SQL commands.
Write simple queries on a single table including SELECT, FROM, WHERE, CASE clauses and aggregates.
Write complex queries including JOINS and subqueries.
Explain how indexing works in Postgres.
Create and dump tables.
Format a query to follow a standard style.
Move data from a SQL database to a text file.
Pandas
Explain and use the relationship between DataFrame and Series
Know how to set and reset indexes
Use iloc, loc, ix, and iat appropriately
Use index alignment and know when it applies
Use Split-Apply-Combine Methods
Be able to read data into pandas and write it back out
Recognize problems that can probably be solved with Pandas (as opposed to writing vanilla Python functions).
Use basic DateTimeIndex functionality
Plotting
Describe the architecture of a matplotlib figure
Plot in and outside of notebooks with matplotlib and seaborn
Combine multiple datasets/categories in same plot
Use subplots effectively
Plot with Pandas
Use and explain scatter_matrix output
Use and explain a correlation heatmap
Visualize pairwise relationships with seaborn
Compare within-class distributions
Use matplotlib techniques with seaborn
Visualization
Explain the difference between exploratory and explanatory visualizations.
Explain what a visualization is
Don’t lie with data
Visualize multidimensional relationships with data using position, size, color, alpha, facets.
Create an explanatory visualization that makes a relationship in data explicit.
Workflow
Perform basic file operations from the command line, while consulting man/help/Google if necessary.
Get help using man (e.g., man grep)
Perform “survival” edits using vi, emacs, nano, or pico
Configure environment & aliases in .bashrc/.bash_profile/.profile
Install data science stack
Manage a process with job control
Examine system performance and kill processes
Work on a remote machine with ssh/scp
State what an RE (regular expression) is and write a simple one
State the features and use cases of grep/sed/awk/cut/paste to process/clean a text file
Probability
Define what a random variable is.
Explain difference between permutations and combinations.
Recite and apply major probability laws from memory: Bayes’ Rule, the Law of Total Probability (LOTP), and the Chain Rule
Recite and apply major random variable formulas from memory: E(X), Var(X), and Cov(X,Y)
Describe what a joint distribution is and be able to perform a simple calculation using joint distribution.
Define each major probability distribution and give 1 clear example of each
Explain independence of 2 r.v.’s and implications with respect to probability formulas, covariance formulas, etc.
Compute expectation of aX+bY and explain that it is a linear operator, where X and Y are random variables
Compute variance of aX + bY
Discuss why correlation is not causation
Describe correlation and its perils, with reference to Anscombe’s quartet
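For reference, the laws and formulas named in this section can be written as follows. This is a summary of standard results, assuming events A and B, a partition B_1, …, B_n of the sample space, random variables X and Y, and constants a and b.

```latex
\begin{align*}
P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)} && \text{(Bayes' rule)} \\
P(A) &= \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i) && \text{(law of total probability)} \\
P(A \cap B) &= P(A \mid B)\,P(B) && \text{(chain rule)} \\
\operatorname{Var}(X) &= E[X^2] - (E[X])^2 \\
\operatorname{Cov}(X, Y) &= E[XY] - E[X]\,E[Y] \\
E[aX + bY] &= a\,E[X] + b\,E[Y] \\
\operatorname{Var}(aX + bY) &= a^2 \operatorname{Var}(X) + b^2 \operatorname{Var}(Y) + 2ab\,\operatorname{Cov}(X, Y)
\end{align*}
```

When X and Y are independent, Cov(X, Y) = 0 and the last line reduces to a²Var(X) + b²Var(Y).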
Sampling
Compute MLE estimate for simple example (such as coin-flipping)
Write pseudocode for bootstrapping a given sample of size N.
Construct confidence interval for case where parametric construction does not work
Discuss examples of times when you need bootstrapping.
Define the Central Limit Theorem
Compute standard error
Compare and contrast the use cases of parametric and nonparametric estimation
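The bootstrapping standards above can be sketched in a few lines. The example below builds a percentile bootstrap confidence interval for the median, a statistic with no simple parametric interval. The function name `bootstrap_ci`, its defaults, and the data are assumptions made for illustration, and the percentile method is only one of several bootstrap interval constructions.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.median, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        # Draw N observations from the sample *with replacement*.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    estimates.sort()
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the estimates.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 3.3, 4.0]  # made-up sample
low, high = bootstrap_ci(data)
```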
Hypothesis Testing
Given a dataset, set up a null and alternative hypothesis, and calculate and interpret the p-value for the difference of means or proportions.
Given a dataset, set up a null and alternative hypothesis, and calculate and interpret the p-value for Chi-square test of independence
Describe a situation in which a one-tailed test would be appropriate (vs. a two-tailed test).
Given a particular situation, correctly choose among the following options: z-test, t-test, 2-sample t-test (one-sided and two-sided), 2-sample z-test (one-sided and two-sided)
Define p-value, Type I error, Type II error, significance level and discuss their significance in an example problem.
Account for the multiple comparisons problem via Bonferroni correction.
Compute the distribution of the difference of two independent normal random variables.
Discuss when to use an A/B test to evaluate the efficacy of a treatment
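As one possible illustration of the difference-of-proportions test and the Bonferroni correction, the sketch below uses only the standard library. The function names and the conversion counts are hypothetical, and the z-test assumes samples large enough for the normal approximation.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 == p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def bonferroni(p_values, alpha=0.05):
    """Compare each p-value to alpha / m, where m is the number of tests."""
    threshold = alpha / len(p_values)
    return threshold, [p <= threshold for p in p_values]

# Hypothetical A/B test: 120 of 1000 conversions vs. 150 of 1000.
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)

# Four hypothetical p-values from four simultaneous comparisons.
threshold, rejected = bonferroni([0.001, 0.02, 0.04, 0.30])
```

Without the correction, three of the four hypothetical p-values would fall below 0.05; with it, only the first does.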
Power
Define Power and relate it to the Type II error.
Compute power given a dataset and a problem.
Explain how the following factors contribute to power: sample size, effect size (difference between sample statistics and statistic formulated under the null), significance level
Identify what can be done to increase power.
Estimate sample size required of a test (power analysis) for one sample mean or proportion case
Solve by hand for the posterior distribution for a uniform prior based on coin flips.
Solve Discrete Bayes problem with some data
What is the difference between Bayesian and Frequentist inference, with respect to fixed parameters and prior beliefs?
Define power - Be able to draw the picture with two normal curves with different means and highlight the section that represents Power.
Explain trade off between significance and power
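A minimal sketch of the power calculation for the one-sample mean case follows. It assumes a one-sided z-test with known standard deviation, and the function names and numbers are illustrative. Power here is the area of the alternative distribution that lies beyond the critical value set under the null, which is the shaded region in the two-curve picture.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test of H0: mu = mu0 vs. H1: mu > mu0.

    effect is the true difference mu1 - mu0 (assumed positive).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)
    # Power = 1 - P(Type II error).
    return 1 - z.cdf(z_crit - effect * sqrt(n) / sigma)

def required_n(effect, sigma, alpha=0.05, power=0.8):
    """Smallest n that reaches the requested power for the same test."""
    z = NormalDist()
    return ceil(((z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * sigma / effect) ** 2)

power_at_100 = z_test_power(effect=0.5, sigma=2.0, n=100)
n_needed = required_n(effect=0.5, sigma=2.0)
```

Raising the sample size, the effect size, or the significance level each increases power in this formula; raising the significance level does so at the cost of more Type I errors.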
Multi Armed Bandit
Explain the difference between a frequentist A/B test and a Bayesian A/B test.
Define and explain prior, likelihood, and posterior.
Explain what a conjugate prior is and how it applies to A/B testing.
Analyze an A/B test with the Bayesian approach.
Explain how multi-armed bandit addresses the tradeoff between exploitation and exploration, and the relationship to regret.
Write pseudocode for the Multi-Armed Bandit algorithm.
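There are several multi-armed bandit strategies (epsilon-greedy, UCB1, and others). The sketch below shows one of them, Thompson sampling, because it ties together the prior, likelihood, and conjugate posterior from the standards above. It assumes Bernoulli rewards with a uniform Beta(1, 1) prior on each arm; the true conversion rates are made up and would be unknown in practice.

```python
import random

def thompson_sampling(true_rates, n_rounds=2000, seed=0):
    """Simulate a Bernoulli bandit played with Thompson sampling."""
    rng = random.Random(seed)
    k = len(true_rates)
    wins = [0] * k    # successes observed per arm
    fails = [0] * k   # failures observed per arm
    for _ in range(n_rounds):
        # Beta(1 + wins, 1 + fails) is the posterior under a uniform prior.
        # Sampling from it explores uncertain arms and exploits strong ones.
        samples = [rng.betavariate(1 + wins[i], 1 + fails[i]) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        # Pull the chosen arm and observe a Bernoulli reward.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            fails[arm] += 1
    pulls = [w + f for w, f in zip(wins, fails)]
    return wins, fails, pulls

wins, fails, pulls = thompson_sampling([0.10, 0.50, 0.80])
```

Regret in this simulation is the reward lost by pulling the two weaker arms; as the posteriors sharpen, most pulls go to the best arm.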
Linear Algebra in Python
Perform basic linear algebra operations by hand: multiply matrices, subtract matrices, transpose matrices, verify inverses.
Perform linear algebra operations (multiply matrices, transpose matrices, and invert matrices) in numpy.
Exploratory Data Analysis (EDA)
Define EDA in your own words.
Identify the key questions of EDA.
Perform EDA on a dataset.
Linear Regression
State and troubleshoot the assumptions of the linear regression model.
Describe, interpret, and visualize the model form of linear regression: Y = B0 + B1X1 + B2X2 + …
Relate Beta vector solution of Ordinary Least Squares to the cost function (residual sum of squares)
Perform ordinary least squares (OLS) with statsmodels and interpret the output: Beta coefficients, p-values, R^2, adjusted-R^2, AIC, BIC
Explain how to incorporate interactions and categorical variables into linear regression
Explain how one can detect outliers
Cross Validation & Regularized Linear Regression
Perform (one-fold) cross-validation on a dataset (train/test split)
Algorithmically, explain k-fold cross-validation
Give the reasoning for using k-fold cross-validation
Given one full model and one regularized model, name 2 appropriate ways to compare the two models. Name 1 inappropriate way.
Generally, when we increase flexibility or complexity of model, what happens to bias? variance? training error? test error?
Compare and contrast Lasso and Ridge regression.
What happens to Bias and Variance as we change the following factors: sample size, number of parameters, etc.
What is the cost function for Ridge? for Lasso?
Build a test error curve for Ridge regression, while varying the alpha parameter, to determine the optimal level of regularization
Build and interpret Learning curves for two learning algorithms, one that is overfit (high variance, low bias) and one that is underfit (low variance, high bias)
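To make the k-fold procedure and the Ridge test error curve concrete, here is a small standard-library sketch. It assumes a single feature and no intercept, so the Ridge solution has a simple closed form. The helper names and the simulated data are illustrative, and in practice one would more likely use scikit-learn.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each row is in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def ridge_slope(xs, ys, alpha):
    """Minimizer of sum((y - b*x)^2) + alpha * b^2 (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def cv_mse(xs, ys, alpha, k=5):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for train, test in k_fold_indices(len(xs), k):
        b = ridge_slope([xs[i] for i in train], [ys[i] for i in train], alpha)
        fold_errors.append(sum((ys[i] - b * xs[i]) ** 2 for i in test) / len(test))
    return sum(fold_errors) / k

# Simulated data: true slope 3 plus Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [3 * x + rng.gauss(0, 0.5) for x in xs]

# Test error curve: cross-validated error at each candidate alpha.
errors = {alpha: cv_mse(xs, ys, alpha) for alpha in (0.0, 0.1, 1.0, 10.0, 100.0)}
best_alpha = min(errors, key=errors.get)
```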
Logistic Regression
Place logistic regression in the taxonomy of ML algorithms
Fit and interpret a logistic regression model in scikit-learn
Interpret the coefficients of logistic regression, using odds ratio
Explain ROC curves
Explain the key differences and similarities between logistic and linear regression.
Gradient Descent
Identify and justify use cases for and failure modes of gradient descent.
Write pseudocode of the gradient descent and stochastic gradient descent algorithms.
Compare and contrast batch and stochastic gradient descent - the algorithms, costs, and benefits.
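The two algorithms can be compared side by side on a toy problem. The sketch below fits y = w*x + b by minimizing squared error; the learning rates, iteration counts, and noiseless data are assumptions picked so both versions converge. Batch descent uses the whole dataset for every update, while stochastic descent updates after each observation.

```python
import random

def batch_gd(xs, ys, lr=0.1, n_iters=2000):
    """Batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        # Gradient of the mean squared error over the *whole* dataset.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def sgd(xs, ys, lr=0.05, n_epochs=200, seed=0):
    """Stochastic gradient descent: one update per observation."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit observations in a new random order
        for i in order:
            # Gradient of the squared error on a *single* observation.
            err = w * xs[i] + b - ys[i]
            w -= lr * 2 * err * xs[i]
            b -= lr * 2 * err
    return w, b

xs = [i / 10 for i in range(-10, 11)]
ys = [2.0 * x + 1.0 for x in xs]  # noiseless line with slope 2, intercept 1

w_batch, b_batch = batch_gd(xs, ys)
w_sgd, b_sgd = sgd(xs, ys)
```

A learning rate that is too large makes either version diverge, which is one of the failure modes worth being able to explain.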
Decision Trees
Thoroughly explain the construction of a decision tree (classification or regression), including selecting an impurity measure (gini, entropy, variance)
Recognize overfitting and explain pre/post pruning and why it helps.
Pick the ‘best’ tree via cross-validation, for a given data set.
Discuss pros and cons
k-nearest neighbors (kNN)
Write pseudocode for the kNN algorithm from scratch
State differences between kNN regression and classification
Discuss Pros and Cons of kNN
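A from-scratch kNN can be sketched briefly. The version below assumes Euclidean distance and a brute-force search, with made-up training points. Classification takes a majority vote among the k neighbors, and regression averages their target values.

```python
from collections import Counter
from math import dist

def k_nearest(train_X, query, k):
    """Indices of the k training points closest to query (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    return order[:k]

def knn_classify(train_X, train_y, query, k=3):
    """Majority vote among the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Mean target value of the k nearest neighbors."""
    return sum(train_y[i] for i in k_nearest(train_X, query, k)) / k

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

label_near_origin = knn_classify(X, labels, (0.5, 0.5))
label_far = knn_classify(X, labels, (5.5, 5.5))
value_near_origin = knn_regress(X, values, (0.2, 0.2))
```

Every prediction scans the full training set, which is one of the main costs to raise when discussing pros and cons.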
Random Forest
Thoroughly explain the construction of a random forest (classification or regression) algorithm
Explain the relationship and difference between random forest and bagging.
Explain why random forests are typically more accurate than a single decision tree.
Explain how to get feature importances from a random forest using an algorithm
How is OOB error calculated and what is it an estimate of?
Boosted Trees
Define boosting in your own words.
Be able to interpret boosting output
List advantages and disadvantages of boosting.
Compare and contrast boosting with other ensemble methods
Explain each of the tuning parameters and specifically how they affect the model
Learn, tune, and score a model using scikit-learn’s boosting class
Implement AdaBoost
Support Vector Machines (SVM)
Compute a hyperplane as a decision boundary in SVC
Explain what a support vector is in plain English
Recognize that preprocessing, specifically making sure all predictors are on the same scale, is a necessary step
Explain SVC using the hyperparameter, C
Tune an SVM with an RBF kernel using both hyperparameters C and gamma
Tune an SVM with a polynomial kernel using both hyperparameters C and degree
Describe why generally speaking, an SVM with RBF kernel is more likely to perform well on “tall” data as opposed to “wide” data.
For SVMs with RBF, state what happens to bias and variance as we increase the hyperparameter “C”. State what happens to bias and variance as we increase the hyperparameter “gamma”.
State how the “one-vs-one” and “one-vs-rest” approaches for multi-class problems are implemented.
Describe the kernel trick: computing inner products as if in a high-dimensional space, without explicitly mapping the data into it.
Profit Curves
Describe the issues with imbalanced classes.
Explain the profit curve method for thresholding.
Explain sampling methods, give examples of them, and explain how they deal with imbalanced classes.
Explain cost sensitive learning and how it deals with imbalanced classes.
Webscraping
Compare and contrast SQL and NoSQL.
Complete basic operations with MongoDB.
Explain the basic concepts of HTML.
Write Python code to pull out an element from a web page.
Fetch data from an existing API
Naive Bayes
Derive the Naive Bayes algorithm and discuss its assumptions.
Contrast generative and discriminative models.
Discuss the pros and cons of Naive Bayes.
NLP
Identify and explain ways of featurizing text.
List and explain distance metrics used in document classification.
Featurize a text corpus in Python using nltk and scikit-learn.
Clustering
List the characteristics of a dataset necessary to perform K-means
Detail the k-means algorithm in steps, commenting on convergence or lack thereof.
Use the elbow method to determine K and evaluate the choice
Interpret Silhouette plot
Interpret clusters by examining cluster centers, and exploring the data within each cluster (dataframe inspection, plotting, decision trees for cluster membership)
Build and interpret a dendrogram using hierarchical clustering.
Compare and contrast k-means and hierarchical clustering.
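The k-means steps can be written out as a short sketch. This version assumes Euclidean distance, initial centers drawn at random from the data, and a made-up two-blob dataset; the function name and defaults are illustrative. It always converges, but only to a local optimum that depends on the initial centers.

```python
import random
from math import dist

def k_means(points, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize with k data points
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = k_means(points, k=2)
```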
Churn Case Study
List and explain the steps in CRISP-DM (Cross-Industry Standard Process for Data Mining)
Perform EDA standards on case study including visualizations
Discuss ramifications of deleting missing values when the data are MAR (missing at random), MCAR (missing completely at random), or MNAR (missing not at random)
Explain imputing missing values using at least 2 different methods; list pros and cons of each method
Explain when dropping rows is okay and when dropping features is okay
Be able to perform the feature engineering process
Be able to identify target leakage, and explain why it happens
State appropriate business goal and evaluation metric
Dimensionality Reduction
List reasons for reducing the dimensions.
Describe how the principal components are constructed in PCA.
Interpret the principal components of PCA.
Determine how many principal components to keep.
Describe the relationship between PCA and SVD.
Compute and interpret PCA using sklearn.
Memorize the eigenvalue equation
NMF
Write down and explain the NMF equation.
Compare and contrast NMF, SVD, PCA, and k-means
Implement Alternating-Least-Squares algorithm for NMF
Find and interpret latent topics in a corpus of documents with NMF
Explain how to interpret the H matrix and the W matrix.
Explain regularization in the context of NMF.
Recommender Systems
Survey approaches to recommenders, their pros & cons, and when each is likely to be best.
Describe the cold start problem and know how it affects different recommendation strategies
Explain either the collaborative filtering algorithm or the matrix factorization recommender algorithm.
Discuss recommender evaluation.
Discuss performance concerns for recommenders.
Graphs
Define a graph and discuss the implementation.
List common applications of graph models.
Discuss the searching algorithms and applications of them.
Explain the various ways of measuring the importance of a node.
Explain methods and applications of clustering on a graph.
Use appropriate package to build graph data structure in Python and execute common algorithms (shortest path, connected components, …)
Cloud Computing
Scope & Configure a data science environment on AWS.
Protect AWS resources against unauthorized access.
Manage AWS resources using awscli, ssh, scp, or boto3.
Monitor and control costs incurred on AWS
Parallel Computing
Define and contrast processes vs. threads
Define and contrast parallelism and concurrency.
Recognize problems that require parallelism or concurrency
Implement parallel and concurrent solutions
Instrument approaches to see the benefit of threading/parallelism.
Map Reduce
Explain the types of problems that benefit from MapReduce
Describe MapReduce, and how it relates to Hadoop
Explain how to select the number of mappers and reducers
Describe the role of keys in MapReduce
Perform MapReduce in Python using MRJob.
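The role of keys is easiest to see in a word count. The sketch below imitates the map, shuffle, and reduce phases in plain Python on a single machine. It does not use MRJob or Hadoop, and the function names and input lines are made up for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (key, value) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Combine all values that share a key into one result."""
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "The fox"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
```

Because all values for a given key arrive at the same reducer, the choice of key decides how the work is partitioned.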
Time Series
Recognize when time series analysis could be applied
Define key time series concepts
Determine structure of a time-series using graphical tools
Compute a forecast using Box-Jenkins Methodology
Evaluate models/forecasts using cross validation and statistical tests
Engineer features to handle seasonal, calendar, and periodic components
Explain taxonomy of exponential smoothing using ETS framework
Spark
Configure a machine to use Spark effectively
Describe differences and similarities between MapReduce and Spark
Get data into Spark for processing.
Describe lazy evaluation in the context of Spark.
Cache RDDs effectively to improve performance.
Use Spark to compute basic statistics
Know the difference between Spark data types: RDD, DataFrame, DAG
Use MLlib
SQL in Spark
Identify what distinguishes a Spark DataFrame from an RDD
Explain how to create a Spark DataFrame
Query a DF with SQL
Transform a DF with dataframe methods
Describe the challenges and requirements of saving schema’d datasets.
Use user-defined functions
Data Products
Explain REST architecture/API
Write a basic Flask API
Describe web architecture at a high level
Know the role of JavaScript in a web application
Know how to use developer tools to inspect an application
Write a basic Flask web application
Be able to describe the difference between online and offline computation
Fraud Case Study
Build an MVP (minimum viable product) quickly
Build a dashboard
Build system to take in online data from a stream
Build production-quality product
Whiteboarding
Explain the meaning of Big-Oh.
Analyze the runtime of code.
Solve whiteboarding interview questions.
Apply different techniques to address a whiteboarding interview problem
Business Analytics
Explain funnel metrics and applications
Identify red flags in a set of funnel metrics
Identify and discuss appropriate use cases for cohort analysis
Identify and explain the limits of data analysis
Given an open ended question, identify the business goal, metrics, and relevant data science solution.
Identify excessive or improper use of data analysis
Explain how data science is used in industry
Understand the range of business problems where A/B testing applies

Written on January 15, 2018