Galvanize Data Science Immersive Program
Data Science Immersive Program (January 2017)
The following are the standards of Galvanize Inc’s 3-month full-time immersive program. They reflect the body of knowledge covered, though more time and work were needed to become proficient in applying it.
What is a Standard?
Standards are the core competencies of data scientists: the knowledge, skills, and habits every Galvanize graduate should possess. They were carefully crafted in a joint effort by your lead instructors, and represent the knowledge, skills, and habits we believe you need to get your foot in the door and be successful in industry.
Standards by Topic
- Explain the difference between mutable and immutable types and their relationship to dictionaries.
- Compare the strengths and weaknesses of lists vs. dictionaries.
- Choose the appropriate collection (dict, Counter, defaultdict) to simplify a problem.
- Compare the strengths and weaknesses of lists vs. generators.
- Write pythonic code.
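The collection standards above can be illustrated with a minimal sketch. The `words` list is hypothetical data, not part of the program materials; the point is only to show hashable keys, `Counter`, `defaultdict`, and a lazy generator side by side.

```python
from collections import Counter, defaultdict

words = ["spam", "eggs", "spam", "ham", "eggs", "spam"]  # hypothetical data

# Counter tallies hashable items without manual bookkeeping.
counts = Counter(words)

# defaultdict(list) creates the empty list on first access to a key.
positions = defaultdict(list)
for i, word in enumerate(words):
    positions[word].append(i)

# Immutable tuples are hashable and can be dict keys; mutable lists cannot.
lookup = {("spam", 1): "ok"}
try:
    lookup[["spam", 1]] = "fails"
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

# A generator yields items lazily instead of building the whole list.
squares = (n * n for n in range(4))
first_two = [next(squares), next(squares)]
```

A list would work for all of these tasks, but each specialized collection removes a layer of hand-written bookkeeping.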
- Version Control / Git
- Explain the basic function and purpose of version control.
- Use a basic Git workflow to track project changes over time, share code, and write useful commit messages.
- Given the code for a python class, instantiate a python object and call the methods and list the attributes.
- Write the python code for a simple class.
- Match key “magic” methods to their functionality.
- Design a program or algorithm in object oriented fashion.
- Compare and contrast functional and object oriented programming.
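As one possible illustration of the class standards above, the sketch below defines a small class with a few "magic" methods. `Vector2D` is a hypothetical example class chosen for brevity.

```python
class Vector2D:
    """A tiny 2-D vector, used only to illustrate class mechanics."""

    def __init__(self, x, y):          # runs when an object is instantiated
        self.x = x
        self.y = y

    def __repr__(self):                # used by repr() and the interactive prompt
        return f"Vector2D({self.x}, {self.y})"

    def __eq__(self, other):           # used by the == operator
        return isinstance(other, Vector2D) and (self.x, self.y) == (other.x, other.y)

    def __add__(self, other):          # used by the + operator
        return Vector2D(self.x + other.x, self.y + other.y)

    def norm(self):                    # an ordinary method
        return (self.x ** 2 + self.y ** 2) ** 0.5


v = Vector2D(3, 4)            # instantiate an object
attributes = vars(v)          # list its instance attributes as a dict
total = v + Vector2D(1, 1)    # dispatches to __add__
```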
- Connect to a SQL database via command line (i.e. Postgres).
- Connect to a database from within a python program.
- State function of basic SQL commands.
- Write simple queries on a single table including SELECT, FROM, WHERE, CASE clauses and aggregates.
- Write complex queries including JOINS and subqueries.
- Explain how indexing works in Postgres.
- Create and dump tables.
- Format a query to follow a standard style.
- Move data from a SQL database to a text file.
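The query standards above can be tried without a database server. The sketch below swaps in Python's built-in `sqlite3` module with an in-memory database in place of Postgres; the `users` and `orders` tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a Postgres connection
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ana"), (2, "bo")])
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 35.0), (3, 2, 5.0)],
)

# One clause per line, keywords in upper case: a common formatting style.
query = """
SELECT u.name,
       SUM(o.amount) AS total,
       CASE WHEN SUM(o.amount) >= 50 THEN 'high' ELSE 'low' END AS tier
FROM users AS u
JOIN orders AS o
  ON o.user_id = u.id
GROUP BY u.name
ORDER BY u.name;
"""
rows = cur.execute(query).fetchall()
conn.close()
```

The same SELECT, JOIN, CASE, and aggregate syntax carries over to Postgres; only the connection code would differ.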
- Explain/use the relationship between DataFrame and Series
- Know how to set and reset indexes
- Use iloc, loc, ix, and iat appropriately
- Use index alignment and know when it applies
- Use Split-Apply-Combine Methods
- Be able to read data into pandas and write it back out
- Recognize problems that can probably be solved with Pandas (as opposed to writing vanilla Python functions).
- Use basic DatetimeIndex functionality
- Describe the architecture of a matplotlib figure
- Plot in and outside of notebooks with matplotlib and seaborn
- Combine multiple datasets/categories in same plot
- Use subplots effectively
- Plot with Pandas
- Use and explain scatter_matrix output
- Use and explain a correlation heatmap
- Visualize pairwise relationships with seaborn
- Compare within-class distributions
- Use matplotlib techniques with seaborn
- Explain the difference between exploratory and explanatory visualizations.
- Explain what a visualization is
- Don’t lie with data
- Visualize multidimensional relationships with data using position, size, color, alpha, facets.
- Create an explanatory visualization that makes a relationship in data explicit.
- Perform basic file operations from the command line, while consulting man/help/Google if necessary.
- Get help using man (e.g. man grep)
- Perform “survival” edits using vi, emacs, nano, or pico
- Configure environment & aliases in .bashrc/.bash_profile/.profile
- Install data science stack
- Manage a process with job control
- Examine system performance and kill processes
- Work on a remote machine with ssh/scp
- State what an RE (regular expression) is and write a simple one
- State the features and use cases of grep/sed/awk/cut/paste to process/clean a text file
- Define what a random variable is.
- Explain difference between permutations and combinations.
- Recite and apply major probability laws from memory: * Bayes’ Rule * Law of Total Probability (LOTP) * Chain Rule
- Recite and apply major random variable formulas from memory: * E(X) * Var(X) * Cov(X,Y)
- Describe what a joint distribution is and be able to perform a simple calculation using joint distribution.
- Define each major probability distribution and give 1 clear example of each
- Explain independence of 2 r.v.’s and implications with respect to probability formulas, covariance formulas, etc.
- Compute expectation of aX+bY and explain that it is a linear operator, where X and Y are random variables
- Compute variance of aX + bY
- Discuss why correlation is not causation
- Describe correlation and its perils, with reference to Anscombe’s quartet
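The expectation and variance standards above can be checked numerically. The sketch below uses a small hypothetical joint pmf (the probabilities are made up for illustration) and exact fractions to confirm that E(aX+bY) = aE(X) + bE(Y) and Var(aX+bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X,Y).

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X=x, Y=y); values chosen only for illustration.
joint = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 0): F(1, 8), (1, 1): F(3, 8)}

def expect(g):
    """E[g(X, Y)] under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - ex) ** 2)
var_y = expect(lambda x, y: (y - ey) ** 2)
cov_xy = expect(lambda x, y: (x - ex) * (y - ey))

a, b = 2, -3
# Computed directly from the definition, without using the shortcut formulas.
mean_lin = expect(lambda x, y: a * x + b * y)
var_lin = expect(lambda x, y: (a * x + b * y - mean_lin) ** 2)
```

Here X and Y are not independent (the covariance is nonzero), so the 2ab·Cov(X,Y) term cannot be dropped.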
- Compute MLE estimate for simple example (such as coin-flipping)
- Pseudocode Bootstrapping for a given sample of size N.
- Construct confidence interval for case where parametric construction does not work
- Discuss examples of times when you need bootstrapping.
- Define the Central Limit Theorem
- Compute standard error
- Compare and contrast the use cases of parametric and nonparametric estimation
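The bootstrapping standards above can be sketched as follows. This is a minimal percentile bootstrap, one of several bootstrap interval constructions; the function name `bootstrap_ci`, its defaults, and the `data` sample are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.median, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)])  # resample N with replacement
        for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 3.3, 4.9]  # hypothetical sample
low, high = bootstrap_ci(data)
```

The median is used as the default statistic because it has no simple parametric standard error, which is the kind of case where bootstrapping is useful.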
- Hypothesis Testing
- Given a dataset, set up a null and alternative hypothesis, and calculate and interpret the p-value for the difference of means or proportions.
- Given a dataset, set up a null and alternative hypothesis, and calculate and interpret the p-value for Chi-square test of independence
- Describe a situation in which a one-tailed test would be appropriate (vs. a two-tailed test).
- Given a particular situation, correctly choose among the following options: * z-test * t-test * 2 sample t-test (one-sided and two-sided) * 2 sample z-test (one-sided and two-sided)
- Define p-value, Type I error, Type II error, significance level and discuss their significance in an example problem.
- Account for the multiple comparisons problem via Bonferroni correction.
- Compute the distribution of the difference of two independent normal random variables.
- Discuss when to use an A/B test to evaluate the efficacy of a treatment
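As a sketch of the difference-of-proportions test and the Bonferroni correction listed above, the code below runs a two-sided, two-sample z-test using only the standard library. The conversion counts and the number of comparisons are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 == p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical A/B counts: conversions and visitors per arm.
z, p_value = two_proportion_ztest(120, 1000, 150, 1000)

# Bonferroni: with m comparisons, test each one at alpha / m.
alpha, n_tests = 0.05, 3
reject = p_value < alpha / n_tests
```

With these made-up counts the result sits just under 0.05 on its own but does not survive the correction for three comparisons.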
- Define Power and relate it to the Type II error.
- Compute power given a dataset and a problem.
- Explain how the following factors contribute to power: * sample size * effect size (difference between sample statistics and statistic formulated under the null) * significance level
- Identify what can be done to increase power.
- Estimate sample size required of a test (power analysis) for one sample mean or proportion case
- Solve by hand for the posterior distribution for a uniform prior based on coin flips.
- Solve Discrete Bayes problem with some data
- What is the difference between Bayesian and Frequentist inference, with respect to fixed parameters and prior beliefs?
- Define power and be able to draw the picture with two normal curves with different means, highlighting the section that represents power.
- Explain the trade-off between significance level and power
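The power standards above can be made concrete with a one-sided, one-sample z-test where the population standard deviation is assumed known. The function names and the numbers passed in are hypothetical; this is a sketch of the two-normal-curves picture, not a general power calculator.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal

def power_one_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 when the truth is mu1 > mu0."""
    z_crit = Z.inv_cdf(1 - alpha)              # rejection cutoff under H0
    shift = (mu1 - mu0) / (sigma / sqrt(n))    # effect size in standard errors
    return 1 - Z.cdf(z_crit - shift)           # power = 1 - P(Type II error)

def sample_size(mu0, mu1, sigma, alpha=0.05, power=0.8):
    """Smallest n that reaches the target power for the same test."""
    z_a, z_b = Z.inv_cdf(1 - alpha), Z.inv_cdf(power)
    return ceil(((z_a + z_b) * sigma / (mu1 - mu0)) ** 2)

n_needed = sample_size(mu0=100, mu1=103, sigma=10)
power_at_n = power_one_sided(100, 103, 10, n_needed)
```

Raising alpha lowers `z_crit` and so raises power, which is the trade-off between significance level and power in code form.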
- Multi Armed Bandit
- Explain the difference between a frequentist A/B test and a Bayesian A/B test.
- Define and explain prior, likelihood, and posterior.
- Explain what a conjugate prior is and how it applies to A/B testing.
- Analyze an A/B test with the Bayesian approach.
- Explain how multi-armed bandit addresses the tradeoff between exploitation and exploration, and the relationship to regret.
- Write pseudocode for the Multi-Armed Bandit algorithm.
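One way to turn the bandit standards above into code is Thompson sampling, shown here as a sketch; it is one of several bandit strategies and is not necessarily the one taught in the program. It relies on the Beta prior being conjugate to the Bernoulli likelihood, so each arm's posterior is Beta(1 + wins, 1 + losses). The `true_rates` are hypothetical and would be unknown in practice.

```python
import random

def thompson_sampling(true_rates, n_rounds=2000, seed=0):
    """Bernoulli bandit with a Beta(1, 1) prior on each arm's success rate."""
    rng = random.Random(seed)
    k = len(true_rates)
    wins, losses = [0] * k, [0] * k
    regret = 0.0
    for _ in range(n_rounds):
        # Draw one plausible rate per arm from its posterior, then play the best draw.
        draws = [rng.betavariate(1 + wins[i], 1 + losses[i]) for i in range(k)]
        arm = draws.index(max(draws))
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        regret += max(true_rates) - true_rates[arm]   # expected reward forgone
    return wins, losses, regret

wins, losses, regret = thompson_sampling([0.05, 0.10, 0.20])  # hypothetical rates
```

Arms with few pulls have wide posteriors and still get sampled occasionally (exploration), while arms with strong evidence win most draws (exploitation).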
- Linear Algebra in Python
- Perform basic Linear Algebra operations by hand: Multiply matrices, subtract matrices, Transpose matrices, verify inverses.
- Perform linear algebra operations (multiply matrices, transpose matrices, and invert matrices) in numpy.
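The by-hand operations above can be mirrored in plain Python, which makes the row-by-column arithmetic explicit. This sketch deliberately avoids NumPy; in NumPy the equivalents are `A @ B`, `A.T`, and `numpy.linalg.inv(A)`. The matrices are hypothetical.

```python
def transpose(A):
    """Swap rows and columns."""
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    """Row-by-column product; requires len(A[0]) == len(B)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def is_inverse(A, B, tol=1e-9):
    """Verify that A times B is the identity matrix, up to tol."""
    P = matmul(A, B)
    n = len(A)
    return all(abs(P[i][j] - (i == j)) <= tol for i in range(n) for j in range(n))

A = [[2, 1], [5, 3]]
A_inv = [[3, -1], [-5, 2]]   # det(A) = 1, so the inverse has integer entries
product = matmul(A, A_inv)
```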
- Exploratory Data Analysis (EDA)
- Define EDA in your own words.
- Identify the key questions of EDA.
- Perform EDA on a dataset.
- Linear Regression
- State and troubleshoot the assumptions of the linear regression model.
- Describe, interpret, and visualize the model form of linear regression: Y = B0+B1X1+B2X2+….
- Relate Beta vector solution of Ordinary Least Squares to the cost function (residual sum of squares)
- Perform ordinary least squares (OLS) with statsmodels and interpret the output: Beta coefficients, p-values, R^2, adjusted-R^2, AIC, BIC
- Explain how to incorporate interactions and categorical variables into linear regression
- Explain how one can detect outliers
- Cross Validation & Regularized Linear Regression
- Perform a single train/test split (sometimes called one-fold cross-validation) on a dataset
- Algorithmically, explain k-fold cross-validation
- Give the reasoning for using k-fold cross-validation
- Given one full model and one regularized model, name 2 appropriate ways to compare the two models. Name 1 inappropriate way.
- Generally, when we increase flexibility or complexity of model, what happens to bias? variance? training error? test error?
- Compare and contrast Lasso and Ridge regression.
- What happens to Bias and Variance as we change the following factors: sample size, number of parameters, etc.
- What is the cost function for Ridge? for Lasso?
- Build a test error curve for Ridge regression, while varying the alpha parameter, to determine the optimal level of regularization
- Build and interpret Learning curves for two learning algorithms, one that is overfit (high variance, low bias) and one that is underfit (low variance, high bias)
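The k-fold procedure described above can be sketched algorithmically as follows. The helper names `k_fold_indices` and `cross_val_score`, and the `fit` and `score` callables they accept, are hypothetical; libraries such as scikit-learn provide their own versions.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k nearly equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def cross_val_score(fit, score, X, y, k=5):
    """Average held-out score; fit and score are caller-supplied callables."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in test], [y[i] for i in test]))
    return sum(scores) / len(scores)

splits = list(k_fold_indices(10, k=3))
```

Because every observation serves as test data exactly once, the averaged score depends less on the luck of one particular split than a single train/test split does.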
- Logistic Regression
- Place logistic regression in the taxonomy of ML algorithms
- Fit and interpret a logistic regression model in scikit-learn
- Interpret the coefficients of logistic regression, using odds ratio
- Explain ROC curves
- Explain the key differences and similarities between logistic and linear regression.
- Gradient Descent
- Identify and justify use cases for and failure modes of gradient descent.
- Write pseudocode of the gradient descent and stochastic gradient descent algorithms.
- Compare and contrast batch and stochastic gradient descent - the algorithms, costs, and benefits.
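The two variants above can be compared on a toy problem. This sketch fits y = w*x + b by minimizing mean squared error; the learning rate, epoch count, and noise-free `data` are hypothetical choices that happen to converge, not recommended settings.

```python
import random

def gradient(w, b, points):
    """Gradient of mean squared error for y = w*x + b over the given points."""
    n = len(points)
    gw = sum(2 * (w * x + b - y) * x for x, y in points) / n
    gb = sum(2 * (w * x + b - y) for x, y in points) / n
    return gw, gb

def descend(points, lr=0.05, epochs=500, stochastic=False, seed=0):
    """Fit w and b by batch or stochastic gradient descent."""
    rng = random.Random(seed)
    w = b = 0.0
    for _ in range(epochs):
        if stochastic:
            # One update per point, visiting points in a fresh random order.
            batches = [[p] for p in rng.sample(points, len(points))]
        else:
            batches = [points]          # one update from the full dataset
        for batch in batches:
            gw, gb = gradient(w, b, batch)
            w, b = w - lr * gw, b - lr * gb
    return w, b

data = [(x, 3 * x + 1) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]   # hypothetical, noise-free
w_batch, b_batch = descend(data)
w_sgd, b_sgd = descend(data, stochastic=True)
```

Batch descent computes an exact gradient at the cost of a full pass per update, while the stochastic version makes many cheap, noisier updates per pass.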
- Decision Trees
- Thoroughly explain the construction of a decision tree (classification or regression), including selecting an impurity measure (gini, entropy, variance)
- Recognize overfitting and explain pre/post pruning and why it helps.
- Pick the ‘best’ tree via cross-validation, for a given data set.
- Discuss pros and cons
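The impurity measures named above drive the choice of each split when a tree is built. The sketch below computes Gini impurity, entropy, and the impurity decrease of a candidate split for classification labels; the helper names and label lists are hypothetical.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class proportions, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_gain(parent, left, right, impurity=gini):
    """Impurity decrease from splitting parent into left and right children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted

parent = ["a", "a", "b", "b"]
perfect = split_gain(parent, ["a", "a"], ["b", "b"])   # children are pure
useless = split_gain(parent, ["a", "b"], ["a", "b"])   # children match the parent
```

A tree-building algorithm would evaluate `split_gain` for every candidate feature and threshold and keep the split with the largest decrease.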
- k-Nearest Neighbors (kNN)
- Write pseudocode for the kNN algorithm from scratch
- State differences between kNN regression and classification
- Discuss Pros and Cons of kNN
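A from-scratch version of the algorithm above might look like the following sketch, which uses Euclidean distance and handles both classification (majority vote) and regression (neighbor average). The function name and the toy points are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3, regression=False):
    """Predict for one query point from its k nearest training points."""
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    targets = [y for _, y in neighbors]
    if regression:
        return sum(targets) / len(targets)           # average the neighbors
    return Counter(targets).most_common(1)[0][0]     # majority vote

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]   # hypothetical points
labels = ["red", "red", "red", "blue", "blue", "blue"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_class = knn_predict(X, labels, (0.2, 0.1))
predicted_value = knn_predict(X, values, (0.2, 0.1), regression=True)
```

There is no training step: all of the work happens at prediction time, which is why kNN becomes slow on large datasets.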
- Random Forest
- Thoroughly explain the construction of a random forest (classification or regression) algorithm
- Explain the relationship and difference between random forest and bagging.
- Explain why random forests are typically more accurate than a single decision tree.
- Explain how to get feature importances from a random forest using an algorithm
- How is OOB error calculated and what is it an estimate of?
- Boosted Trees
- Define boosting in your own words.
- Be able to interpret boosting output
- List advantages and disadvantages of boosting.
- Compare and contrast boosting with other ensemble methods
- Explain each of the tuning parameters and specifically how they affect the model
- Learn, tune, and score a model using scikit-learn’s boosting class
- Support Vector Machines (SVM)
- Compute a hyperplane as a decision boundary in SVC
- Explain what a support vector is in plain English
- Recognize that preprocessing, specifically making sure all predictors are on the same scale, is a necessary step
- Explain SVC using the hyperparameter, C
- Tune a SVM with an RBF using both hyperparameters C and gamma
- Tune a SVM with a polynomial kernel using both hyperparameters C and degree
- Describe why, generally speaking, an SVM with RBF kernel is more likely to perform well on “tall” data as opposed to “wide” data.
- For SVMs with RBF, state what happens to bias and variance as we increase the hyperparameter “C”. State what happens to bias and variance as we increase the hyperparameter “gamma”.
- State how the “one-vs-one” and “one-vs-rest” approaches for multi-class problems are implemented.
- Describe the kernel trick: being able to compute inner products as if in a high-dimensional space, without explicitly mapping the data into it.
- Profit Curves
- Describe the issues with imbalanced classes.
- Explain the profit curve method for thresholding.
- Explain sampling methods and give examples of sampling methods.
- Explain how sampling methods deal with imbalanced classes.
- Explain cost sensitive learning and how it deals with imbalanced classes.
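The profit curve method above can be sketched in a few lines. This version assumes that only predicted positives carry a payoff (a benefit per true positive, a cost per false positive) and that negatives contribute nothing; the scores, labels, and dollar values are hypothetical.

```python
def profit_curve(scores, labels, benefit_tp, cost_fp):
    """Average profit per case at each threshold, taken at every distinct score.

    Cases scoring >= threshold are treated as predicted positives. True
    negatives and false negatives are assumed to contribute zero profit.
    """
    curve = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
        curve.append((threshold, (tp * benefit_tp - fp * cost_fp) / len(scores)))
    return curve

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]     # hypothetical model probabilities
labels = [1, 1, 0, 1, 0, 0]
curve = profit_curve(scores, labels, benefit_tp=10, cost_fp=4)
best_threshold, best_profit = max(curve, key=lambda point: point[1])
```

The threshold that maximizes profit depends on the cost-benefit values, not just on the classifier, so it is generally not 0.5.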
- Compare and contrast SQL and noSQL.
- Complete basic operations with mongo.
- Explain the basic concepts of HTML.
- Write python code to pull out an element from a web page.
- Fetch data from an existing API
- Naive Bayes
- Derive the Naive Bayes algorithm and discuss its assumptions.
- Contrast generative and discriminative models.
- Discuss the pros and cons of Naive Bayes.
- Identify and explain ways of featurizing text.
- List and explain distance metrics used in document classification.
- Featurize a text corpus in Python using nltk and scikit-learn.
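To show what a text featurizer computes, the sketch below builds TF-IDF vectors and a cosine similarity by hand instead of calling nltk or scikit-learn. It uses idf = log(N / df), which is only one of several common TF-IDF variants, and whitespace tokenization; the three-document corpus is hypothetical.

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    """Bag-of-words TF-IDF with idf = log(N / df); one common variant of several."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    df = Counter(tok for toks in tokenized for tok in set(toks))  # document frequency
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] / len(toks) * log(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["the cat sat", "the cat ran", "dogs bark loudly"]   # hypothetical corpus
vocab, vecs = tfidf_vectors(docs)
similar = cosine_similarity(vecs[0], vecs[1])
unrelated = cosine_similarity(vecs[0], vecs[2])
```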
- List the characteristics of a dataset necessary to perform K-means
- Detail the k-means algorithm in steps, commenting on convergence or lack thereof.
- Use the elbow method to determine K and evaluate the choice
- Interpret Silhouette plot
- Interpret clusters by examining cluster centers, and exploring the data within each cluster (dataframe inspection, plotting, decision trees for cluster membership)
- Build and interpret a dendrogram using hierarchical clustering.
- Compare and contrast k-means and hierarchical clustering.
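The k-means steps above can be written out as a short sketch of Lloyd's algorithm. Initial centroids are drawn at random from the data, which is one simple initialization among several; the function name and the toy points are hypothetical.

```python
import random
from math import dist

def k_means(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)               # initialize from the data
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                            # assignment step
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [                           # update step
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:              # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, clusters = k_means(points, k=2)
```

The loop always terminates, but only at a local optimum that depends on the starting centroids, which is why implementations usually restart from several initializations.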
- Churn Case Study
- List and explain the steps in CRISP-DM (Cross-Industry Standard Process for Data Mining)
- Perform EDA standards on case study including visualizations
- Discuss ramifications of deleting missing values when * MAR (missing at random) * MCAR (missing completely at random) * MNAR (missing not at random)
- Explain imputing missing values using at least 2 different methods, list pros and cons of each method
- Explain when dropping rows is okay and when dropping features is okay
- Be able to perform the feature engineering process
- Be able to identify target leak, and explain why this happens
- State appropriate business goal and evaluation metric
- Dimensionality Reduction
- List reasons for reducing the dimensions.
- Describe how the principal components are constructed in PCA.
- Interpret the principal components of PCA.
- Determine how many principal components to keep.
- Describe the relationship between PCA and SVD.
- Compute and interpret PCA using sklearn.
- Memorize the eigenvalue equation
- Write down and explain the NMF equation.
- Compare and contrast NMF, SVD, and PCA, and k-means
- Implement Alternating-Least-Squares algorithm for NMF
- Find and interpret latent topics in a corpus of documents with NMF
- Explain how to interpret the H matrix and the W matrix.
- Explain regularization in the context of NMF.
- Recommender Systems
- Survey approaches to recommenders, their pros & cons, and when each is likely to be best.
- Describe the cold start problem and know how it affects different recommendation strategies
- Explain either the collaborative filtering algorithm or the matrix factorization recommender algorithm.
- Discuss recommender evaluation.
- Discuss performance concerns for recommenders.
- Define a graph and discuss the implementation.
- List common applications of graph models.
- Discuss the searching algorithms and applications of them.
- Explain the various ways of measuring the importance of a node.
- Explain methods and applications of clustering on a graph.
- Use appropriate package to build graph data structure in Python and execute common algorithms (shortest path, connected components, …)
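As an example of the graph searching standards above, the sketch below stores a graph as an adjacency dictionary and uses breadth-first search to find a shortest path in an unweighted graph. The node names are hypothetical; a package such as networkx would provide this and the other common algorithms ready-made.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search on an unweighted graph stored as an adjacency dict."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:       # walk parent links back to start
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:   # visit each node at most once
                parents[neighbor] = node
                queue.append(neighbor)
    return None                           # goal is unreachable

graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
    "f": [],                              # isolated node
}
path = shortest_path(graph, "a", "e")
```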
- Cloud Computing
- Scope & Configure a data science environment on AWS.
- Protect AWS resources against unauthorized access.
- Manage AWS resources using awscli, ssh, scp, or boto3.
- Monitor and control costs incurred on AWS
- Parallel Computing
- Define and contrast processes vs. threads
- Define and contrast parallelism and concurrency.
- Recognize problems that require parallelism or concurrency
- Implement parallel and concurrent solutions
- Instrument approaches to see the benefit of threading/parallelism.
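The threading standards above can be instrumented with a small timing experiment. In this sketch `fake_io_task` is a hypothetical stand-in for an I/O-bound call; it sleeps instead of touching a network, so the measured times are only illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io_task(n):
    """Stand-in for an I/O-bound call such as a network request."""
    time.sleep(0.05)
    return n * n

inputs = list(range(8))

start = time.perf_counter()
serial = [fake_io_task(n) for n in inputs]
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded = list(pool.map(fake_io_task, inputs))   # results keep input order
threaded_seconds = time.perf_counter() - start
```

Comparing `serial_seconds` with `threaded_seconds` shows the benefit for waiting-dominated work. CPU-bound work in CPython would usually call for processes instead, for example `ProcessPoolExecutor` from the same module.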
- Map Reduce
- Explain Types of Problems which benefit from MapReduce
- Describe map-reduce, and how it relates to Hadoop
- Explain how to select the number of mappers and reducers
- Describe the role of keys in MapReduce
- Perform MapReduce in python using MRJob.
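The role of keys in MapReduce can be seen in a single-process word count. This sketch imitates the map, shuffle, and reduce phases in plain Python and does not use MRJob or Hadoop; the function names and input lines are hypothetical.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (key, value) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def reducer(key, values):
    """Combine all values that share one key."""
    return key, sum(values)

def map_reduce(lines):
    groups = defaultdict(list)
    for line in lines:                      # map phase
        for key, value in mapper(line):
            groups[key].append(value)       # shuffle: group values by key
    return dict(reducer(k, v) for k, v in sorted(groups.items()))   # reduce phase

counts = map_reduce(["the quick fox", "the lazy dog", "The end"])
```

In a real cluster the shuffle step routes all pairs with the same key to the same reducer, which is what lets the reducers run independently of one another.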
- Time Series
- Recognize when time series analysis could be applied
- Define key time series concepts
- Determine structure of a time-series using graphical tools
- Compute a forecast using Box-Jenkins Methodology
- Evaluate models/forecasts using cross validation and statistical tests
- Engineer features to handle seasonal, calendar, and periodic components
- Explain taxonomy of exponential smoothing using ETS framework
- Configure a machine to use spark effectively
- Describe differences and similarities between MapReduce and Spark
- Get data into spark for processing.
- Describe lazy evaluation in the context of Spark.
- Cache RDDs effectively to improve performance.
- Use Spark to compute basic statistics
- Know the difference between the Spark concepts RDD, DataFrame, and DAG
- SQL in Spark
- Identify what distinguishes a Spark DataFrame from an RDD
- Explain how to create a Spark DataFrame
- Query a DF with SQL
- Transform a DF with dataframe methods
- Describe the challenges and requirements of saving schema’d datasets.
- Use user-defined functions
- Data Products
- Explain REST architecture/API
- Write a basic Flask API
- Describe web architecture at a high level
- Know how to use developer tools to inspect an application
- Write a basic Flask web application
- Be able to describe the difference between online and offline computation
- Fraud Case Study
- Build an MVP (minimum viable product) quickly
- Build a dashboard
- Build system to take in online data from a stream
- Build production-quality product
- Explain the meaning of Big-Oh.
- Analyze the runtime of code.
- Solve whiteboarding interview questions.
- Apply different techniques to address a whiteboarding interview problem
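Runtime analysis can be made tangible by counting steps. The sketch below compares an O(n) linear search with an O(log n) binary search on the same sorted list; both helpers are hypothetical and return the number of elements inspected alongside the index.

```python
def linear_search(items, target):
    """O(n): may inspect every element. Returns (index, elements inspected)."""
    for steps, item in enumerate(items, start=1):
        if item == target:
            return steps - 1, steps
    return -1, len(items)

def binary_search(items, target):
    """O(log n) on sorted input: halves the search range each step."""
    lo, hi, steps = 0, len(items) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps

items = list(range(1024))                     # sorted input, n = 1024
linear_index, linear_steps = linear_search(items, 1023)
binary_index, binary_steps = binary_search(items, 1023)
```

Doubling the list adds about n more steps to the linear search in the worst case but only one more step to the binary search.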
- Business Analytics
- Explain funnel metrics and applications
- Identify red flags in a set of funnel metrics
- Identify and discuss appropriate use cases for cohort analysis
- Identify and explain the limits of data analysis
- Given an open ended question, identify the business goal, metrics, and relevant data science solution.
- Identify excessive or improper use of data analysis
- Explain how data science is used in industry
- Understand the range of business problems where A/B testing applies