Galvanize Data Science Immersive Program

Data Science Immersive Program (January 2017)

The following are the standards of Galvanize Inc.'s three-month, full-time immersive program. They represent the body of knowledge covered, though more time and work was needed to become proficient in applying it.

What is a Standard?

Standards are the core competencies of data scientists: the knowledge, skills, and habits every Galvanize graduate should possess. These were carefully crafted in a joint effort by the lead instructors, and represent the knowledge, skills, and habits we believe students need to get their foot in the door and be successful in industry.

Standards by Topic

  1. Python
    1. Explain the difference between mutable and immutable types and their relationship to dictionaries.
    2. Compare the strengths and weaknesses of lists vs. dictionaries.
    3. Choose the appropriate collection (dict, Counter, defaultdict) to simplify a problem (see the sketch after this list).
    4. Compare the strengths and weaknesses of lists vs. generators.
    5. Write pythonic code.
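
A minimal sketch of choosing the right collection (item 3 above), using a small hypothetical word list:

```python
from collections import Counter, defaultdict

words = "the quick brown fox jumps over the lazy dog the fox".split()

# Counter: counting hashable items
counts = Counter(words)
print(counts.most_common(2))          # [('the', 3), ('fox', 2)]

# defaultdict: grouping without key-existence checks
by_first_letter = defaultdict(list)
for w in words:
    by_first_letter[w[0]].append(w)
print(by_first_letter['f'])           # ['fox', 'fox']
```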
  2. Version Control / Git
    1. Explain the basic function and purpose of version control.
    2. Use a basic Git workflow to track project changes over time, share code, and write useful commit messages.

  3. OOP
    1. Given the code for a Python class, instantiate a Python object, call its methods, and list its attributes.
    2. Write the Python code for a simple class (see the sketch after this list).
    3. Match key "magic" methods to their functionality.
    4. Design a program or algorithm in an object-oriented fashion.
    5. Compare and contrast functional and object-oriented programming.
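
A minimal sketch of a simple class with a couple of "magic" methods (items 2 and 3 above); this toy Fraction class is illustrative, not part of the standard:

```python
class Fraction(object):
    """A toy class illustrating attributes, methods, and magic methods."""

    def __init__(self, numerator, denominator):
        self.numerator = numerator        # attributes set at instantiation
        self.denominator = denominator

    def to_float(self):                   # an ordinary method
        return self.numerator / float(self.denominator)

    def __repr__(self):                   # magic method: printable form
        return '{}/{}'.format(self.numerator, self.denominator)

    def __mul__(self, other):             # magic method: the * operator
        return Fraction(self.numerator * other.numerator,
                        self.denominator * other.denominator)

half = Fraction(1, 2)                     # instantiate an object
print(half.to_float(), half * Fraction(2, 3))  # -> 0.5 2/6
```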

  4. SQL
    1. Connect to a SQL database (e.g. Postgres) via the command line.
    2. Connect to a database from within a Python program (see the sketch after this list).
    3. State the function of basic SQL commands.
    4. Write simple queries on a single table, including SELECT, FROM, WHERE, and CASE clauses and aggregates.
    5. Write complex queries, including JOINs and subqueries.
    6. Explain how indexing works in Postgres.
    7. Create and dump tables.
    8. Format a query to follow a standard style.
    9. Move data from a SQL database to a text file.
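
A minimal sketch of connecting to Postgres from Python (item 2 above), assuming a hypothetical local database dsi with a users table; psycopg2 is one common driver:

```python
import psycopg2

# Connection parameters are hypothetical; adjust for your database
conn = psycopg2.connect(dbname='dsi', user='postgres', host='localhost')
cur = conn.cursor()

cur.execute("""
    SELECT city, COUNT(*) AS n_users
    FROM users
    WHERE signup_date >= '2017-01-01'
    GROUP BY city
    ORDER BY n_users DESC;
""")
for row in cur.fetchall():
    print(row)

conn.close()
```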

  5. Pandas
    1. Explain and use the relationship between DataFrame and Series.
    2. Set and reset indexes.
    3. Use iloc, loc, ix, and iat appropriately.
    4. Use index alignment and know when it applies.
    5. Use split-apply-combine methods (see the sketch after this list).
    6. Read data into and write data out of pandas.
    7. Recognize problems that can probably be solved with pandas (as opposed to writing vanilla Python functions).
    8. Use basic DatetimeIndex functionality.
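
A minimal split-apply-combine sketch (item 5 above), with a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'city': ['SF', 'SF', 'NYC', 'NYC', 'NYC'],
                   'sales': [10, 20, 5, 15, 25]})

# Split by city, apply aggregations, combine into a new DataFrame
summary = df.groupby('city')['sales'].agg(['mean', 'sum', 'count'])
print(summary)

df = df.set_index('city')      # set an index...
df = df.reset_index()          # ...and reset it back to a column
```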

  6. Plotting
    1. Describe the architecture of a matplotlib figure.
    2. Plot inside and outside of notebooks with matplotlib and seaborn.
    3. Combine multiple datasets/categories in the same plot.
    4. Use subplots effectively.
    5. Plot with pandas.
    6. Use and explain scatter_matrix output.
    7. Use and explain a correlation heatmap.
    8. Visualize pairwise relationships with seaborn.
    9. Compare within-class distributions.
    10. Use matplotlib techniques with seaborn.

  7. Visualization
    1. Explain the difference between exploratory and explanatory visualizations.
    2. Explain what a visualization is.
    3. Don't lie with data.
    4. Visualize multidimensional relationships in data using position, size, color, alpha, and facets.
    5. Create an explanatory visualization that makes a relationship in data explicit.

  8. Workflow
    1. Perform basic file operations from the command line, consulting man pages/help/Google as necessary.
    2. Get help using man (e.g. man grep).
    3. Perform "survival" edits using vi, emacs, nano, or pico.
    4. Configure environment & aliases in .bashrc/.bash_profile/.profile.
    5. Install a data science stack.
    6. Manage a process with job control.
    7. Examine system performance and kill processes.
    8. Work on a remote machine with ssh/scp.
    9. State what a regular expression (RE) is and write a simple one.
    10. State the features and use cases of grep/sed/awk/cut/paste for processing/cleaning a text file.

  9. Probability
    1. Define what a random variable is.
    2. Explain the difference between permutations and combinations.
    3. Recite and apply the major probability laws from memory: Bayes' rule, the law of total probability (LOTP), and the chain rule.
    4. Recite and apply the major random variable formulas from memory: E(X), Var(X), Cov(X, Y).
    5. Describe what a joint distribution is and perform a simple calculation using a joint distribution.
    6. Define each major probability distribution and give one clear example of each.
    7. Explain independence of two random variables and its implications for probability and covariance formulas.
    8. Compute the expectation of aX + bY, where X and Y are random variables, and explain why expectation is a linear operator (see the sketch after this list).
    9. Compute the variance of aX + bY.
    10. Discuss why correlation is not causation.
    11. Describe correlation and its perils, with reference to Anscombe's quartet.
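
A quick numerical check of the E(aX + bY) and Var(aX + bY) formulas (items 8 and 9 above), using independent simulated normals:

```python
import numpy as np

np.random.seed(0)
a, b = 2.0, -3.0
x = np.random.normal(1.0, 2.0, size=1000000)  # X ~ N(1, 4)
y = np.random.normal(5.0, 1.0, size=1000000)  # Y ~ N(5, 1), independent of X

z = a * x + b * y
print(z.mean())  # ~ a*E[X] + b*E[Y] = 2*1 - 3*5 = -13
print(z.var())   # ~ a^2*Var(X) + b^2*Var(Y) = 4*4 + 9*1 = 25 (Cov(X,Y) = 0)
```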

  10. Sampling
    1. Compute the MLE estimate for a simple example (such as coin flipping).
    2. Pseudocode bootstrapping for a given sample of size N (see the sketch after this list).
    3. Construct a confidence interval for a case where parametric construction does not work.
    4. Discuss examples of situations where you need bootstrapping.
    5. Define the Central Limit Theorem.
    6. Compute the standard error.
    7. Compare and contrast the use cases of parametric and nonparametric estimation.
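
A minimal percentile-bootstrap sketch (items 2 and 3 above); the exponential data is a stand-in for any skewed sample where a parametric interval is shaky:

```python
import numpy as np

def bootstrap_ci(sample, stat=np.mean, n_boot=10000, alpha=0.05):
    """Percentile bootstrap confidence interval for a statistic."""
    sample = np.asarray(sample)
    boot_stats = np.empty(n_boot)
    for i in range(n_boot):
        # Resample with replacement, same size as the original sample
        resample = np.random.choice(sample, size=len(sample), replace=True)
        boot_stats[i] = stat(resample)
    lower = np.percentile(boot_stats, 100 * alpha / 2)
    upper = np.percentile(boot_stats, 100 * (1 - alpha / 2))
    return lower, upper

np.random.seed(42)
data = np.random.exponential(scale=2.0, size=200)
print(bootstrap_ci(data))
```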

  11. Hypothesis Testing
    1. Given a dataset, set up a null and alternative hypothesis, and calculate and interpret the p-value for a difference of means or proportions (see the sketch after this list).
    2. Given a dataset, set up a null and alternative hypothesis, and calculate and interpret the p-value for a chi-square test of independence.
    3. Describe a situation in which a one-tailed test would be appropriate (vs. a two-tailed test).
    4. Given a particular situation, correctly choose among the following options: z-test, t-test, and two-sample t-test or z-test (one-sided and two-sided).
    5. Define p-value, Type I error, Type II error, and significance level, and discuss their significance in an example problem.
    6. Account for the multiple comparisons problem via the Bonferroni correction.
    7. Compute the difference of two independent random normal variables.
    8. Discuss when to use an A/B test to evaluate the efficacy of a treatment.
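
A minimal sketch of a two-sample test for a difference of means (item 1 above), using Welch's t-test on hypothetical group data:

```python
import numpy as np
from scipy import stats

np.random.seed(0)
control = np.random.normal(10.0, 2.0, size=100)    # hypothetical control group
treatment = np.random.normal(10.6, 2.0, size=100)  # hypothetical treatment group

# H0: equal means; Welch's two-sample t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(t_stat, p_value)  # reject H0 at alpha = 0.05 if p_value < 0.05
```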

  12. Power
    1. Define power and relate it to the Type II error.
    2. Compute power given a dataset and a problem (see the sketch after this list).
    3. Explain how the following factors contribute to power: sample size, effect size (the difference between the sample statistic and the statistic formulated under the null), and significance level.
    4. Identify what can be done to increase power.
    5. Estimate the sample size required for a test (power analysis) in the one-sample mean or proportion case.
    6. Solve by hand for the posterior distribution for a uniform prior based on coin flips.
    7. Solve a discrete Bayes problem with some data.
    8. Explain the difference between Bayesian and frequentist inference with respect to fixed parameters and prior beliefs.
    9. Define power, and be able to draw the picture of two normal curves with different means, highlighting the section that represents power.
    10. Explain the trade-off between significance and power.
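
A minimal sketch of computing power for a one-sided, one-sample z-test (item 2 above); all numbers are hypothetical:

```python
import numpy as np
from scipy import stats

def power_one_sample_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 vs. H1: mu = mu1 > mu0."""
    se = sigma / np.sqrt(n)
    z_crit = stats.norm.ppf(1 - alpha)   # rejection threshold in z units
    # Probability of rejecting H0 when the true mean is mu1
    return stats.norm.sf(z_crit - (mu1 - mu0) / se)

print(power_one_sample_z(mu0=100, mu1=103, sigma=15, n=100))
```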

  13. Multi-Armed Bandit
    1. Explain the difference between a frequentist A/B test and a Bayesian A/B test.
    2. Define and explain prior, likelihood, and posterior.
    3. Explain what a conjugate prior is and how it applies to A/B testing.
    4. Analyze an A/B test with the Bayesian approach.
    5. Explain how a multi-armed bandit addresses the trade-off between exploitation and exploration, and the relationship to regret.
    6. Write pseudocode for a multi-armed bandit algorithm (see the sketch after this list).
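
A minimal sketch of one common bandit strategy, Thompson sampling for a Bernoulli bandit (item 6 above); the click-through rates are hypothetical:

```python
import numpy as np

def thompson_sampling(true_ctrs, n_rounds=10000):
    """Bernoulli bandit with Beta(1, 1) priors on each arm's conversion rate."""
    n_arms = len(true_ctrs)
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    for _ in range(n_rounds):
        # Sample a plausible rate from each arm's posterior; play the best-looking arm
        samples = np.random.beta(successes + 1, failures + 1)
        arm = np.argmax(samples)
        reward = np.random.rand() < true_ctrs[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward
    return successes, failures

s, f = thompson_sampling([0.04, 0.05, 0.06])
print(s + f)  # pulls per arm: most traffic should flow to the best arm
```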

  14. Linear Algebra in Python
    1. Perform basic linear algebra operations by hand: multiply matrices, subtract matrices, transpose matrices, and verify inverses.
    2. Perform linear algebra operations (multiply, transpose, and invert matrices) in numpy (see the sketch after this list).
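
A minimal numpy sketch of item 2 above:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

print(A @ B)              # matrix product (np.dot(A, B) is equivalent)
print(A - B)              # elementwise subtraction
print(A.T)                # transpose
A_inv = np.linalg.inv(A)
print(A @ A_inv)          # ~ identity matrix, verifying the inverse
```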

  15. Exploratory Data Analysis (EDA)
    1. Define EDA in your own words.
    2. Identify the key questions of EDA.
    3. Perform EDA on a dataset.

  16. Linear Regression
    1. State and troubleshoot the assumptions of the linear regression model.
    2. Describe, interpret, and visualize the model form of linear regression: Y = B0 + B1X1 + B2X2 + ….
    3. Relate the beta vector solution of ordinary least squares to the cost function (the residual sum of squares).
    4. Perform ordinary least squares (OLS) with statsmodels and interpret the output: beta coefficients, p-values, R^2, adjusted R^2, AIC, BIC (see the sketch after this list).
    5. Explain how to incorporate interactions and categorical variables into linear regression.
    6. Explain how one can detect outliers.
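
A minimal statsmodels OLS sketch (item 4 above), on simulated data with known coefficients:

```python
import numpy as np
import statsmodels.api as sm

np.random.seed(0)
X = np.random.rand(100, 2)
y = 3 + 2 * X[:, 0] - 1 * X[:, 1] + np.random.normal(0, 0.1, 100)

X_const = sm.add_constant(X)          # adds the intercept column for B0
model = sm.OLS(y, X_const).fit()
print(model.summary())                # betas, p-values, R^2, adjusted R^2, AIC, BIC
print(model.params)                   # ~ [3, 2, -1]
```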

  17. Cross Validation & Regularized Linear Regression
    1. Perform (one-fold) cross-validation on a dataset (train/test splitting).
    2. Explain k-fold cross-validation algorithmically.
    3. Give the reasoning for using k-fold cross-validation.
    4. Given one full model and one regularized model, name two appropriate ways to compare the two models, and one inappropriate way.
    5. Explain what happens to bias, variance, training error, and test error as we increase the flexibility or complexity of a model.
    6. Compare and contrast Lasso and Ridge regression.
    7. Explain what happens to bias and variance as we change factors such as sample size and number of parameters.
    8. State the cost function for Ridge and for Lasso.
    9. Build a test error curve for Ridge regression, varying the alpha parameter, to determine the optimal level of regularization (see the sketch after this list).
    10. Build and interpret learning curves for two learning algorithms, one that is overfit (high variance, low bias) and one that is underfit (low variance, high bias).
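
A minimal sketch of the Ridge test error curve (item 9 above), using cross-validated MSE on simulated data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

np.random.seed(0)
X = np.random.rand(200, 10)
y = X @ np.arange(10) + np.random.normal(0, 1.0, 200)

# Test error curve over the regularization strength alpha
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    mse = -cross_val_score(Ridge(alpha=alpha), X, y,
                           cv=5, scoring='neg_mean_squared_error').mean()
    print(alpha, mse)  # pick the alpha with the lowest cross-validated MSE
```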

  18. Logistic Regression
    1. Place logistic regression in the taxonomy of ML algorithms.
    2. Fit and interpret a logistic regression model in scikit-learn (see the sketch after this list).
    3. Interpret the coefficients of logistic regression using odds ratios.
    4. Explain ROC curves.
    5. Explain the key differences and similarities between logistic and linear regression.
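
A minimal scikit-learn sketch (items 2 and 3 above), using the built-in breast cancer dataset:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)
print(clf.score(X_test, y_test))       # accuracy on held-out data
odds_ratios = np.exp(clf.coef_[0])     # e^beta: multiplicative change in odds
print(odds_ratios[:5])                 # per one-unit increase in each feature
```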

  19. Gradient Descent
    1. Identify and justify use cases for, and failure modes of, gradient descent.
    2. Write pseudocode for the gradient descent and stochastic gradient descent algorithms (see the sketch after this list).
    3. Compare and contrast batch and stochastic gradient descent: the algorithms, costs, and benefits.
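
A minimal batch gradient descent sketch (item 2 above), minimizing the MSE of linear regression on simulated data:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the mean squared error of linear regression."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ beta - y)  # gradient of MSE w.r.t. beta
        beta -= lr * grad                        # step downhill
    return beta

np.random.seed(0)
X = np.random.rand(100, 3)
y = X @ np.array([1.0, -2.0, 3.0])
print(gradient_descent(X, y))  # ~ [1, -2, 3]
```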

  20. Decision Trees
    1. Thoroughly explain the construction of a decision tree (classification or regression), including selecting an impurity measure (gini, entropy, variance).
    2. Recognize overfitting, and explain pre- and post-pruning and why it helps.
    3. Pick the "best" tree via cross-validation for a given dataset.
    4. Discuss the pros and cons of decision trees.

  21. k-Nearest Neighbors (kNN)
    1. Write pseudocode for the kNN algorithm from scratch (see the sketch after this list).
    2. State the differences between kNN regression and classification.
    3. Discuss the pros and cons of kNN.
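
A minimal from-scratch kNN classifier (item 1 above), on toy data:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

np.random.seed(0)
X_train = np.random.rand(100, 2)
y_train = (X_train[:, 0] > 0.5).astype(int)  # toy labels
print(knn_predict(X_train, y_train, np.array([0.9, 0.1]), k=5))  # -> 1
```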

  22. Random Forest
    1. Thoroughly explain the construction of a random forest (classification or regression).
    2. Explain the relationship and difference between random forest and bagging.
    3. Explain why random forests are more accurate than a single decision tree.
    4. Explain how to compute feature importances from a random forest algorithmically.
    5. Explain how OOB (out-of-bag) error is calculated and what it is an estimate of.

  23. Boosted Trees
    1. Define boosting in your own words.
    2. Interpret boosting output.
    3. List the advantages and disadvantages of boosting.
    4. Compare and contrast boosting with other ensemble methods.
    5. Explain each of the tuning parameters and specifically how they affect the model.
    6. Learn, tune, and score a model using scikit-learn's boosting classes (see the sketch after this list).
    7. Implement AdaBoost.
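
A minimal sketch of tuning scikit-learn's gradient boosting classifier (item 6 above); AdaBoostClassifier could be swapped in the same way:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune the main knobs: number of trees, learning rate, tree depth
grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                    param_grid={'n_estimators': [100, 300],
                                'learning_rate': [0.05, 0.1],
                                'max_depth': [2, 3]},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```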

  24. Support Vector Machines (SVM)
    1. Compute a hyperplane as a decision boundary in SVC.
    2. Explain what a support vector is in plain English.
    3. Recognize that preprocessing, specifically putting all predictors on the same scale, is a necessary step.
    4. Explain SVC using the hyperparameter C.
    5. Tune an SVM with an RBF kernel using both hyperparameters, C and gamma (see the sketch after this list).
    6. Tune an SVM with a polynomial kernel using both hyperparameters, C and degree.
    7. Describe why, generally speaking, an SVM with an RBF kernel is more likely to perform well on "tall" data as opposed to "wide" data.
    8. For SVMs with an RBF kernel, state what happens to bias and variance as we increase the hyperparameter C, and what happens as we increase the hyperparameter gamma.
    9. State how the "one-vs-one" and "one-vs-rest" approaches for multi-class problems are implemented.
    10. Describe the kernel trick and how it lets us compute as if in a high-dimensional space.
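
A minimal sketch of items 3 and 5 above: scale the features, then grid-search C and gamma for an RBF-kernel SVC:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale first (SVMs are sensitive to feature scale), then tune C and gamma
pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC(kernel='rbf'))])
grid = GridSearchCV(pipe,
                    param_grid={'svc__C': [0.1, 1, 10],
                                'svc__gamma': [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```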

  25. Profit Curves
    1. Describe the issues with imbalanced classes.
    2. Explain the profit curve method for thresholding.
    3. Explain sampling methods, give examples, and explain how they deal with imbalanced classes.
    4. Explain cost-sensitive learning and how it deals with imbalanced classes.

  26. Web Scraping
    1. Compare and contrast SQL and NoSQL.
    2. Complete basic operations with MongoDB.
    3. Explain the basic concepts of HTML.
    4. Write Python code to pull an element out of a web page.
    5. Fetch data from an existing API.

  27. Naive Bayes
    1. Derive the Naive Bayes algorithm and discuss its assumptions.
    2. Contrast generative and discriminative models.
    3. Discuss the pros and cons of Naive Bayes.

  28. NLP
    1. Identify and explain ways of featurizing text.
    2. List and explain distance metrics used in document classification.
    3. Featurize a text corpus in Python using nltk and scikit-learn (see the sketch after this list).
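
A minimal featurization sketch (item 3 above) using scikit-learn's TF-IDF vectorizer on a toy corpus; nltk would typically supply tokenization or stemming on top of this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog ate my homework",
          "the cat chased the dog"]

# TF-IDF bag-of-words featurization
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)        # sparse document-term matrix
print(vectorizer.get_feature_names())       # one column per term
print(X.toarray().round(2))
```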

  29. Clustering
    1. List the characteristics of a dataset necessary to perform k-means.
    2. Detail the k-means algorithm in steps, commenting on convergence or lack thereof.
    3. Use the elbow method to determine k and evaluate the choice (see the sketch after this list).
    4. Interpret a silhouette plot.
    5. Interpret clusters by examining cluster centers and exploring the data within each cluster (dataframe inspection, plotting, decision trees for cluster membership).
    6. Build and interpret a dendrogram using hierarchical clustering.
    7. Compare and contrast k-means and hierarchical clustering.
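
A minimal elbow-method sketch (item 3 above) on simulated blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Elbow method: within-cluster sum of squares (inertia) vs. k
for k in range(1, 9):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # look for the "elbow" where gains flatten
```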

  30. Churn Case Study
    1. List and explain the steps in CRISP-DM (Cross-Industry Standard Process for Data Mining).
    2. Perform EDA standards on a case study, including visualizations.
    3. Discuss the ramifications of deleting missing values when they are MAR (missing at random), MCAR (missing completely at random), or MNAR (missing not at random).
    4. Explain at least two different methods of imputing missing values, and list the pros and cons of each method.
    5. Explain when dropping rows is okay and when dropping features is okay.
    6. Perform the feature engineering process.
    7. Identify target leakage and explain why it happens.
    8. State an appropriate business goal and evaluation metric.

  31. Dimensionality Reduction
    1. List reasons for reducing dimensions.
    2. Describe how the principal components are constructed in PCA.
    3. Interpret the principal components of PCA.
    4. Determine how many principal components to keep.
    5. Describe the relationship between PCA and SVD.
    6. Compute and interpret PCA using sklearn (see the sketch after this list).
    7. Memorize the eigenvalue equation.
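
A minimal sklearn PCA sketch (item 6 above), on the built-in breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=5).fit(X_scaled)
print(pca.explained_variance_ratio_)      # variance captured by each component
print(pca.components_[0][:5])             # loadings of the first component
X_reduced = pca.transform(X_scaled)       # data projected onto 5 components
```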

  32. NMF
    1. Write down and explain the NMF equation.
    2. Compare and contrast NMF, SVD, PCA, and k-means.
    3. Implement the Alternating Least Squares algorithm for NMF.
    4. Find and interpret latent topics in a corpus of documents with NMF (see the sketch after this list).
    5. Explain how to interpret the H matrix and the W matrix.
    6. Explain regularization in the context of NMF.
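
A minimal topic-modeling sketch (item 4 above); note that fetch_20newsgroups downloads the corpus on first use:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data[:2000]
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
V = tfidf.fit_transform(docs)                 # documents x terms

nmf = NMF(n_components=10, random_state=0)
W = nmf.fit_transform(V)                      # documents x topics
H = nmf.components_                           # topics x terms

terms = tfidf.get_feature_names()
for topic in H:                               # top words characterize each topic
    print([terms[i] for i in topic.argsort()[-8:][::-1]])
```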

  33. Recommender Systems
    1. Survey approaches to recommenders, their pros & cons, and when each is likely to be best.
    2. Describe the cold start problem and know how it affects different recommendation strategies.
    3. Explain either the collaborative filtering algorithm or the matrix factorization recommender algorithm.
    4. Discuss recommender evaluation.
    5. Discuss performance concerns for recommenders.

  34. Graphs
    1. Define a graph and discuss its implementation.
    2. List common applications of graph models.
    3. Discuss search algorithms and their applications.
    4. Explain the various ways of measuring the importance of a node.
    5. Explain methods and applications of clustering on a graph.
    6. Use an appropriate package to build a graph data structure in Python and execute common algorithms (shortest path, connected components, …).

  35. Cloud Computing
    1. Scope and configure a data science environment on AWS.
    2. Protect AWS resources against unauthorized access.
    3. Manage AWS resources using awscli, ssh, scp, or boto3.
    4. Monitor and control costs incurred on AWS.

  36. Parallel Computing
    1. Define and contrast processes vs. threads.
    2. Define and contrast parallelism and concurrency.
    3. Recognize problems that require parallelism or concurrency.
    4. Implement parallel and concurrent solutions.
    5. Instrument approaches to see the benefit of threading/parallelism.

  37. MapReduce
    1. Explain the types of problems that benefit from MapReduce.
    2. Describe MapReduce and how it relates to Hadoop.
    3. Explain how to select the number of mappers and reducers.
    4. Describe the role of keys in MapReduce.
    5. Perform MapReduce in Python using mrjob (see the sketch after this list).
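
A minimal mrjob sketch (item 5 above): the canonical word count, saved as, say, word_count.py:

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # All values for the same key arrive at the same reducer
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

# Run locally with: python word_count.py input.txt
```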

  38. Time Series
    1. Recognize when time series analysis could be applied.
    2. Define key time series concepts.
    3. Determine the structure of a time series using graphical tools.
    4. Compute a forecast using the Box-Jenkins methodology.
    5. Evaluate models/forecasts using cross-validation and statistical tests.
    6. Engineer features to handle seasonal, calendar, and periodic components.
    7. Explain the taxonomy of exponential smoothing using the ETS framework.

  39. Spark
    1. Configure a machine to use Spark effectively.
    2. Describe the differences and similarities between MapReduce and Spark.
    3. Get data into Spark for processing.
    4. Describe lazy evaluation in the context of Spark.
    5. Cache RDDs effectively to improve performance.
    6. Use Spark to compute basic statistics.
    7. Know the differences between Spark abstractions: RDD, DataFrame, DAG.
    8. Use MLlib.

  40. SQL in Spark
    1. Identify what distinguishes a Spark DataFrame from an RDD.
    2. Explain how to create a Spark DataFrame.
    3. Query a DataFrame with SQL (see the sketch after this list).
    4. Transform a DataFrame with DataFrame methods.
    5. Describe the challenges and requirements of saving datasets with schemas.
    6. Use user-defined functions.
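
A minimal sketch of items 2-4 above, assuming a Spark 2.x SparkSession:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sql-demo').getOrCreate()

df = spark.createDataFrame(
    [('alice', 34), ('bob', 45), ('carol', 29)],
    schema=['name', 'age'])

df.createOrReplaceTempView('people')          # register the DataFrame as a table
spark.sql('SELECT name FROM people WHERE age > 30').show()
df.filter(df.age > 30).select('name').show()  # equivalent DataFrame-method form
```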

  41. Data Products
    1. Explain the REST architecture/API.
    2. Write a basic Flask API (see the sketch after this list).
    3. Describe web architecture at a high level.
    4. Know the role of JavaScript in a web application.
    5. Know how to use developer tools to inspect an application.
    6. Write a basic Flask web application.
    7. Describe the difference between online and offline computation.
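
A minimal Flask API sketch (item 2 above); the predict function here is a hypothetical stub standing in for a real trained model:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    # Hypothetical model stub; a real app would load a trained model
    return sum(features) > 1.0

@app.route('/predict', methods=['POST'])
def predict_endpoint():
    data = request.get_json()                 # e.g. {"features": [0.2, 0.9]}
    label = predict(data['features'])
    return jsonify({'prediction': int(label)})

if __name__ == '__main__':
    app.run(port=5000)
```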

  42. Fraud Case Study
    1. Build an MVP (minimum viable product) quickly.
    2. Build a dashboard.
    3. Build a system to take in online data from a stream.
    4. Build a production-quality product.

  43. Whiteboarding
    1. Explain the meaning of Big-O notation.
    2. Analyze the runtime of code.
    3. Solve whiteboarding interview questions.
    4. Apply different techniques to a whiteboarding interview problem.

  44. Business Analytics
    1. Explain funnel metrics and their applications.
    2. Identify red flags in a set of funnel metrics.
    3. Identify and discuss appropriate use cases for cohort analysis.
    4. Identify and explain the limits of data analysis.
    5. Given an open-ended question, identify the business goal, metrics, and relevant data science solution.
    6. Identify excessive or improper use of data analysis.
    7. Explain how data science is used in industry.
    8. Understand the range of business problems where A/B testing applies.
Written on January 15, 2018