Data science project organization
What’s the best way to organize data science projects? I use the following structure.
.
├── data            : processed data, versioned
├── docs            : communicable notes for the project
│   └── blog        : linked to _posts
├── raw             : files in this directory are shared as necessary
│   ├── data        : original data, treated as immutable
│   ├── docs        : "lab notes" throughout the course of the project, ordered by date
│   └── resources   : all resources pertaining to the project
├── results         : output of analysis or models
└── src             : all code for the project
    └── scripts     : scripts used in experiments, usually a staging ground for integration
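One way to set up this layout for a new project is a small scaffolding script. The following is a minimal sketch, not part of the original workflow: the `LAYOUT` list and the `scaffold` function are hypothetical names, and the directory names are copied from the structure above.

```python
from pathlib import Path

# Leaf directories of the layout above, relative to the project root.
# Parent directories (docs, raw, src) are created implicitly.
LAYOUT = [
    "data",
    "docs/blog",
    "raw/data",
    "raw/docs",
    "raw/resources",
    "results",
    "src/scripts",
]


def scaffold(root):
    """Create the project skeleton under root and return the created paths.

    Directories that already exist are left untouched.
    """
    root = Path(root)
    created = []
    for rel in LAYOUT:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `scaffold("my-project")` (a hypothetical project name) creates the full tree. Because of `exist_ok=True`, running it again on an existing project does not raise an error or delete anything.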
- Analysis should form a DAG (directed acyclic graph)
- Raw data must be immutable
- Processed data should be versioned
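The three principles above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed implementation: the step names in `STEPS` are hypothetical, `freeze` enforces immutability only through file permissions, and `versioned_name` uses a content hash as one possible versioning scheme (a tool such as Git or DVC could serve the same purpose). `graphlib` requires Python 3.9 or later.

```python
import hashlib
import stat
from graphlib import TopologicalSorter
from pathlib import Path


def freeze(path):
    """Clear the write permission bits so a raw file is not edited by accident."""
    path = Path(path)
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def versioned_name(stem, content, suffix=".csv"):
    """Name a processed file after a short SHA-256 hash of its bytes."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{stem}-{digest}{suffix}"


# Hypothetical analysis steps: each key maps to the steps it depends on.
STEPS = {
    "clean": {"load_raw"},
    "features": {"clean"},
    "model": {"features"},
    "report": {"model", "clean"},
}


def run_order(steps):
    """Return the steps in an order that respects every dependency.

    TopologicalSorter raises graphlib.CycleError if the steps are not a DAG.
    """
    return list(TopologicalSorter(steps).static_order())
```

Here `run_order(STEPS)` always places `load_raw` before `clean` and `report` last. If a later step were made a dependency of an earlier one, the cycle would be reported instead of silently producing results that depend on stale outputs.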
Written on April 4, 2018