Why Git Alone Cannot Version Data
The limits of Git for gigabyte-sized datasets.
Code Is Tiny, Data Is Huge
Git was built for source code: small text files that change line by line. Your datasets are often gigabytes of binary data, a totally different beast.
Git Stores Every Version
Git keeps a full snapshot of each file version forever. Commit a 2 GB file ten times and your repo can balloon toward 20 GB of history. 😬
All lessons in this course
- Why Git Alone Cannot Version Data
- Initialize DVC and Track a Dataset
- Push Data to Remote Storage
- Roll Back to an Earlier Dataset