Preface
Constructing a data pipeline is a fundamental part of any deep learning project. Introductory books often load the entire dataset into memory at once; that works for small datasets like MNIST, but it is infeasible for millions of images. In this article we demonstrate how to load data only when it is needed.
Prerequisites
Before building the pipeline we need some advanced Python knowledge. Lists are common objects in Python, but why can we access their items with `[]`? The secret is the `__getitem__()` method of a class. By implementing `__getitem__()`, a class can return a value for a given index.
Suppose we have a class `Demo` with three attributes `a`, `b`, and `c`, each mapped to a string, and we want to access them using indices 0-2. We can implement `__getitem__()` to do exactly that.
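A minimal sketch of such a class (the string values here are illustrative placeholders):

```python
class Demo:
    def __init__(self):
        self.a = "apple"
        self.b = "banana"
        self.c = "cherry"

    def __getitem__(self, index):
        # Map integer indices 0-2 to the three attributes.
        if index == 0:
            return self.a
        if index == 1:
            return self.b
        if index == 2:
            return self.c
        raise IndexError(f"index {index} out of range")
```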
We can now instantiate `Demo` and fetch values with `[]`.
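Continuing the sketch above:

```python
demo = Demo()
print(demo[0])  # apple
print(demo[1])  # banana
print(demo[2])  # cherry
```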
This allows us to record image paths and read files from disk only when a specific index is requested.
PyTorch data pipeline basics
As mentioned, an object built around `__getitem__()` is called a Map Style Dataset [1]. For PyTorch, such a dataset must:
- Inherit from `torch.utils.data.Dataset`
- Implement `__getitem__()`
The first requirement gives the class useful properties and methods inherited from the base class; the second enables index-based access. If custom sampling is required, also implement `__len__()` so that `DataLoader` and `Sampler` can query the dataset length. Implementing `__getitems__()` can further speed up batch reading, as sketched below.
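A minimal sketch of these optional hooks (the `items` list stands in for real data; `__getitems__()` is a batched fast path recognized by recent PyTorch versions, and the simple loop body here would normally be replaced by vectorized I/O):

```python
from torch.utils.data import Dataset


class DemoDataset(Dataset):
    def __init__(self, items):
        self.items = items

    def __len__(self):
        # Lets DataLoader and Sampler know how many indices exist.
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]

    def __getitems__(self, indices):
        # Optional batched fetch: one call per batch instead of
        # one __getitem__ call per index.
        return [self.items[i] for i in indices]
```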
Simple example – build an image loading pipeline
Assume images are stored as `jpg` files in an `images` folder, with labels in a `label` folder as text files of the same base name. We want to load and preprocess data only when it is needed, and a Map Style Dataset can achieve this.
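A sketch under a few assumptions not spelled out above: the images are RGB, each label file holds a single integer class id, and the 224×224 resize is only illustrative. The key point is that `__init__()` records paths, while files are opened only inside `__getitem__()`:

```python
import os

import torch
import torchvision.transforms as T
from PIL import Image
from torch.utils.data import Dataset


class ImageDataset(Dataset):
    def __init__(self, image_dir="images", label_dir="label"):
        # Record file paths only; nothing is read from disk yet.
        self.image_paths = sorted(
            os.path.join(image_dir, f)
            for f in os.listdir(image_dir)
            if f.endswith(".jpg")
        )
        self.label_dir = label_dir
        self.transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        # Files are opened only when this index is requested.
        image_path = self.image_paths[index]
        image = self.transform(Image.open(image_path).convert("RGB"))
        name = os.path.splitext(os.path.basename(image_path))[0]
        with open(os.path.join(self.label_dir, name + ".txt")) as f:
            label = int(f.read().strip())
        return image, torch.tensor(label)
```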
Instantiate `ImageDataset` and wrap it with `torch.utils.data.DataLoader`, specifying `batch_size`, `num_workers`, etc., to complete the pipeline.
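For example (the `batch_size` and `num_workers` values are illustrative):

```python
from torch.utils.data import DataLoader

dataset = ImageDataset()
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for images, labels in loader:
    # images: float tensor of shape (batch, 3, 224, 224)
    # labels: int tensor of shape (batch,)
    pass  # training step would go here
```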
Summary
We introduced how to build a Map Style Dataset that loads data on demand. PyTorch also provides `IterableDataset` [2] for other scenarios, which we may cover in a future article.