EPIC Dataset: Complete Preprocessing & Analysis Guide
Warning: This is a living document. Last updated: April 20, 2026
Table of Contents
- Quick Start (5 minutes)
- The Big Picture
- Raw Data Format
- Processing Pipeline
- Output Specification
- Data Structures
- Practical Usage
- Validation & Quality Control
- Troubleshooting
- File Manifest
Quick Start
In a hurry? Here’s what you need to know in 5 minutes:
What is this dataset?
We’ve processed 260 C. elegans embryos imaged with EPIC (Expression Patterns in Caenorhabditis) fluorescence microscopy. The output is spatio-temporal graph data suited to training graph neural networks (GNNs).
The loop in 10 lines:
import numpy as np
# Load one embryo
npz = np.load("dataset/processed/by_embryo/CD011605_5a_bright.npz", allow_pickle=True)
# Get tensors
X = npz["X"] # (688 cells, 5 features, 210 timepoints)
alive_mask = npz["alive_mask"] # Boolean: which cells are alive at time t
edges_src, edges_dst = npz["edge_src"], npz["edge_dst"]
idx_to_cell = npz["idx_to_cell"] # Cell name lookup
# Filter to living cells at time t=100
t = 100
alive = np.where(alive_mask[:, t])[0]
X_t = X[alive, :, t] # (M active cells, 5 features)
print(f"At time {t}: {len(alive)} cells alive")
Where to go next:
- Want working examples? → See Practical Usage
- Building a GNN? → See Data Structures
- Something broke? → See Troubleshooting
The Big Picture
Why this dataset matters
C. elegans embryonic development is one of the best-characterized biological systems:
- Perfect model system: The complete cell lineage is known (959 somatic cells in the adult hermaphrodite, 558 cells at hatching)
- Observable in real-time: Fluorescence microscopy captures cell division, migration, and differentiation
- Rich structure: Both spatial (proximity-based) and temporal (lineage-based) relationships
- GNN-friendly: Natural graph representation (cells as nodes, adjacencies as edges)
What EPIC gives us
Microscopy Videos (260 embryos)
↓ [Extract cell positions over time]
Raw CSV Tables (260 files, ~100k rows each)
↓ [Preprocess: normalize, verify, build graphs]
Tensor Archives (260 compressed NPZ files)
↓ [Load in PyTorch/TensorFlow]
Spatio-Temporal Graphs (Ready for GNNs)
Key facts at a glance
| Metric | Value | Notes |
|---|---|---|
| Embryos | 260 | Complete datasets, quality-controlled |
| Cells per embryo | ~688 (varies) | Variable number, from ~100 to ~900 |
| Timepoints per embryo | ~210 (varies) | Embryo development: ~13 hours 30 min per embryo |
| Features per cell | 5 | Position (x, y, z) + morphology (size, fluorescence) |
| Spatial edges | ~45k per embryo | Proximity-based (distance < 20 μm) |
| Lineage edges | ~690 per embryo (≈ N − 1) | Parent→daughter divisions |
| Total storage | ~180 MiB | All 260 embryos compressed |
| Per-embryo size | ~0.7 MiB | Highly compressed sparse graphs |
Raw Data Format: What Comes In
Source: EPIC CSV Files
Each embryo is stored as a comma-separated table in dataset/raw/*.csv.
File count: 260 unique recordings
File size: 1-5 MiB each (uncompressed)
Encoding: UTF-8
Column Breakdown
Here’s what each column means:
| Column | Type | Unit | Used? | What it tells us |
|---|---|---|---|---|
| cellTime | str | - | no | Row ID (artifact; removed) |
| cell | str | - | yes | Cell name in C. elegans nomenclature (e.g., “ABal”, “Zrp1aaa”) |
| time | int | frame | yes | Timepoint (1-indexed, typically 1–210) |
| none | int | - | no | Unknown field (legacy; ignored) |
| global | int | - | no | Unused metadata offset |
| local | int | - | no | Unused metadata offset |
| blot | float | AU | yes | Fluorescence intensity: brightfield/dark-field marker. Primary cell identity. Range: ~100 to 10M arbitrary units |
| cross | float | - | no | Cross-correlation metric (unused) |
| z | float | μm | yes | Depth coordinate (focal plane offset relative to reference). Range: 0–200 μm |
| x | float | px | yes | Horizontal position (X-axis, pixels). Range: 0–512 px |
| y | float | px | yes | Vertical position (Y-axis, pixels). Range: 0–512 px |
| size | float | AU | yes | Cell volume / morphological size (estimated counts). Range: ~10–5000 AU |
| gweight | float | - | no | Image intensity weight (unused) |
Example Raw Record
cellTime,cell,time,none,global,local,blot,cross,z,x,y,size,gweight
ABal:25,ABal,25,22722,-2278,10,815432,-2278,19.2,166,257,74,815432
This means: Cell named ABal at timepoint 25 is located at position (166 px, 257 px, 19.2 μm) with fluorescence intensity 815432 AU and volume ~74 AU.
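To make the column mapping concrete, here is a minimal sketch that parses the sample record above with the standard library and pulls out the five columns the pipeline keeps. The `FEATURES` tuple is a name introduced here for illustration; the order follows the feature tensor described later in this guide.

```python
import csv
import io

# Header and sample row quoted in the text above.
raw = (
    "cellTime,cell,time,none,global,local,blot,cross,z,x,y,size,gweight\n"
    "ABal:25,ABal,25,22722,-2278,10,815432,-2278,19.2,166,257,74,815432\n"
)

# Columns kept by the pipeline, in feature-tensor order (illustrative name).
FEATURES = ("x", "y", "z", "size", "blot")

record = next(csv.DictReader(io.StringIO(raw)))
cell = record["cell"]
time = int(record["time"])
features = [float(record[name]) for name in FEATURES]

print(cell, time, features)
# ABal 25 [166.0, 257.0, 19.2, 74.0, 815432.0]
```

The unused columns (`cellTime`, `none`, `global`, `local`, `cross`, `gweight`) are simply never read.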
Key observations
- One row = one cell at one timepoint
- Sparse data: Cells only appear after birth. Cell “ABa” might start at timepoint 5, while “AB” starts at timepoint 1
- Lineage encoding: Cell names nest hierarchically. “ABal” is a daughter of “ABa” (remove last letter = parent)
- Variable embryo length: Some embryos have 180 timepoints, others 250+
- No missing values: Every row has all columns
Processing Pipeline: The Transformation
System Overview
Raw CSV (1 embryo)
↓ [build_local_index]
→ EpicIndex: cell→idx mapping, timeframe [t0, T]
↓ [populate_X & alive_mask]
→ Node features X[N, 5, T] + birth tracking
↓ [build_spatial_edges]
→ Proximity graph (undirected): cells < 20 μm apart
↓ [build_lineage_edges]
→ Division graph (directed): parent→daughter
↓ [save_to_npz]
Compressed NPZ archive (1 embryo, ~0.7 MiB)
Step 1: Index Building (build_local_index)
Purpose: Create a consistent cell name → index mapping for this embryo.
Input: Raw CSV DataFrame
Output:
- cell_to_idx (dict[str, int]): Maps “ABal” → 42 (for example)
- t0 (int): First timepoint (usually 1)
- T (int): Total number of timepoints
Algorithm:
cells = sorted(df["cell"].unique()) # Alphabetical order
cell_to_idx = {cell: idx for idx, cell in enumerate(cells)}
t0 = int(df["time"].min())
T = int(df["time"].max() - t0) + 1
Why alphabetical? Reproducibility. Same name → same index every time.
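As a quick illustration of the indexing step, the following sketch runs the same logic on a handful of made-up `(cell, time)` pairs, without pandas. The toy rows are hypothetical and only stand in for the `cell` and `time` CSV columns.

```python
# Toy (cell, time) rows standing in for the "cell" and "time" CSV columns.
rows = [("ABa", 5), ("AB", 1), ("ABp", 5), ("AB", 2), ("ABa", 6)]

cells = sorted({cell for cell, _ in rows})   # alphabetical, duplicates removed
cell_to_idx = {cell: idx for idx, cell in enumerate(cells)}

times = [t for _, t in rows]
t0 = min(times)              # first frame present in the table
T = max(times) - t0 + 1      # number of frames spanned, inclusive

print(cell_to_idx, t0, T)
# {'AB': 0, 'ABa': 1, 'ABp': 2} 1 6
```

Because the names are sorted before numbering, the row order of the input table has no effect on the resulting indices.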
Step 2: Feature Population (populate_X & populate_alive_mask)
Purpose: Build the main feature tensor X[N, d, T] and birth-tracking mask.
Input:
- cell_to_idx from Step 1
- Raw CSV with columns: x, y, z, size, blot
Output:
- X[N, 5, T]: Node features (float32)
- alive_mask[N, T]: Boolean: is cell alive at time t?
Algorithm (pseudocode):
N = len(cell_to_idx)
X = np.zeros((N, 5, T), dtype=np.float32)
alive_mask = np.zeros((N, T), dtype=bool)
for row in df.itertuples():
    c_idx = cell_to_idx[row.cell]
    t_idx = row.time - t0  # Convert to 0-indexed
    # Set features
    X[c_idx, :, t_idx] = [row.x, row.y, row.z, row.size, row.blot]
    # Mark as alive
    alive_mask[c_idx, t_idx] = True
# Cells not appearing in CSV remain zeros and False
Result:
- Absent cells (not yet born, or no longer tracked after dividing): X[i, :, t] == [0,0,0,0,0] and alive_mask[i, t] == False
- Alive cells: Features populated, alive_mask[i, t] == True
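The row-by-row loop is easy to read but slow on ~100k rows. One possible alternative, shown here only as a sketch on made-up values, fills both arrays with NumPy index arrays in a single assignment. It assumes each `(cell, time)` pair occurs at most once, as in the raw format described above.

```python
import numpy as np

# Toy rows: cell index, frame, and (x, y, z, size, blot); values are made up.
c_idx = np.array([0, 0, 1, 2])
frames = np.array([1, 2, 3, 3])
feats = np.array(
    [[10, 20, 1.0, 50, 900],
     [11, 21, 1.5, 52, 910],
     [12, 18, 2.0, 30, 400],
     [9, 25, 2.5, 31, 450]],
    dtype=np.float32,
)

N = 3
t0 = int(frames.min())
T = int(frames.max()) - t0 + 1
t_idx = frames - t0          # 0-indexed frame per row

X = np.zeros((N, 5, T), dtype=np.float32)
alive_mask = np.zeros((N, T), dtype=bool)

# Two index arrays separated by a slice address one (5,) feature vector per row.
X[c_idx, :, t_idx] = feats
alive_mask[c_idx, t_idx] = True

print(X[1, :, 2], int(alive_mask.sum()))
```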
Step 3: Spatial Edges (build_spatial_edges)
Purpose: Connect cells that are spatially close (proximity graph).
Input: X[N, 5, T] (node positions) and alive_mask[N, T]
Output:
edge_src, edge_dst, edge_t: Sparse edge lists
Algorithm:
from scipy.spatial.distance import cdist

edges = []
for t in range(T):
    # Get living cell positions at time t
    alive_idx = np.where(alive_mask[:, t])[0]
    # Raw coordinates are used as-is (x, y in px; z in μm), so the
    # threshold below is in mixed units unless pixels are rescaled first
    positions = X[alive_idx, :3, t]  # (x, y, z)
    # Compute pairwise distances
    dist = cdist(positions, positions, metric='euclidean')
    # Threshold: distance < 20 (dist > 0 drops self-pairs)
    src, dst = np.where((dist > 0) & (dist < 20))
    for s, d in zip(src, dst):
        edges.append((alive_idx[s], alive_idx[d], t))
edge_src, edge_dst, edge_t = zip(*edges)
Key facts:
- Undirected: If (A, B) is an edge, so is (B, A). Counts as 2 edges.
- Time-varying: Edges change because cells move and new cells are born
- ~45k edges per embryo: Sparse but non-trivial graph density
- Threshold 20 μm: Based on typical cell contact distance in C. elegans
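To see why an undirected contact counts as two edges, the thresholding step can be reproduced for a single timepoint with plain NumPy broadcasting. This is a sketch on four made-up positions, not the pipeline's own code; `alive_idx` here plays the role of the global node ids.

```python
import numpy as np

# Four made-up cell positions at one timepoint (same units on every axis).
positions = np.array(
    [[0.0, 0.0, 0.0],
     [10.0, 0.0, 0.0],
     [0.0, 15.0, 0.0],
     [100.0, 100.0, 100.0]]
)
alive_idx = np.array([2, 5, 7, 9])   # global node ids of these cells
threshold = 20.0

# Pairwise Euclidean distances via broadcasting: shape (M, M).
diff = positions[:, None, :] - positions[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Same mask as in the algorithm above: drop self-pairs, keep close pairs.
src, dst = np.where((dist > 0) & (dist < threshold))
edges = list(zip(alive_idx[src].tolist(), alive_idx[dst].tolist()))

print(edges)
# [(2, 5), (2, 7), (5, 2), (5, 7), (7, 2), (7, 5)]
```

Three close pairs produce six entries, and the distant cell (id 9) ends up with no edges at all.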
Step 4: Lineage Edges (build_lineage_edges)
Purpose: Connect parent cells to daughter cells (biological divisions).
Input: cell_to_idx (cell names encode lineage) and alive_mask (birth times)
Output:
edge_src_lineage, edge_dst_lineage, edge_t_lineage
Algorithm (simplified):
The C. elegans naming convention encodes division:
- “AB” divides → “ABa” and “ABp”
- “ABa” divides → “ABal” and “ABar”
- etc.
def get_parent_cell(cell_name):
    """Returns parent cell name (remove last letter)."""
    if len(cell_name) <= 1:
        return None  # Root cell (no parent)
    return cell_name[:-1]

edges_lineage = []
for cell, idx in cell_to_idx.items():
    parent = get_parent_cell(cell)
    if parent and parent in cell_to_idx:
        parent_idx = cell_to_idx[parent]
        # When does the daughter appear? First frame where alive_mask is True
        t_birth = np.where(alive_mask[idx, :])[0]
        if len(t_birth) > 0:
            t = t_birth[0]
            edges_lineage.append((parent_idx, idx, t))
Key facts:
- Directed: Always parent → daughter (causal)
- ~690 edges per embryo: One per cell whose parent is in the index (≈ N − 1)
- One edge per cell: Each cell has ≤1 parent (root cells such as AB, whose name-derived parent is not in the index, have none)
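A small worked example helps show what the name-based rule does and does not capture. The sketch below applies it to seven real cell names with a made-up alive mask. Note that MS is biologically a daughter of EMS, but dropping the last letter of "MS" gives "M", so this rule produces no edge for it; whether the pipeline special-cases such founder cells is not stated in this guide.

```python
import numpy as np

idx_to_cell = ["AB", "ABa", "ABal", "ABar", "ABp", "EMS", "MS"]
cell_to_idx = {name: i for i, name in enumerate(idx_to_cell)}

# Made-up alive mask: row i is cell i, columns are frames 0..3.
alive_mask = np.array(
    [[1, 1, 0, 0],   # AB
     [0, 0, 1, 0],   # ABa
     [0, 0, 0, 1],   # ABal
     [0, 0, 0, 1],   # ABar
     [0, 0, 1, 1],   # ABp
     [1, 1, 1, 0],   # EMS
     [0, 0, 0, 1]],  # MS
    dtype=bool,
)

edges_lineage = []
for cell, idx in cell_to_idx.items():
    parent = cell[:-1] if len(cell) > 1 else None
    if parent in cell_to_idx:                       # unknown parents are skipped
        t_birth = int(np.argmax(alive_mask[idx]))   # first True frame
        edges_lineage.append((cell_to_idx[parent], idx, t_birth))

print(edges_lineage)
# [(0, 1, 2), (1, 2, 3), (1, 3, 3), (0, 4, 2)]
```

Four of the seven cells get a parent edge; AB, EMS and MS are left without one under this rule.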
Output Specification: What Goes Out
File Format: NPZ Archive
Each processed embryo is saved as a .npz file (NumPy compressed archive).
Location: dataset/processed/by_embryo/*.npz
Compression: ZIP with NumPy arrays (highly compressed sparse graphs)
Size: ~0.7 MiB per embryo
Format: Binary (not human-readable; must load with np.load())
What’s inside?
npz = np.load("dataset/processed/by_embryo/CD011605_5a_bright.npz", allow_pickle=True)
print(npz.files) # What's inside?
# Output:
# ['X', 'alive_mask', 'edge_src', 'edge_dst', 'edge_t',
# 'idx_to_cell', 'metadata']
Array Specifications
1. X: Node Features
Shape: (N, 5, T)
Dtype: float32
Meaning: Temporal trajectory of each cell's 5 features
X[cell_idx, feature_idx, time_idx] = value
Feature indices:
0 → x (horizontal, pixels)
1 → y (vertical, pixels)
2 → z (depth, μm)
3 → size (morphology, AU)
4 → blot (fluorescence, AU)
Unborn cells: X[cell_idx, :, t] = [0, 0, 0, 0, 0]
Example:
X[42, :, 100] = [245.3, 128.7, 15.2, 156.0, 892451.0]
# Cell at index 42, at time 100: x=245 px, y=129 px, z=15.2 μm, ...
2. alive_mask: Birth & Survival Tracking
Shape: (N, T)
Dtype: bool
Meaning: True = cell is alive at this timepoint
alive_mask[cell_idx, time_idx] = True/False
Usage: Filter to only living cells
alive_at_t = np.where(alive_mask[:, t])[0]
X_active = X[alive_at_t, :, t]
Example:
alive_at_t100 = np.where(alive_mask[:, 100])[0] # (M,) indices
print(f"{len(alive_at_t100)} cells alive at t=100")
X_active = X[alive_at_t100, :, 100] # (M, 5)
3. edge_src, edge_dst, edge_t: Spatial Graph Edges
Shape: (E,) for each
Dtype: int32
Meaning: Source node, destination node, timepoint
edge_src[i], edge_dst[i], edge_t[i] = (source_id, dest_id, time)
→ Connects cell source_id to cell dest_id at timepoint time
→ Undirected: the reverse edge (dest_id, source_id, time) is also stored explicitly
Total edges: E ≈ 45k per embryo (varies)
Example:
# Find all edges at time t=50
at_t50 = np.where(edge_t == 50)[0]
srcs_50 = edge_src[at_t50]
dsts_50 = edge_dst[at_t50]
print(f"{len(at_t50)} edges at time 50")
# Build adjacency matrix at t=50
from scipy.sparse import coo_matrix
N = X.shape[0]  # number of cells
adj_50 = coo_matrix((np.ones(len(at_t50)), (srcs_50, dsts_50)),
                    shape=(N, N))
4. idx_to_cell: Cell Name Lookup
Shape: (N,)
Dtype: object (str)
Meaning: Maps node index back to cell name
idx_to_cell[cell_idx] = "ABal"
→ Cell at index cell_idx is named "ABal"
Reverse lookup: cell_to_idx = {v: k for k, v in enumerate(idx_to_cell)}
Example:
idx_to_cell = npz["idx_to_cell"]
print(idx_to_cell[42]) # "ABal"
# Reverse mapping
cell_to_idx = {cell: idx for idx, cell in enumerate(idx_to_cell)}
print(cell_to_idx["ABal"]) # 42
5. metadata: File Metadata
Shape: 1-D array (usually)
Dtype: object (dict)
Meaning: Provenance & processing info
Typical contents:
{
    't0': 1,
    'T': 210,
    'N': 688,
    'source_file': 'CD011605_5a_bright.csv',
    'processing_version': '1.0',
    'timestamp': '2026-04-20T12:34:56Z'
}
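How `save_to_npz` writes the archive is not shown in this guide. Assuming it relies on `np.savez_compressed`, a round trip on tiny stand-in arrays looks like the sketch below; the file name `toy_embryo.npz` and all values are hypothetical. It also shows why `allow_pickle=True` and `.item()` are needed for the object-typed entries.

```python
import os
import tempfile

import numpy as np

# Tiny stand-in arrays following the (N, 5, T) / (N, T) / (E,) layout.
X = np.zeros((2, 5, 3), dtype=np.float32)
alive_mask = np.ones((2, 3), dtype=bool)
edge_src = np.array([0, 1], dtype=np.int32)
edge_dst = np.array([1, 0], dtype=np.int32)
edge_t = np.array([0, 0], dtype=np.int32)
idx_to_cell = np.array(["AB", "ABa"], dtype=object)
metadata = {"t0": 1, "T": 3, "N": 2}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_embryo.npz")
    np.savez_compressed(path, X=X, alive_mask=alive_mask, edge_src=edge_src,
                        edge_dst=edge_dst, edge_t=edge_t,
                        idx_to_cell=idx_to_cell, metadata=metadata)
    # Object arrays (names, metadata dict) are pickled, hence allow_pickle=True.
    with np.load(path, allow_pickle=True) as npz:
        keys = sorted(npz.files)
        loaded_meta = npz["metadata"].item()   # unwrap the dict from its array
        loaded_names = list(npz["idx_to_cell"])

print(keys)
print(loaded_meta, loaded_names)
```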
Data Structures & Shapes
At a Glance
import numpy as np
# Example embryo dimensions
N = 688 # Number of cells
d = 5 # Features: x, y, z, size, blot
T = 210 # Timepoints
E = 56605 # Spatial edges
E_lineage = 687 # Lineage edges (≈ N-1, one per cell)
# Tensors
X # (688, 5, 210) float32: positions & morphology
alive_mask # (688, 210) bool: birth tracking
edge_src # (56605,) int32: spatial graph sources
edge_dst # (56605,) int32: spatial graph destinations
edge_t # (56605,) int32: timepoints
idx_to_cell # (688,) object: names
Typical Statistics
Mean cells alive per timepoint: ~520 (out of 688)
Mean edge entries per cell: ~65 (45k edges / 688 cells, summed over all timepoints)
Min cells in embryo: 88 (very early timepoint)
Max cells in embryo: 688 (late development)
Memory Footprint
Loaded in memory (single embryo):
X: 688 × 5 × 210 × 4 bytes ≈ 2.8 MiB
alive_mask: 688 × 210 × 1 byte ≈ 0.1 MiB
Sparse edges: 56605 × 3 × 4 bytes ≈ 0.6 MiB
Total per embryo: ~3.5 MiB (uncompressed)
Stored on disk: ~0.7 MiB (NPZ compressed)
Compression ratio: ~5:1
All 260 embryos: ~180 MiB (on disk)
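The byte counts can be checked directly from the shapes and dtypes. The sketch below prints each total in both MiB (1024² bytes) and MB (10⁶ bytes), since the two differ by about 5% and are easy to mix up.

```python
import numpy as np

N, d, T, E = 688, 5, 210, 56605
MIB = 1024 ** 2
MB = 10 ** 6

x_bytes = N * d * T * np.dtype(np.float32).itemsize    # feature tensor
mask_bytes = N * T * np.dtype(bool).itemsize           # alive mask
edge_bytes = E * 3 * np.dtype(np.int32).itemsize       # src, dst, t
total = x_bytes + mask_bytes + edge_bytes

for name, size in [("X", x_bytes), ("alive_mask", mask_bytes),
                   ("edges", edge_bytes), ("total", total)]:
    print(f"{name:>10}: {size / MIB:.2f} MiB = {size / MB:.2f} MB")
```

These figures cover only the arrays listed here; `idx_to_cell` and `metadata` add a small amount on top.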
Practical Usage
Loading Data
import numpy as np
# Load single embryo
npz = np.load("dataset/processed/by_embryo/CD011605_5a_bright.npz",
allow_pickle=True)
# Extract arrays
X = npz["X"] # (688, 5, 210)
alive_mask = npz["alive_mask"] # (688, 210)
edge_src = npz["edge_src"]
edge_dst = npz["edge_dst"]
edge_t = npz["edge_t"]
idx_to_cell = npz["idx_to_cell"]
metadata = npz["metadata"].item() # Convert numpy object to dict
print(f"Embryo: {metadata['source_file']}")
print(f" {metadata['N']} cells, {metadata['T']} timepoints")
Filtering to Living Cells
t = 100 # Query at timepoint 100
# Which cells are alive?
alive_idx = np.where(alive_mask[:, t])[0]
print(f"{len(alive_idx)} cells alive at t={t}")
# Get their features
X_alive = X[alive_idx, :, t] # (M, 5)
cell_names_alive = idx_to_cell[alive_idx]
print(f"Feature mean: {X_alive.mean(axis=0)}")
# Output: [245.2, 189.4, 12.1, 234.5, 890000.0]
Building an Adjacency Matrix
from scipy.sparse import coo_matrix
# Static adjacency at time t=50
t = 50
mask = (edge_t == t)
srcs = edge_src[mask]
dsts = edge_dst[mask]
# COO format (efficient for construction)
N = X.shape[0]  # number of cells
adj = coo_matrix((np.ones(len(srcs)), (srcs, dsts)), shape=(N, N))
# Convert to CSR (efficient for matrix ops)
adj_csr = adj.tocsr()
print(f"Adjacency at t={t}: {adj_csr.nnz} edges")
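Many GNN libraries expect a compact node set and a `(2, E)` edge array per graph (this is the `edge_index` layout used by PyTorch Geometric). As a sketch, the hypothetical helper `snapshot_graph` below extracts one timepoint and relabels the living cells to local ids `0..M-1`; the toy arrays are made up.

```python
import numpy as np

def snapshot_graph(X, alive_mask, edge_src, edge_dst, edge_t, t):
    """Return (features, edge_index, alive_idx) for one timepoint.

    edge_index uses local ids 0..M-1 that index into alive_idx.
    """
    alive_idx = np.where(alive_mask[:, t])[0]
    # Lookup table from global node id to local id (-1 = not alive at t).
    local_id = np.full(alive_mask.shape[0], -1, dtype=np.int64)
    local_id[alive_idx] = np.arange(len(alive_idx))

    at_t = edge_t == t
    edge_index = np.stack([local_id[edge_src[at_t]], local_id[edge_dst[at_t]]])
    features = X[alive_idx, :, t]   # (M, 5)
    return features, edge_index, alive_idx

# Toy data: 4 cells, 2 frames; cells 1 and 3 are alive and in contact at t=1.
X = np.arange(4 * 5 * 2, dtype=np.float32).reshape(4, 5, 2)
alive_mask = np.array([[1, 0], [1, 1], [0, 0], [0, 1]], dtype=bool)
edge_src = np.array([0, 1, 1, 3], dtype=np.int32)
edge_dst = np.array([1, 0, 3, 1], dtype=np.int32)
edge_t = np.array([0, 0, 1, 1], dtype=np.int32)

features, edge_index, alive_idx = snapshot_graph(
    X, alive_mask, edge_src, edge_dst, edge_t, t=1)
print(features.shape, edge_index.tolist(), alive_idx.tolist())
# (2, 5) [[0, 1], [1, 0]] [1, 3]
```

Keeping `alive_idx` alongside the snapshot makes it possible to map predictions back to global indices and cell names.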
Tracing Cell Lineage
# Build reverse lookup
cell_to_idx = {cell: idx for idx, cell in enumerate(idx_to_cell)}

def trace_daughters(parent_name, depth=0, max_depth=3):
    """Recursively print a cell and its descendants, down to max_depth generations."""
    idx = cell_to_idx[parent_name]
    # First frame at which this cell is alive (its birth frame)
    alive_t = np.where(alive_mask[idx, :])[0]
    born = f"born at t={alive_t[0]}" if len(alive_t) else "never alive"
    print("  " * depth + f"→ {parent_name} ({born})")
    if depth >= max_depth:
        return
    # Daughters: names that extend parent_name by exactly one letter
    daughters = [cell for cell in idx_to_cell
                 if cell.startswith(parent_name) and len(cell) == len(parent_name) + 1]
    for daughter in daughters:
        trace_daughters(daughter, depth + 1, max_depth)

trace_daughters("AB")
Tracking Cell Movement
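# Get trajectory of cell "ABal"
cell_name = "ABal"
idx = cell_to_idx[cell_name]
# Get all living timepoints for this cell
alive_t = np.where(alive_mask[idx, :])[0]
# Extract trajectory: slice the cell first, then pick its living timepoints.
# (X[idx, :3, alive_t] already has shape (T_alive, 3), so transposing it
# would put the coordinates on axis 0.)
trajectory = X[idx, :3, :][:, alive_t].T  # (T_alive, 3): x, y, z over time
# Compute frame-to-frame displacement
displacement = np.linalg.norm(np.diff(trajectory, axis=0), axis=1)
# x, y are stored in px and z in μm, so the total is in mixed raw units
print(f"Cell {cell_name}: born at t={alive_t[0]}, moved {displacement.sum():.1f} raw units total")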
Batch Loading Multiple Embryos
from pathlib import Path

import numpy as np
import pandas as pd

embryo_dir = Path("dataset/processed/by_embryo")
npz_files = sorted(embryo_dir.glob("*.npz"))
# Collect summary stats (first 10 embryos here)
stats = []
for npz_file in npz_files[:10]:
    npz = np.load(npz_file, allow_pickle=True)
    m = npz["metadata"].item()
    stats.append({
        "embryo": m["source_file"],
        "n_cells": m["N"],
        "n_timepoints": m["T"],
    })
df_stats = pd.DataFrame(stats)
print(df_stats.describe())
Validation & Quality Control
What we check
- No NaN values: X, alive_mask, and edges are all valid
- Edge consistency: Nodes in edges exist (< N)
- Time bounds: edge_t in [0, T-1]
- Feature ranges: Positions within microscope FOV, sizes positive
- Sparsity: Graphs are sparse (not dense)
- Lineage closure: Parents exist for all daughters
Running QC
from pathlib import Path

import numpy as np

def validate_epic_npz(npz_path):
    """Basic validation of EPIC NPZ file."""
    npz = np.load(npz_path, allow_pickle=True)
    X = npz["X"]
    alive_mask = npz["alive_mask"]
    edge_src = npz["edge_src"]
    edge_dst = npz["edge_dst"]
    edge_t = npz["edge_t"]
    N, d, T = X.shape
    # Check 1: No NaN
    assert not np.any(np.isnan(X)), "NaN in X"
    # Check 2: Edge bounds
    assert np.all(edge_src < N) and np.all(edge_src >= 0), "edge_src out of bounds"
    assert np.all(edge_dst < N) and np.all(edge_dst >= 0), "edge_dst out of bounds"
    assert np.all(edge_t < T) and np.all(edge_t >= 0), "edge_t out of bounds"
    # Check 3: Sparsity. Edges are pooled over all T frames, so average per
    # frame; E / (N * N) alone exceeds 0.1 for E=56605, N=688.
    E = len(edge_src)
    density = E / (T * N * N)
    assert density < 0.1, f"Graph too dense: {density:.1%}"
    # Check 4: Feature ranges
    assert np.all(X[:, 0, :] >= 0) and np.all(X[:, 0, :] <= 512), "x out of range"
    assert np.all(X[:, 1, :] >= 0) and np.all(X[:, 1, :] <= 512), "y out of range"
    assert np.all(X[:, 3, :] >= 0), "size negative"
    print(f"✓ {npz_path.name}: {N} cells, {T} timepoints, {E} edges: VALID")

# Test all
for npz_file in sorted(Path("dataset/processed/by_embryo").glob("*.npz")):
    validate_epic_npz(npz_file)
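The spatial graph is described as undirected with both directions stored, which suggests one further check that could be added to QC. The helper `missing_reverse_edges` below is a hypothetical sketch, not part of the pipeline; it lists every `(src, dst, t)` triple whose reverse is absent.

```python
import numpy as np

def missing_reverse_edges(edge_src, edge_dst, edge_t):
    """Return the (src, dst, t) triples whose reverse (dst, src, t) is absent."""
    forward = set(zip(edge_src.tolist(), edge_dst.tolist(), edge_t.tolist()))
    return sorted(e for e in forward if (e[1], e[0], e[2]) not in forward)

# Toy edge list: (0, 1) at t=0 is symmetric, (2, 3) at t=1 lacks its reverse.
edge_src = np.array([0, 1, 2])
edge_dst = np.array([1, 0, 3])
edge_t = np.array([0, 0, 1])

print(missing_reverse_edges(edge_src, edge_dst, edge_t))
# [(2, 3, 1)]
```

An empty list means the edge set is symmetric at every timepoint.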
Troubleshooting
Common Issues
Issue: KeyError: 'X' when loading NPZ
Cause: Corrupted or wrong NPZ file
Fix: Re-run preprocessing for that embryo
python scripts/preprocess_dataset.py \
--raw_dir dataset/raw \
--out dataset/processed/by_embryo \
--distance_threshold 20
Issue: MemoryError loading all 260 embryos
Cause: Trying to load everything into RAM
Fix: Load embryos one at a time or use generators
# Don't do this:
all_embryos = [np.load(f) for f in npz_files]  # risks OOM
# Do this instead:
for npz_file in npz_files:
    npz = np.load(npz_file, allow_pickle=True)
    # ... process one embryo
    del npz  # Free memory
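The "use generators" option can be sketched as follows. `iter_embryos` is a hypothetical helper name; it opens one archive at a time in a `with` block so each file handle is closed before the next one is read. The demo writes two tiny synthetic archives to a temporary directory so that it runs on its own.

```python
import os
import tempfile

import numpy as np

def iter_embryos(paths):
    """Yield (path, arrays) one archive at a time, closing each file afterwards."""
    for path in paths:
        with np.load(path, allow_pickle=True) as npz:
            # Materialize only the arrays needed while the file is open.
            yield path, {"X": npz["X"], "alive_mask": npz["alive_mask"]}

# Self-contained demo on two tiny synthetic archives.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, n_cells in enumerate([2, 3]):
        p = os.path.join(tmp, f"toy_{i}.npz")
        np.savez_compressed(p, X=np.zeros((n_cells, 5, 4), dtype=np.float32),
                            alive_mask=np.ones((n_cells, 4), dtype=bool))
        paths.append(p)
    cell_counts = [arrays["X"].shape[0] for _, arrays in iter_embryos(paths)]

print(cell_counts)
# [2, 3]
```

Only one embryo's arrays are referenced at any moment, so peak memory stays near the size of the largest single embryo.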
Issue: Sparse edges have isolated nodes
Cause: Cells with no spatial contacts (rare edge case)
Fix: Handle with alive_mask filtering
# Get nodes with edges at time t
at_t = (edge_t == t)
nodes_with_edges = np.unique(np.concatenate([edge_src[at_t], edge_dst[at_t]]))
# Living cells without any edge at time t are isolated
alive_idx = np.where(alive_mask[:, t])[0]
isolated = np.setdiff1d(alive_idx, nodes_with_edges)
Issue: Lineage ages don’t add up
Cause: Cell naming inconsistencies in source data
Fix: Check source CSV for naming errors
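# List cells whose name-derived parent is missing from the index
# (root cells such as AB are expected to show up here)
names = set(idx_to_cell)
for cell in idx_to_cell:
    parent = cell[:-1]
    if parent and parent not in names:
        print(f"No parent found for {cell} (expected {parent})")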
Complete File Manifest
Directory Structure
├── dataset/
│   ├── raw/ (Raw EPIC microscopy CSV files)
│   │   ├── CD011505_end1red_bright.csv
│   │   ├── CD011605_5a_bright.csv
│   │   ├── ... (258 more CSV files)
│   │   └── [260 files total, ~260 MiB]
│   │
│   └── processed/by_embryo/ (Preprocessed NPZ archives)
│       ├── CD011505_end1red_bright.npz
│       ├── CD011605_5a_bright.npz
│       ├── ... (258 more NPZ files)
│       ├── manifest.txt
│       └── [260 files total, ~180 MiB]
│
├── scripts/
│   ├── preprocess_dataset.py (Main pipeline runner)
│   ├── usage_examples.py (7 working examples)
│   └── make_figures.py (Visualization utilities)
│
├── src/
│   └── epic_preprocess.py (Core preprocessing functions)
│       ├── build_local_index()
│       ├── populate_X()
│       ├── populate_alive_mask()
│       ├── build_spatial_edges()
│       ├── build_lineage_edges()
│       └── save_to_npz()
│
└── docs/
    ├── README.md (Navigation & overview)
    ├── QUICK_REFERENCE.md (5-minute lookup)
    ├── DATABASE_DOCUMENTATION.md (Comprehensive reference)
    ├── ARCHITECTURE.md (System design & diagrams)
    └── EPIC_COMPLETE_GUIDE.md (This file: full narrative)
Key Statistics
Total embryos processed: 260
Total raw data: ~260 MiB
Total processed data: ~180 MiB
Compression ratio: ~5:1 (in-memory tensors vs. NPZ on disk)
Average cells per embryo: 688 (range: ~100–900)
Average timepoints per embryo: 210 (range: 180–250)
Average spatial edges per embryo: 45,000
Average lineage edges per embryo: 687
Processing time (all): ~2–4 hours (varies by hardware)
Cross-References
Need more details?
- QUICK_REFERENCE.md: 5-minute lookup table
- DATABASE_DOCUMENTATION.md: Exhaustive reference (11 sections)
- ARCHITECTURE.md: System design & pipelines
- Usage Examples β 7 runnable code samples
Metadata
- Author: EPIC Preprocessing Pipeline v1.0
- Last Updated: April 20, 2026
- Version: 1.0
- Source Dataset: EPIC (Expression Patterns in Caenorhabditis) C. elegans embryo microscopy
- Organism: Caenorhabditis elegans embryos
- Microscopy: Fluorescence imaging (brightfield + confocal channels)
Contributing & Feedback
Found an issue? Have suggestions?
- Check Troubleshooting first
- Review QUICK_REFERENCE.md for quick answers
- File an issue with validation output from QC section
Happy analyzing!