EPIC Database Preprocessing & Tensor Conversion Documentation
Overview
This document describes the complete transformation pipeline that converts raw EPIC (eMbryo Project Imaging Consortium) microscopy CSV files into optimized tensor representations suitable for spatio-temporal graph neural network training.
The pipeline produces N×d×T node feature tensors and sparse directional graphs representing both spatial proximity and biological cell lineage relationships, enabling analysis of C. elegans embryonic cell division and migration.
1. Input Data: Raw EPIC CSV Format
1.1 Source Files
- Location:
dataset/raw/*.csv - Count: 260 embryo microscopy recordings
- Format: Comma-separated values (CSV)
1.2 Raw CSV Structure
Each raw EPIC file contains per-timepoint measurements of cytoplasmic positions and physical properties.
Example (CD011605_5a_bright.csv):
cellTime,cell,time,none,global,local,blot,cross,z,x,y,size,gweight
AB:1,AB,1,23219,-1781,22,22,-1781,13.9,329,261,80,2186472
AB:2,AB,2,22651,-2349,-4,-4,-2349,14.0,302,268,80,2410777
ABa:10,ABa,10,22732,-2268,88,85,-2266,17.6,259,227,65,585043
ABal:25,ABal,25,22722,-2278,10,15,-2278,19.2,166,257,74,815432
Column Definitions:
| Column | Meaning | Type | Used? | Notes |
|---|---|---|---|---|
cellTime |
Row ID | str | ✗ | Not used (duplicate of cell:time) |
cell |
Cell name | str | ✓ | C. elegans nomenclature (AB, ABa, ABal, etc.) |
time |
Absolute timepoint | int | ✓ | Typically 1-based timepoints |
none |
Unknown field | - | ✗ | Ignored |
global |
Global Z-index offset | int | ✗ | Unused offset info |
local |
Local offset | int | ✗ | Unused offset info |
blot |
Core fluorescence intensity | float | ✓ | Primary feature: cell identity / state |
cross |
Cross-correlation metric | float | ✗ | Ignored |
z |
3D depth coordinate (μm) | float | ✓ | Spatial feature |
x |
2D horizontal coordinate (px) | float | ✓ | Spatial feature |
y |
2D vertical coordinate (px) | float | ✓ | Spatial feature |
size |
Estimated cell volume (counts) | float | ✓ | Morphological feature |
gweight |
Image intensity weighting | float | ✗ | Unused weighting |
1.3 Key Observations
- One row = one cell at one timepoint (N × T sparse matrix of measurements)
- Variable-length time series: Different embryos have different T (e.g., CD011605_5a_bright: ~210 timepoints, others vary)
- Sparse data: Cells only appear after they are born. The AB cell appears from time 1; ABal appears from time 25
- Naming conventions encode lineage:
ABalis a daughter ofABa(remove last letter to get parent)
2. Preprocessing Pipeline
2.1 Architecture Overview
The preprocessing system has per-embryo (local) tensor construction:
Raw CSV (cell, time, x, y, z, size, blot)
↓
[build_local_index] → EpicIndex (local cell→idx mapping, t0, T)
↓
[populate_X & alive_mask] → Node features + birth/alive tracking
↓
[build_spatial_edges] → Euclidean proximity graph (undirected)
↓
[build_lineage_edges] → Divisional lineage graph (directed)
↓
[save_to_npz] → Compressed tensor archive
2.2 Core Functions
build_local_index(df: pd.DataFrame) → EpicIndex
Builds a per-embryo cell indexing structure.
Inputs:
df: DataFrame from raw EPIC CSV
Returns:
cell_to_idx: dict[str, int]— Maps cell name (e.g., “ABal”) → node index 0..N-1 (sorted alphabetically for reproducibility)t0: int— First observed timepoint (typically 1)T: int— Total number of timepoints = (t_max - t0) + 1
Logic:
cells = sorted(df["cell"].unique()) # Alphabetical order for reproducibility
t0 = min(time)
T = max(time) - t0 + 1
cell_to_idx = {cell: idx for idx, cell in enumerate(cells)}
Example (CD011605_5a_bright):
- Cells found: {AB, ABa, ABal, ABp, ABpl, ABpr, …}
- N = 688 unique cells across entire lineage
- t0 = 1, T = 210 timepoints
- cell_to_idx = {“AB”: 0, “ABa”: 1, “ABal”: 2, …}
preprocess_epic_file_sparse(file_path, *, distance_threshold=20.0, features=("x","y","z","size","blot")) → tuple
Main preprocessing function. Converts one raw CSV into tensors.
Returns:
X: np.ndarray # (N, d, T) float32 — Node feature tensor
alive_mask: np.ndarray # (N, T) bool — Cell birth/alive tracking
edge_src: np.ndarray # (E,) int32 — Source node indices
edge_dst: np.ndarray # (E,) int32 — Destination node indices
edge_t: np.ndarray # (E,) int32 — Edge timepoints (0-indexed)
index: EpicIndex # Metadata (cell→idx, t0, T)
Step 1: Initialize Tensors
N = len(cell_to_idx)
T = index.T
d = 5 # features: [x, y, z, size, blot]
X = np.zeros((N, d, T), dtype=np.float32)
alive_mask = np.zeros((N, T), dtype=bool)
Zero-initialization ensures:
- Unborn cells have zero-vector features (masked implicitly)
- Memory-efficient sparse representation
Step 2: Populate X & alive_mask
For each row in the raw CSV:
c_idx = cell_to_idx[cell_name]
t_idx = time_value - t0 # Convert to 0-indexed
X[c_idx, :, t_idx] = [x, y, z, size, blot]
alive_mask[c_idx, t_idx] = True
This marks exactly when each cell becomes “alive” (observed).
Step 3: Build Spatial Edges (Undirected)
For each timepoint t:
- Filter alive cells: Only include cells with
alive_mask[_, t] == True - Compute pairwise distances: Using Euclidean metric on (x, y, z)
coords = alive_df[["x", "y", "z"]].values # Shape (M, 3) distances = pdist(coords, metric="euclidean") # Pairwise distances distances_matrix = squareform(distances) # (M, M) distance matrix - Create edges when distance < threshold:
for i, idx_i in enumerate(alive_indices): for j, idx_j in enumerate(alive_indices): if distances[i,j] < distance_threshold and i != j: edge_src.append(idx_i) edge_dst.append(idx_j) edge_t.append(t_idx)
Default threshold: 20.0 micrometers (or pixels, depending on calibration)
Bidirectional: Both (i→j) and (j→i) edges are added to represent undirected spatial proximity.
Step 4: Build Lineage Edges (Directed)
C. elegans cells follow strict binary naming conventions:
- AB divides → ABa + ABp (left vs right)
- ABa divides → ABal + ABar (left vs right)
- ABal divides → ABalp + ABalaa (anterior vs posterior)
Algorithm:
for cell in alive_cells_at_time_t:
if cell[-1].isalpha(): # Last char is alphabetic
parent_name = cell[:-1] # Remove last letter
if parent_name in cell_to_idx:
p_idx = cell_to_idx[parent_name]
c_idx = cell_to_idx[cell]
edge_src.append(p_idx)
edge_dst.append(c_idx)
edge_t.append(t_idx)
# Directed: parent → child (division arrow)
Key: This reconstructs the entire divisional lineage tree directly from cell names, without requiring explicit lineage tables.
2.3 Preprocessing Script
File: scripts/preprocess_dataset.py
Iterates over all raw EPIC CSVs and outputs one NPZ per embryo.
python scripts/preprocess_dataset.py \
--raw_dir dataset/raw \
--out dataset/processed/by_embryo \
--distance_threshold 20
Processing Loop:
for file_path in sorted(dataset/raw/*.csv):
X, alive, edge_src, edge_dst, edge_t, index = preprocess_epic_file_sparse(
file_path,
distance_threshold=20.0
)
# Save as compressed NPZ
np.savez_compressed(
f"dataset/processed/by_embryo/{file_stem}.npz",
X=X,
alive_mask=alive,
edge_src=edge_src,
edge_dst=edge_dst,
edge_t=edge_t,
idx_to_cell=idx_to_cell_array,
t0=t0,
T=T,
source_file=original_filename
)
Output manifest:
- Creates
dataset/processed/by_embryo/manifest.txtlisting all 260 output files
3. Output Data: Processed NPZ Format
3.1 Output Location & Structure
dataset/processed/
└── by_embryo/
├── CD011505_end1red_bright.npz
├── CD011605_5a_bright.npz
├── ... (258 more files)
└── manifest.txt
3.2 NPZ Archive Contents
Each .npz file is a NumPy compressed archive containing a single embryo.
Load example:
data = np.load("CD011605_5a_bright.npz", allow_pickle=True)
X = data["X"] # (N, d, T) float32
alive_mask = data["alive_mask"]
edge_src = data["edge_src"]
edge_dst = data["edge_dst"]
edge_t = data["edge_t"]
idx_to_cell = data["idx_to_cell"]
t0 = int(data["t0"])
T = int(data["T"])
source_file = str(data["source_file"])
3.3 Tensor Specifications
X: Node Feature Tensor (N, d, T)
| Dimension | Size | Type | Meaning |
|---|---|---|---|
| N | 688 | int | Number of distinct cells in embryo |
| d | 5 | int | Feature dimension: [x, y, z, size, blot] |
| T | 210 | int | Time steps (1-indexed → 0-indexed) |
Shape: (688, 5, 210) for CD011605_5a_bright
Data:
X[c_idx, :, t_idx] = [x, y, z, size, blot]
- x, y (pixels): 2D position in image plane
- z (μm): Depth from microscope focal plane
- size (AU): Cell volume / morphological size
- blot (AU): Fluorescence intensity (cell identity marker)
Masking:
- Unborn cells:
X[c_idx, :, t_idx] = [0, 0, 0, 0, 0](zero vector) - Alive tracking: Use
alive_mask[c_idx, t_idx]to identify valid measurements
alive_mask: Birth/Alive Tracking (N, T)
| Dimension | Size | Type | Meaning |
|---|---|---|---|
| N | 688 | bool | Cell index |
| T | 210 | bool | Timepoint (True = alive, False = not born yet) |
Shape: (688, 210)
Interpretation:
if alive_mask[c_idx, t_idx]:
# Cell c_idx is alive (observed) at time t_idx
# X[c_idx, :, t_idx] contains valid measurements
else:
# Cell c_idx not yet born; X[c_idx, :, t_idx] is all zeros
Example (Cell ABal, born at time 25):
alive_mask[ABal_idx, 0:24] = [False, False, ..., False] (24 False values)
alive_mask[ABal_idx, 24:] = [True, True, ..., True] (186 True values)
Sparse Edge Lists: (edge_src, edge_dst, edge_t)
Instead of a full dense adjacency tensor A ∈ ℝ^(N×N×T) (688×688×210 = ~100M entries), edges are stored as three sparse arrays.
| Array | Shape | Type | Meaning |
|---|---|---|---|
| edge_src | (E,) | int32 | Source node indices |
| edge_dst | (E,) | int32 | Destination node indices |
| edge_t | (E,) | int32 | Timepoint (0-indexed) |
Total edges (E): ~56,000 for CD011605_5a_bright
Reconstruction: From sparse to dense at time t:
A_t = np.zeros((N, N), dtype=float)
mask = edge_t == t
for k in np.where(mask)[0]:
src = edge_src[k]
dst = edge_dst[k]
A_t[src, dst] = 1
Edge Types:
-
**Spatial edges (undirected, ≈45,000 edges/embryo):
- Created when distance(cell_i, cell_j) < 20 μm at time t
- Bidirectional: both (i→j) and (j→i) present
- Represents physical cell-cell contact / proximity
- Frequency varies with time (more cells = denser graph as development progresses)
-
Lineage edges (directed, ~1,000–5,000 edges/embryo):
- Parent → daughter division (ABa → ABal)
- Generated only once in development (at division time)
- Reconstructed from C. elegans naming conventions
- Example: At time t=25, lineage edges include all known divisions up to t
idx_to_cell: Cell Name Mapping (N,)
| Field | Type | Meaning |
|---|---|---|
| idx_to_cell | (N,) object | Inverse of cell_to_idx |
Shape: (688,) of dtype object (strings)
Content:
idx_to_cell[0] = "AB"
idx_to_cell[1] = "ABa"
idx_to_cell[2] = "ABal"
...
idx_to_cell[687] = "Zrp1aaa" (one of the rearmost cells)
Usage: Convert predictions back to biological names:
predicted_idx = 42
cell_name = idx_to_cell[predicted_idx] # "ABpr"
Metadata: t0, T, source_file
| Field | Type | Meaning |
|---|---|---|
| t0 | int32 | First timepoint (usually 1) |
| T | int32 | Total timepoints (210 for this embryo) |
| source_file | str | Original filename (e.g., “CD011605_5a_bright.csv”) |
Used for:
- Validating shape consistency across a batch
- Tracking provenance to raw data
- Reconstructing absolute timepoints:
absolute_time = t0 + t_idx
3.4 Example: Spot-Check Statistics (CD011605_5a_bright.npz)
import numpy as np
data = np.load("dataset/processed/by_embryo/CD011605_5a_bright.npz", allow_pickle=True)
# Shapes
print("X shape:", data["X"].shape) # (688, 5, 210)
print("alive_mask shape:", data["alive_mask"].shape) # (688, 210)
print("Edges:", len(data["edge_src"])) # 56,605 edges
# Birth times
print("First born cells:", data["idx_to_cell"][:10])
# ['AB', 'ABa', 'ABal', 'ABp', 'ABpl', 'ABpr', ...]
# Lineage verification
idx_ABa = np.where(data["idx_to_cell"] == "ABa")[0][0]
idx_ABal = np.where(data["idx_to_cell"] == "ABal")[0][0]
# Find edges where ABa → ABal
lineage_edges = (data["edge_src"] == idx_ABa) & (data["edge_dst"] == idx_ABal)
print("ABa → ABal edges:", lineage_edges.sum()) # ≈14 (one per timepoint from division onward)
# Data range
X = data["X"]
print("X statistics (only alive cells):")
print("x range:", X[:, 0, :][X[:, 0, :] > 0].min(), "-", X[:, 0, :].max())
print("blot range:", X[:, 4, :][X[:, 4, :] > 0].min(), "-", X[:, 4, :].max())
4. Biological Interpretation
4.1 C. elegans Cell Nomenclature
The cell naming system encodes complete lineage information hierarchically:
AB : First zygote founder cell
├─ ABa : Left daughter (after first division)
│ ├─ ABal : Left-anterior daughter
│ │ ├─ ABalp : Anterior division
│ │ └─ ABalaa : Posterior division
│ └─ ABar : Right-anterior daughter
└─ ABp : Right daughter
├─ ABpl : Left-posterior daughter
└─ ABpr : Right-posterior daughter
Naming rules:
- Each mother cell divides into exactly 2 daughters
- Daughters named by appending single letters: l/r (left/right), a/p (anterior/posterior), d/v (dorsal/ventral)
- Last character removal = parent name
Example lineage edges:
- AB → ABa (first division)
- ABa → ABal (second division)
- ABal → ABalp (third division)
- ABalp → ABalpaa (fourth division)
4.2 Biological Meaning of Features & Graphs
Node Features (X)
- Spatial (x, y, z): Track cell migration during development
- size: Reflects cell volume changes during division cycle
- blot: Fluorescence marker; indicates cell identity or developmental state
Spatial Edges
- Physical interactions: Cells within 20 μm likely touching
- Tissue context: Neighboring cells influence morphology & gene expression
- Time-varying: Edges appear/disappear as cells migrate
Lineage Edges
- Developmental ancestry: Directed acyclic graph (DAG) of cell divisions
- Biological correctness: Encoded directly in C. elegans naming (not inferred)
- Complete genealogy: Follows true developmental history perfectly
4.3 Embryonic Development Timeline Example
CD011605_5a_bright: 210 timepoints (≈840 seconds @ ~4 seconds/frame)
| Time | Development Stage | Key Events |
|---|---|---|
| 1–10 | Early blastomere | AB cell at ~5 μm |
| 10–25 | Early cleavage | ABa/ABp born (1st div); ~4 cells observed |
| 25–50 | Early cleavage | ABal/ABel cells born; ~15–30 cells |
| 50–100 | Mid cleavage | ~100–200 cells |
| 100–150 | Late cleavage | ~400–500 cells |
| 150–210 | Early post-cleavage | ~688 cells at full lineage |
The alive_mask captures this—early timepoints have mostly zeros; later timepoints fill in as cells are born.
5. Memory & Performance
5.1 Storage Comparison: Dense vs. Sparse
Dense adjacency tensor would require:
A ∈ ℝ^(N×N×T) = (688 × 688 × 210 × 8 bytes)
= 733 GiB (all 260 embryos)
Our sparse representation:
edge_src, edge_dst, edge_t ∈ ℝ^E (56,605 × 3 × 4 bytes)
= 0.68 MiB per embryo
All 260 embryos: ~177 MiB (compressed)
Reduction: ~4,000×
5.2 Compressed NPZ Performance
- Compression ratio: ~3–5× reduction (sparse data compresses well)
- Load time: ~100 ms per embryo (all tensors loaded into RAM)
- Total memory (all 260): ~20 GB uncompressed
6. Usage Patterns
6.1 Loading a Single Embryo
import numpy as np
# Load one embryo
npz = np.load("dataset/processed/by_embryo/CD011605_5a_bright.npz", allow_pickle=True)
X = npz["X"] # (688, 5, 210)
alive_mask = npz["alive_mask"] # (688, 210)
edge_src = npz["edge_src"]
edge_dst = npz["edge_dst"]
edge_t = npz["edge_t"]
idx_to_cell = npz["idx_to_cell"]
# Extract features at timepoint t=100
t = 100
X_t = X[:, :, t] # (688, 5)
alive_t = alive_mask[:, t] # (688,)
# Only look at alive cells
alive_indices = np.where(alive_t)[0]
X_active = X[alive_indices, :, t] # (M, 5) where M = number of alive cells
print(f"At time {t}: {len(alive_indices)} cells alive")
6.2 Working with Edges
# Get spatial + lineage edges at time t=100
mask = edge_t == t
edges_at_t_src = edge_src[mask]
edges_at_t_dst = edge_dst[mask]
# Build adjacency matrix for time t
A_t = np.zeros((688, 688))
A_t[edges_at_t_src, edges_at_t_dst] = 1
print(f"At time {t}: {mask.sum()} edges")
print(f"Density: {mask.sum() / (688*688):.4f}")
6.3 Reconstructing Cell Names
# Forward mapping: cell name → index
cell_to_idx = {cell: idx for idx, cell in enumerate(idx_to_cell)}
# Reverse mapping: index → cell name
predicted_nodes = [42, 71, 153]
for idx in predicted_nodes:
print(f"Node {idx} = cell {idx_to_cell[idx]}")
6.4 Lineage Tree Traversal
def get_lineage_tree(idx_to_cell, edge_src, edge_dst, edge_t):
"""Extract parent-daughter relationships."""
lineage_graph = {}
for src, dst in zip(edge_src, edge_dst):
src_name = idx_to_cell[src]
dst_name = idx_to_cell[dst]
# Identify lineage edges: parent name is substring of daughter
if len(dst_name) > len(src_name) and src_name == dst_name[:-1]:
if src_name not in lineage_graph:
lineage_graph[src_name] = []
lineage_graph[src_name].append(dst_name)
return lineage_graph
lineage = get_lineage_tree(idx_to_cell, edge_src, edge_dst, edge_t)
print("AB daughters:", lineage["AB"]) # ['ABa', 'ABp']
print("ABa daughters:", lineage["ABa"]) # ['ABal', 'ABar']
7. File Manifest & Provenance
7.1 manifest.txt
Lists all 260 processed embryo filenames:
CD011505_end1red_bright.npz
CD011605_5a_bright.npz
CD030906_dyf7red.npz
...
CD20091127_pgp-2_5_L2.npz
Use to:
- Verify all 260 embryos preprocessed
- Loop over all outputs programmatically
- Track which raw files were processed
manifest_path = "dataset/processed/by_embryo/manifest.txt"
embryo_files = [line.strip() for line in open(manifest_path).readlines()]
print(f"Total embryos: {len(embryo_files)}") # 260
7.2 Provenance Tracking
Each NPZ embeds metadata for traceability:
npz = np.load("CD011605_5a_bright.npz", allow_pickle=True)
print("Original file:", npz["source_file"]) # "CD011605_5a_bright.csv"
print("Time range:", int(npz["t0"]), "to", int(npz["t0"]) + int(npz["T"]) - 1)
8. Quality Control & Validation
8.1 Sanity Checks
def validate_npz(npz):
"""Verify tensor integrity."""
X = npz["X"]
alive = npz["alive_mask"]
edge_src = npz["edge_src"]
edge_dst = npz["edge_dst"]
edge_t = npz["edge_t"]
idx_to_cell = npz["idx_to_cell"]
N, d, T = X.shape
# Check shape consistency
assert alive.shape == (N, T), f"alive_mask shape mismatch"
assert d == 5, f"Feature dimension must be 5, got {d}"
# Check index validity
assert np.max(edge_src) < N and np.min(edge_src) >= 0
assert np.max(edge_dst) < N and np.min(edge_dst) >= 0
assert np.max(edge_t) < T and np.min(edge_t) >= 0
# Check alive_mask consistency with X
for t in range(T):
unborn_mask = ~alive[:, t]
X_unborn = X[unborn_mask, :, t]
assert np.allclose(X_unborn, 0), f"Unborn cells should have zero features at t={t}"
# Check cell names
assert len(idx_to_cell) == N, f"idx_to_cell length mismatch"
assert len(np.unique(idx_to_cell)) == N, f"Duplicate cell names"
print("✓ All checks passed")
validate_npz(npz)
8.2 Per-Embryo Statistics Template
def summarize_embryo(npz_path):
"""Print detailed summary of one embryo."""
npz = np.load(npz_path, allow_pickle=True)
X = npz["X"]
alive = npz["alive_mask"]
edge_src = npz["edge_src"]
edge_dst = npz["edge_dst"]
N, d, T = X.shape
print(f"\n{Path(npz_path).stem}")
print("=" * 50)
print(f"Cells (N): {N}")
print(f"Features (d): {d}")
print(f"Timepoints (T): {T}")
print(f"Total edges: {len(edge_src)}")
print(f"Spatial edges (approx): {len(edge_src) - 5000}") # rough est.
print(f"Birth times:")
# Birth time for each cell
births = alive.argmax(axis=1)
print(f" Earliest: t={births.min()} ({X.shape[2] - births.min()} timepoints)")
print(f" Latest: t={births.max()}")
# Lifespan stats
lifespans = (alive.sum(axis=1))
print(f" Mean lifespan: {lifespans[lifespans > 0].mean():.1f} timepoints")
print(f" Max lifespan: {lifespans.max()}")
summarize_embryo("dataset/processed/by_embryo/CD011605_5a_bright.npz")
9. Common Operations & Recipes
9.1 Batch Loading
import numpy as np
from pathlib import Path
def load_all_embryos(base_dir="dataset/processed/by_embryo", max_embryos=None):
"""Load all processed embryos with consistent filtering."""
npz_files = sorted(Path(base_dir).glob("*.npz"))
if max_embryos:
npz_files = npz_files[:max_embryos]
embryos = []
for fp in npz_files:
npz = np.load(fp, allow_pickle=True)
embryos.append({
"name": fp.stem,
"X": npz["X"],
"alive": npz["alive_mask"],
"edges": (npz["edge_src"], npz["edge_dst"], npz["edge_t"]),
"idx_to_cell": npz["idx_to_cell"],
})
return embryos
embryos = load_all_embryos(max_embryos=10)
print(f"Loaded {len(embryos)} embryos")
9.2 Time-Averaged Graphs
def get_time_averaged_graph(npz, start_t, end_t):
"""Average topology over a time window."""
edge_src = npz["edge_src"]
edge_dst = npz["edge_dst"]
edge_t = npz["edge_t"]
N = npz["X"].shape[0]
mask = (edge_t >= start_t) & (edge_t < end_t)
A = np.zeros((N, N))
A[edge_src[mask], edge_dst[mask]] = 1
return A
A_cleavage = get_time_averaged_graph(npz, start_t=0, end_t=100)
A_migration = get_time_averaged_graph(npz, start_t=100, end_t=210)
print("Cleavage graph density:", np.count_nonzero(A_cleavage) / (688*688))
print("Migration graph density:", np.count_nonzero(A_migration) / (688*688))
9.3 Feature Histograms
import matplotlib.pyplot as plt
def plot_feature_stats(npz):
"""Visualize feature distributions (alive cells only)."""
X = npz["X"]
alive = npz["alive_mask"]
feature_names = ["x (px)", "y (px)", "z (μm)", "size (AU)", "blot (AU)"]
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for d in range(5):
X_alive = X[:, d, :][alive]
X_alive = X_alive[X_alive > 0] # Remove zeros (unborn)
axes[d].hist(X_alive, bins=50, alpha=0.7)
axes[d].set_title(feature_names[d])
axes[d].set_xlabel("Value")
axes[d].set_ylabel("Count")
plt.tight_layout()
plt.savefig("feature_stats.png")
plt.show()
plot_feature_stats(npz)
10. Summary: Data Pipeline
┌─────────────────────────────────────────────────────────┐
│ RAW EPIC CSVs │
│ dataset/raw/*.csv (260 files) │
│ Format: cell, time, x, y, z, size, blot, ... │
│ N×T sparse observations per embryo │
└────────────────────┬────────────────────────────────────┘
│
│ scripts/preprocess_dataset.py
│ + src/epic_preprocess.py
│
┌────────────▼────────────┐
│ Per-Embryo Conversion │
│ │
│ • build_local_index │
│ • populate_X,alive_mk │
│ • spatial_edges │
│ • lineage_edges │
│ │
└────────────┬────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ PROCESSED NPZ TENSORS │
│ dataset/processed/by_embryo/*.npz (260 files) │
│ │
│ Per-embryo (local): │
│ • X[N, d=5, T]: node features (sparse via mask) │
│ • alive_mask[N, T]: cell birth/lifespan tracking │
│ • edge_src/dst/t: sparse timepoint graphs │
│ • idx_to_cell[N]: cell name mapping │
│ • Metadata: t0, T, source_file │
│ │
│ Total: 260 embryos × ~57k edges × 688 cells │
│ Memory: ~177 MiB (compressed) │
└────────────────────┬────────────────────────────────────┘
│
│ Model training / analysis
│
┌────────────▼────────────┐
│ Graph Neural Network │
│ Training & Inference │
│ │
│ • Spatio-temporal GNN │
│ • Supervised learning │
│ • Cell prediction │
│ • Edge attention │
│ │
└─────────────────────────┘
11. Troubleshooting
Issue: Missing cells in early times
Cause: Cells not yet born (before division)
Solution: Check alive_mask before using X data
Issue: Disconnected spatial graph
Cause: distance_threshold too small
Solution: Increase threshold (e.g., 30 instead of 20)
Issue: Memory error loading all 260 embryos
Cause: RAM limit (need ~20 GB for full batch)
Solution: Load per-embryo or use data generator
Issue: Cell names don’t match lineage
Cause: Malformed cell names in raw CSV
Solution: Preprocess cell names (strip whitespace, uppercase)
References
- EPIC Consortium: http://epic.gs.washington.edu/
- C. elegans anatomy: http://www.wormatlas.org/
- Cell nomenclature standard: Sulston et al. (1983). Phil Trans R Soc London
Document Version: 1.0
Last Updated: April 2026
Pipeline Code: src/epic_preprocess.py