EPIC Database Preprocessing & Tensor Conversion Documentation

Overview

This document describes the complete transformation pipeline that converts raw EPIC (Expression Patterns in Caenorhabditis) microscopy CSV files into tensor representations suitable for training spatio-temporal graph neural networks.

The pipeline produces N×d×T node feature tensors and sparse directional graphs representing both spatial proximity and biological cell lineage relationships, enabling analysis of C. elegans embryonic cell division and migration.

1. Input Data: Raw EPIC CSV Format

1.1 Source Files

Location: dataset/raw/*.csv
Count: 260 embryo microscopy recordings
Format: Comma-separated values (CSV)

1.2 Raw CSV Structure

Each raw EPIC file contains per-timepoint measurements of each tracked cell's position, size, and fluorescence.

Example (CD011605_5a_bright.csv):

cellTime,cell,time,none,global,local,blot,cross,z,x,y,size,gweight
AB:1,AB,1,23219,-1781,22,22,-1781,13.9,329,261,80,2186472
AB:2,AB,2,22651,-2349,-4,-4,-2349,14.0,302,268,80,2410777
ABa:10,ABa,10,22732,-2268,88,85,-2266,17.6,259,227,65,585043
ABal:25,ABal,25,22722,-2278,10,15,-2278,19.2,166,257,74,815432

Column Definitions:

Column	Meaning	Type	Used?	Notes
`cellTime`	Row ID	str	✗	Not used (duplicate of cell:time)
`cell`	Cell name	str	✓	C. elegans nomenclature (AB, ABa, ABal, etc.)
`time`	Absolute timepoint	int	✓	Typically 1-based timepoints
`none`	Unknown field	-	✗	Ignored
`global`	Global Z-index offset	int	✗	Unused offset info
`local`	Local offset	int	✗	Unused offset info
`blot`	Core fluorescence intensity	float	✓	Primary feature: cell identity / state
`cross`	Cross-correlation metric	float	✗	Ignored
`z`	3D depth coordinate (μm)	float	✓	Spatial feature
`x`	2D horizontal coordinate (px)	float	✓	Spatial feature
`y`	2D vertical coordinate (px)	float	✓	Spatial feature
`size`	Estimated cell volume (counts)	float	✓	Morphological feature
`gweight`	Image intensity weighting	float	✗	Unused weighting
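
As a rough illustration of the table above, the sketch below reads a raw file and keeps only the columns marked as used. The helper name `read_epic_csv` and the constant `USED_COLUMNS` are hypothetical and not part of the pipeline code; the sample rows are copied from the example file.

```python
import io

import pandas as pd

# Columns marked "used" in the table above (hypothetical constant name).
USED_COLUMNS = ["cell", "time", "x", "y", "z", "size", "blot"]


def read_epic_csv(source):
    """Read one raw EPIC CSV and keep only the used columns (illustrative helper)."""
    df = pd.read_csv(source, usecols=USED_COLUMNS)
    # Strip stray whitespace; case is left untouched because names are case-sensitive.
    df["cell"] = df["cell"].astype(str).str.strip()
    # usecols keeps file order, so reorder explicitly.
    return df[USED_COLUMNS]


# Two rows copied from the example file shown earlier.
sample = io.StringIO(
    "cellTime,cell,time,none,global,local,blot,cross,z,x,y,size,gweight\n"
    "AB:1,AB,1,23219,-1781,22,22,-1781,13.9,329,261,80,2186472\n"
    "ABa:10,ABa,10,22732,-2268,88,85,-2266,17.6,259,227,65,585043\n"
)
df = read_epic_csv(sample)
print(df)
```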

1.3 Key Observations

One row = one cell at one timepoint (N × T sparse matrix of measurements)
Variable-length time series: Different embryos have different T (e.g., CD011605_5a_bright: ~210 timepoints, others vary)
Sparse data: A cell has rows only while it exists under that name, from its birth until it divides. The AB cell appears from time 1; ABal appears from time 25
Naming conventions encode lineage: ABal is a daughter of ABa (remove last letter to get parent)

2. Preprocessing Pipeline

2.1 Architecture Overview

The preprocessing system builds tensors one embryo at a time, using a per-embryo (local) cell index:

Raw CSV (cell, time, x, y, z, size, blot)
    ↓
[build_local_index] → EpicIndex (local cell→idx mapping, t0, T)
    ↓
[populate_X & alive_mask] → Node features + birth/alive tracking
    ↓
[build_spatial_edges] → Euclidean proximity graph (undirected)
    ↓
[build_lineage_edges] → Divisional lineage graph (directed)
    ↓
[save_to_npz] → Compressed tensor archive

2.2 Core Functions

`build_local_index(df: pd.DataFrame) → EpicIndex`

Builds a per-embryo cell indexing structure.

Inputs:

df: DataFrame from raw EPIC CSV

Returns:

cell_to_idx: dict[str, int] — Maps cell name (e.g., “ABal”) → node index 0..N-1 (sorted alphabetically for reproducibility)
t0: int — First observed timepoint (typically 1)
T: int — Total number of timepoints = (t_max - t0) + 1

Logic:

cells = sorted(df["cell"].unique())  # Alphabetical order for reproducibility
t0 = int(df["time"].min())           # First observed timepoint
T = int(df["time"].max()) - t0 + 1   # Number of timepoints, inclusive of both ends
cell_to_idx = {cell: idx for idx, cell in enumerate(cells)}

Example (CD011605_5a_bright):

Cells found: {AB, ABa, ABal, ABp, ABpl, ABpr, …}
N = 688 unique cells across entire lineage
t0 = 1, T = 210 timepoints
cell_to_idx = {“AB”: 0, “ABa”: 1, “ABal”: 2, …}
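
The logic above can be sketched as a small runnable function. The field layout of `EpicIndex` here is an assumption based on the three values listed under "Returns"; the actual class in src/epic_preprocess.py may differ.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class EpicIndex:
    """Assumed layout: only the three fields named in this document."""
    cell_to_idx: dict  # cell name -> node index
    t0: int            # first observed timepoint
    T: int             # number of timepoints


def build_local_index(df: pd.DataFrame) -> EpicIndex:
    cells = sorted(df["cell"].unique())
    t0 = int(df["time"].min())
    T = int(df["time"].max()) - t0 + 1
    return EpicIndex({cell: idx for idx, cell in enumerate(cells)}, t0, T)


# Toy frame: rows are deliberately out of order to show that sorting fixes the index.
toy = pd.DataFrame({"cell": ["ABa", "AB", "AB", "ABp"], "time": [3, 1, 2, 3]})
index = build_local_index(toy)
print(index)
```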

`preprocess_epic_file_sparse(file_path, *, distance_threshold=20.0, features=("x","y","z","size","blot")) → tuple`

Main preprocessing function. Converts one raw CSV into tensors.

Returns:

X: np.ndarray              # (N, d, T) float32  — Node feature tensor
alive_mask: np.ndarray     # (N, T) bool        — Cell birth/alive tracking
edge_src: np.ndarray       # (E,) int32         — Source node indices
edge_dst: np.ndarray       # (E,) int32         — Destination node indices
edge_t: np.ndarray         # (E,) int32         — Edge timepoints (0-indexed)
index: EpicIndex           # Metadata (cell→idx, t0, T)

Step 1: Initialize Tensors

N = len(cell_to_idx)
T = index.T
d = len(features)  # 5 with the default features: [x, y, z, size, blot]

X = np.zeros((N, d, T), dtype=np.float32)
alive_mask = np.zeros((N, T), dtype=bool)

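Zero-initialization ensures:

Cells that are not observed at a timepoint have zero-vector features (use alive_mask to tell them apart from genuine zero values)
Long runs of zeros compress well in the NPZ archive, although X itself is a dense array in memory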

Step 2: Populate X & alive_mask

For each row in the raw CSV:

c_idx = cell_to_idx[cell_name]
t_idx = time_value - t0  # Convert to 0-indexed
X[c_idx, :, t_idx] = [x, y, z, size, blot]
alive_mask[c_idx, t_idx] = True

After this step, alive_mask is True exactly at the (cell, timepoint) pairs that have a row in the CSV, i.e. where the cell is observed.
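
The per-row loop can also be written as a single vectorized assignment. The sketch below is one possible approach under the assumption that each (cell, time) pair occurs at most once; the toy rows are adapted from the example file and the variable names are illustrative.

```python
import numpy as np

features = ("x", "y", "z", "size", "blot")
cell_to_idx = {"AB": 0, "ABa": 1}
t0, T = 1, 3

# Toy rows: (cell, time, x, y, z, size, blot).
rows = [
    ("AB", 1, 329.0, 261.0, 13.9, 80.0, 22.0),
    ("AB", 2, 302.0, 268.0, 14.0, 80.0, -4.0),
    ("ABa", 3, 259.0, 227.0, 17.6, 65.0, 85.0),
]

c_idx = np.array([cell_to_idx[r[0]] for r in rows])
t_idx = np.array([r[1] - t0 for r in rows])          # convert to 0-indexed time
values = np.array([r[2:] for r in rows], dtype=np.float32)

X = np.zeros((len(cell_to_idx), len(features), T), dtype=np.float32)
alive_mask = np.zeros((len(cell_to_idx), T), dtype=bool)

# Index arrays on axes 0 and 2 with a slice in between select a (rows, d) block.
X[c_idx, :, t_idx] = values
alive_mask[c_idx, t_idx] = True

print(alive_mask)
```

If a file contained duplicate (cell, time) rows, the later row would silently overwrite the earlier one, so a duplicate check before this step may be worthwhile.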

Step 3: Build Spatial Edges (Undirected)

For each timepoint t:

Filter alive cells: Only include cells with alive_mask[:, t] == True

Compute pairwise distances: Using Euclidean metric on (x, y, z)

coords = alive_df[["x", "y", "z"]].values  # Shape (M, 3)
distances = pdist(coords, metric="euclidean")  # Pairwise distances
distances_matrix = squareform(distances)  # (M, M) distance matrix

Create edges when distance < threshold:

for i, idx_i in enumerate(alive_indices):
    for j, idx_j in enumerate(alive_indices):
        # Index the square matrix; `distances` is the condensed 1-D form
        if i != j and distances_matrix[i, j] < distance_threshold:
            edge_src.append(idx_i)
            edge_dst.append(idx_j)
            edge_t.append(t_idx)

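Default threshold: 20.0, in the units of the raw coordinates. Because x and y are in pixels and z is in micrometers, the distance mixes units unless the coordinates are first rescaled to a common unit.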

Bidirectional: Both (i→j) and (j→i) edges are added to represent undirected spatial proximity.
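
For reference, the double loop above can be replaced by a broadcasting-based version. This is a minimal sketch in plain NumPy, not the pipeline's implementation; the function name `spatial_edges_at_t` is hypothetical.

```python
import numpy as np


def spatial_edges_at_t(coords, alive_indices, t_idx, distance_threshold=20.0):
    """Return (src, dst, t) arrays for all ordered pairs closer than the threshold."""
    diff = coords[:, None, :] - coords[None, :, :]   # (M, M, 3) pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))          # (M, M) Euclidean distances
    close = dist < distance_threshold
    np.fill_diagonal(close, False)                    # drop self-loops
    i, j = np.nonzero(close)                          # both (i, j) and (j, i) survive
    src = alive_indices[i].astype(np.int32)
    dst = alive_indices[j].astype(np.int32)
    t = np.full(src.shape, t_idx, dtype=np.int32)
    return src, dst, t


# Three alive cells with node indices 4, 7, 9; only the first two are within range.
coords = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [100.0, 0.0, 0.0]])
alive_indices = np.array([4, 7, 9])
src, dst, t = spatial_edges_at_t(coords, alive_indices, t_idx=5)
print(list(zip(src.tolist(), dst.tolist(), t.tolist())))
```

The (M, M, 3) intermediate is small for a few hundred cells per timepoint; for much larger M a spatial index would be a better fit.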

Step 4: Build Lineage Edges (Directed)

C. elegans sublineage names append one letter per division:

AB divides → ABa + ABp (anterior vs posterior)
ABa divides → ABal + ABar (left vs right)
ABal divides → ABala + ABalp (anterior vs posterior)

Algorithm:

for cell in alive_cells_at_time_t:
    if cell[-1].isalpha():  # Last char is alphabetic
        parent_name = cell[:-1]  # Remove last letter
        if parent_name in cell_to_idx:
            p_idx = cell_to_idx[parent_name]
            c_idx = cell_to_idx[cell]
            edge_src.append(p_idx)
            edge_dst.append(c_idx)
            edge_t.append(t_idx)
            # Directed: parent → child (division arrow)

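Key: This reconstructs parent → daughter relationships directly from cell names, without requiring explicit lineage tables. Founder cells whose names are not formed by appending a letter to the parent (for example P1, EMS, MS, E) receive no parent edge under this rule.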
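
If founder-cell relationships are also needed, one option is to combine suffix stripping with an explicit table of the early founder divisions. The sketch below is an extension beyond the algorithm described above, not part of the pipeline; `FOUNDER_PARENT` and `parent_of` are hypothetical names, and the table lists the standard early divisions of the C. elegans embryo.

```python
# Founder divisions that suffix stripping cannot recover (assumed extension).
FOUNDER_PARENT = {
    "AB": "P0", "P1": "P0",
    "EMS": "P1", "P2": "P1",
    "MS": "EMS", "E": "EMS",
    "C": "P2", "P3": "P2",
    "D": "P3", "P4": "P3",
    "Z2": "P4", "Z3": "P4",
}


def parent_of(name):
    """Return the parent cell name, or None for the zygote or unrecognized names."""
    if name in FOUNDER_PARENT:
        return FOUNDER_PARENT[name]
    # Sublineage names end in a lowercase axis letter (a/p, l/r, d/v).
    if len(name) < 2 or not name[-1].islower():
        return None
    return name[:-1]


for cell in ["ABal", "Ea", "MS", "AB", "P0"]:
    print(cell, "->", parent_of(cell))
```

An edge would still only be emitted when the returned parent is present in cell_to_idx, as in the algorithm above.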

2.3 Preprocessing Script

File: scripts/preprocess_dataset.py

Iterates over all raw EPIC CSVs and outputs one NPZ per embryo.

python scripts/preprocess_dataset.py \
    --raw_dir dataset/raw \
    --out dataset/processed/by_embryo \
    --distance_threshold 20

Processing Loop:

for file_path in sorted(Path("dataset/raw").glob("*.csv")):
    X, alive, edge_src, edge_dst, edge_t, index = preprocess_epic_file_sparse(
        file_path,
        distance_threshold=20.0
    )
    # Invert cell_to_idx into an array ordered by node index
    idx_to_cell_array = np.array(
        sorted(index.cell_to_idx, key=index.cell_to_idx.get), dtype=object
    )
    # Save as compressed NPZ
    np.savez_compressed(
        f"dataset/processed/by_embryo/{file_path.stem}.npz",
        X=X,
        alive_mask=alive,
        edge_src=edge_src,
        edge_dst=edge_dst,
        edge_t=edge_t,
        idx_to_cell=idx_to_cell_array,
        t0=index.t0,
        T=index.T,
        source_file=file_path.name
    )

Output manifest:

Creates dataset/processed/by_embryo/manifest.txt listing all 260 output files

3. Output Data: Processed NPZ Format

3.1 Output Location & Structure

dataset/processed/
└── by_embryo/
    ├── CD011505_end1red_bright.npz
    ├── CD011605_5a_bright.npz
    ├── ...  (258 more files)
    └── manifest.txt

3.2 NPZ Archive Contents

Each .npz file is a NumPy compressed archive containing a single embryo.

Load example:

data = np.load("CD011605_5a_bright.npz", allow_pickle=True)
X = data["X"]              # (N, d, T) float32
alive_mask = data["alive_mask"]
edge_src = data["edge_src"]
edge_dst = data["edge_dst"]
edge_t = data["edge_t"]
idx_to_cell = data["idx_to_cell"]
t0 = int(data["t0"])
T = int(data["T"])
source_file = str(data["source_file"])
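
A small round-trip shows why the load example uses allow_pickle=True and the int(...) / str(...) conversions: the object-dtype name array needs pickling, and scalars come back as 0-d arrays. This sketch writes a toy archive to a temporary directory; the file name and contents are made up for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_embryo.npz"  # hypothetical file name
    np.savez_compressed(
        path,
        X=np.zeros((2, 5, 3), dtype=np.float32),
        idx_to_cell=np.array(["AB", "ABa"], dtype=object),
        t0=1,
        T=3,
        source_file="toy_embryo.csv",
    )
    # The context manager closes the underlying zip file when done.
    with np.load(path, allow_pickle=True) as data:
        keys = sorted(data.files)
        t0 = int(data["t0"])                     # 0-d array -> Python int
        source_file = str(data["source_file"])   # 0-d string array -> str
        names = data["idx_to_cell"].tolist()

print(keys, t0, source_file, names)
```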

3.3 Tensor Specifications

X: Node Feature Tensor (N, d, T)

Dimension	Size	Type	Meaning
N	688	int	Number of distinct cells in embryo
d	5	int	Feature dimension: [x, y, z, size, blot]
T	210	int	Time steps (1-indexed → 0-indexed)

Shape: (688, 5, 210) for CD011605_5a_bright

Data:

X[c_idx, :, t_idx] = [x, y, z, size, blot]

x, y (pixels): 2D position in image plane
z (μm): Depth from microscope focal plane
size (AU): Cell volume / morphological size
blot (AU): Fluorescence intensity (cell identity marker)

Masking:

Unborn cells: X[c_idx, :, t_idx] = [0, 0, 0, 0, 0] (zero vector)
Alive tracking: Use alive_mask[c_idx, t_idx] to identify valid measurements

alive_mask: Birth/Alive Tracking (N, T)

Dimension	Size	Type	Meaning
N	688	bool	Cell index
T	210	bool	Timepoint (True = observed, False = not yet born or already divided)

Shape: (688, 210)

Interpretation:

if alive_mask[c_idx, t_idx]:
    # Cell c_idx is alive (observed) at time t_idx
    # X[c_idx, :, t_idx] contains valid measurements
else:
    # Cell c_idx is not observed at t_idx (not yet born, or already divided);
    # X[c_idx, :, t_idx] is all zeros

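Example (Cell ABal, born at time 25, with t0 = 1):

alive_mask[ABal_idx, 0:24]  = [False, False, ..., False]  (24 False values)
alive_mask[ABal_idx, 24:k]  = [True, True, ..., True]     (True until ABal divides at index k)
alive_mask[ABal_idx, k:]    = [False, False, ..., False]  (ABal is replaced by its daughters)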
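
The first and last observed index of every cell can be read off alive_mask directly. The following is a minimal sketch on a toy mask; the variable names are illustrative.

```python
import numpy as np

# Toy mask: cell 0 is observed at t=0..1, cell 1 at t=2..3.
alive_mask = np.array([
    [True,  True,  False, False],
    [False, False, True,  True],
])
T = alive_mask.shape[1]

# argmax returns the first True along the time axis.
first_seen = alive_mask.argmax(axis=1)
# Reversing time turns "last True" into "first True".
last_seen = T - 1 - alive_mask[:, ::-1].argmax(axis=1)
# argmax is 0 for all-False rows, so flag those separately.
never_seen = ~alive_mask.any(axis=1)

print(first_seen, last_seen, never_seen)
```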

Sparse Edge Lists: (edge_src, edge_dst, edge_t)

Instead of a full dense adjacency tensor A ∈ ℝ^(N×N×T) (688×688×210 = ~100M entries), edges are stored as three sparse arrays.

Array	Shape	Type	Meaning
edge_src	(E,)	int32	Source node indices
edge_dst	(E,)	int32	Destination node indices
edge_t	(E,)	int32	Timepoint (0-indexed)

Total edges (E): ~56,000 for CD011605_5a_bright

Reconstruction: From sparse to dense at time t:

A_t = np.zeros((N, N), dtype=float)
mask = edge_t == t
for k in np.where(mask)[0]:
    src = edge_src[k]
    dst = edge_dst[k]
    A_t[src, dst] = 1

Edge Types:

Spatial edges (undirected, ≈45,000 edges/embryo):
- Created when distance(cell_i, cell_j) < 20 μm at time t
- Bidirectional: both (i→j) and (j→i) present
- Represents physical cell-cell contact / proximity
- Frequency varies with time (more cells = denser graph as development progresses)
Lineage edges (directed, ~1,000–5,000 edges/embryo):
- Parent → daughter division (ABa → ABal)
- Emitted at every timepoint at which the daughter is observed, following the Step 4 algorithm
- Reconstructed from C. elegans naming conventions
- Example: At time t=25, lineage edges link each cell observed at t to its parent, provided the parent is in the index
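
Because both edge types share the same three arrays, they have to be told apart after loading. One way, sketched below, is the naming rule itself: an edge is treated as a lineage edge when the source name equals the destination name minus its last letter. This assumes a parent and its daughter are never observed at the same timepoint, so no spatial edge can match the rule; `split_edges` is a hypothetical helper.

```python
import numpy as np


def split_edges(idx_to_cell, edge_src, edge_dst):
    """Return boolean masks (is_lineage, is_spatial) over the edge arrays."""
    names = np.asarray(idx_to_cell, dtype=object)
    src_names = names[edge_src]
    dst_names = names[edge_dst]
    is_lineage = np.array(
        [d[:-1] == s for s, d in zip(src_names, dst_names)], dtype=bool
    )
    return is_lineage, ~is_lineage


idx_to_cell = ["AB", "ABa", "ABp"]
edge_src = np.array([0, 1, 2, 0], dtype=np.int32)
edge_dst = np.array([1, 2, 1, 2], dtype=np.int32)
is_lineage, is_spatial = split_edges(idx_to_cell, edge_src, edge_dst)
print(is_lineage, is_spatial)
```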

idx_to_cell: Cell Name Mapping (N,)

Field	Type	Meaning
idx_to_cell	(N,) object	Inverse of cell_to_idx

Shape: (688,) of dtype object (strings)

Content:

idx_to_cell[0] = "AB"
idx_to_cell[1] = "ABa"
idx_to_cell[2] = "ABal"
...
idx_to_cell[687] = "Zrp1aaa"  (last name in sorted order)

Usage: Convert predictions back to biological names:

predicted_idx = 42
cell_name = idx_to_cell[predicted_idx]  # e.g. "ABpr"; the actual name depends on the embryo's index

Metadata: t0, T, source_file

Field	Type	Meaning
t0	int32	First timepoint (usually 1)
T	int32	Total timepoints (210 for this embryo)
source_file	str	Original filename (e.g., “CD011605_5a_bright.csv”)

Used for:

Validating shape consistency across a batch
Tracking provenance to raw data
Reconstructing absolute timepoints: absolute_time = t0 + t_idx

3.4 Example: Spot-Check Statistics (CD011605_5a_bright.npz)

import numpy as np

data = np.load("dataset/processed/by_embryo/CD011605_5a_bright.npz", allow_pickle=True)

# Shapes
print("X shape:", data["X"].shape)              # (688, 5, 210)
print("alive_mask shape:", data["alive_mask"].shape)  # (688, 210)
print("Edges:", len(data["edge_src"]))          # 56,605 edges

# Cell names (alphabetical index order, not birth order)
print("First cells in index:", data["idx_to_cell"][:10])
# e.g. ['AB', 'ABa', 'ABal', ...]

# Lineage verification
idx_ABa = np.where(data["idx_to_cell"] == "ABa")[0][0]
idx_ABal = np.where(data["idx_to_cell"] == "ABal")[0][0]
# Find edges where ABa → ABal
lineage_edges = (data["edge_src"] == idx_ABa) & (data["edge_dst"] == idx_ABal)
print("ABa → ABal edges:", lineage_edges.sum())  # ≈14 (one per timepoint from division onward)

# Data range
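X = data["X"]
alive = data["alive_mask"]
print("X statistics (only alive cells):")
# Select with alive_mask rather than "> 0": valid values such as blot can be zero or negative
print("x range:", X[:, 0, :][alive].min(), "-", X[:, 0, :][alive].max())
print("blot range:", X[:, 4, :][alive].min(), "-", X[:, 4, :][alive].max())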

4. Biological Interpretation

4.1 C. elegans Cell Nomenclature

The cell naming system encodes complete lineage information hierarchically:

AB          : Founder cell (anterior daughter of the zygote P0)
├─ ABa      : Anterior daughter (after the AB division)
│  ├─ ABal  : Left daughter of ABa
│  │ ├─ ABala   : Anterior daughter of ABal
│  │ └─ ABalp   : Posterior daughter of ABal
│  └─ ABar  : Right daughter of ABa
└─ ABp      : Posterior daughter
   ├─ ABpl  : Left daughter of ABp
   └─ ABpr  : Right daughter of ABp

Naming rules:

Each mother cell divides into exactly 2 daughters
Daughters named by appending single letters: l/r (left/right), a/p (anterior/posterior), d/v (dorsal/ventral)
Last character removal = parent name

Example lineage edges:

AB → ABa (first division)
ABa → ABal (second division)
ABal → ABalp (third division)
ABalp → ABalpa (fourth division)

4.2 Biological Meaning of Features & Graphs

Node Features (X)

Spatial (x, y, z): Track cell migration during development
size: Reflects cell volume changes during division cycle
blot: Fluorescence marker; indicates cell identity or developmental state

Spatial Edges

Physical interactions: Cells within 20 μm likely touching
Tissue context: Neighboring cells influence morphology & gene expression
Time-varying: Edges appear/disappear as cells migrate

Lineage Edges

Developmental ancestry: Directed acyclic graph (DAG) of cell divisions
Biological correctness: Encoded directly in C. elegans naming (not inferred)
Genealogy within sublineages: Every suffix-named cell links to its parent; founder-cell divisions (e.g., EMS → MS + E) are not captured by suffix stripping

4.3 Embryonic Development Timeline Example

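CD011605_5a_bright: 210 timepoints. The frame interval is not stored in the CSV; at the roughly one-minute sampling typical of EPIC recordings, this would span about 3.5 hours of development.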

Time	Development Stage	Key Events
1–10	Early blastomere	AB cell at ~5 μm
10–25	Early cleavage	ABa/ABp born (1st div); ~4 cells observed
25–50	Early cleavage	ABal/ABar cells born; ~15–30 cells
50–100	Mid cleavage	~100–200 cells
100–150	Late cleavage	~400–500 cells
150–210	Early post-cleavage	688 distinct cell names observed over the full recording

The alive_mask captures this: early timepoints are mostly False, later timepoints fill in as cells are born, and a cell's entries return to False once it divides.
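
Cell counts like those in the table can be derived from alive_mask. The sketch below, on a toy mask, distinguishes the number of cells observed at each timepoint from the cumulative number of distinct names seen so far; the two differ because mothers disappear when they divide.

```python
import numpy as np

# Toy mask: one mother (row 0) divides into two daughters (rows 1 and 2) at t=2.
alive_mask = np.array([
    [True,  True,  False, False],
    [False, False, True,  True],
    [False, False, True,  True],
])

# Cells observed at each timepoint.
cells_per_t = alive_mask.sum(axis=0)

# Distinct names seen at or before each timepoint.
names_so_far = (np.cumsum(alive_mask, axis=1) > 0).sum(axis=0)

print(cells_per_t, names_so_far)
```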

5. Memory & Performance

5.1 Storage Comparison: Dense vs. Sparse

Dense adjacency tensor would require:

A ∈ ℝ^(N×N×T) = (688 × 688 × 210 × 8 bytes)
                ≈ 758 MiB per embryo
                ≈ 193 GiB (all 260 embryos, if each were this size)

Our sparse representation:

edge_src, edge_dst, edge_t ∈ ℤ^E (56,605 × 3 × 4 bytes)
                               ≈ 0.65 MiB per embryo
All 260 embryos: ~168 MiB (uncompressed, if each were this size)

Reduction: ~1,170×
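
The comparison can be checked with a few lines of arithmetic, using the shapes and edge count quoted in this document for CD011605_5a_bright and assuming float64 for the dense tensor and int32 for the edge arrays.

```python
N, T, E = 688, 210, 56_605
MIB = 2 ** 20

dense_bytes = N * N * T * 8   # float64 adjacency for every timepoint
sparse_bytes = E * 3 * 4      # three int32 arrays of length E

print(f"dense:  {dense_bytes / MIB:.1f} MiB per embryo")
print(f"sparse: {sparse_bytes / MIB:.2f} MiB per embryo")
print(f"ratio:  {dense_bytes / sparse_bytes:.0f}x")
```

Other embryos have different N, T, and E, so dataset-wide totals obtained by multiplying by 260 are only order-of-magnitude estimates.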

5.2 Compressed NPZ Performance

Compression ratio: ~3–5× reduction (sparse data compresses well)
Load time: ~100 ms per embryo (all tensors loaded into RAM)
Total memory (all 260): ~20 GB uncompressed

6. Usage Patterns

6.1 Loading a Single Embryo

import numpy as np

# Load one embryo
npz = np.load("dataset/processed/by_embryo/CD011605_5a_bright.npz", allow_pickle=True)

X = npz["X"]              # (688, 5, 210)
alive_mask = npz["alive_mask"]  # (688, 210)
edge_src = npz["edge_src"]
edge_dst = npz["edge_dst"]
edge_t = npz["edge_t"]
idx_to_cell = npz["idx_to_cell"]

# Extract features at timepoint t=100
t = 100
X_t = X[:, :, t]       # (688, 5)
alive_t = alive_mask[:, t]  # (688,)

# Only look at alive cells
alive_indices = np.where(alive_t)[0]
X_active = X[alive_indices, :, t]  # (M, 5) where M = number of alive cells

print(f"At time {t}: {len(alive_indices)} cells alive")

6.2 Working with Edges

# Get spatial + lineage edges at time t=100
mask = edge_t == t
edges_at_t_src = edge_src[mask]
edges_at_t_dst = edge_dst[mask]

# Build adjacency matrix for time t (N taken from X rather than hard-coded)
N = X.shape[0]
A_t = np.zeros((N, N))
A_t[edges_at_t_src, edges_at_t_dst] = 1

print(f"At time {t}: {mask.sum()} edges")
print(f"Density: {mask.sum() / (N * N):.4f}")
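
When only per-node neighbor counts are needed, the dense matrix can be skipped entirely. This is a small sketch using np.bincount on toy edge arrays; it counts all edges at the chosen timepoint, spatial and lineage alike.

```python
import numpy as np

N = 4
edge_src = np.array([0, 1, 0, 2, 1], dtype=np.int32)
edge_dst = np.array([1, 0, 2, 0, 3], dtype=np.int32)
edge_t = np.array([0, 0, 0, 0, 1], dtype=np.int32)

t = 0
mask = edge_t == t

# minlength pads the result so isolated nodes get a count of 0.
out_degree = np.bincount(edge_src[mask], minlength=N)
in_degree = np.bincount(edge_dst[mask], minlength=N)

print(out_degree, in_degree)
```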

6.3 Reconstructing Cell Names

# Forward mapping: cell name → index
cell_to_idx = {cell: idx for idx, cell in enumerate(idx_to_cell)}

# Reverse mapping: index → cell name
predicted_nodes = [42, 71, 153]
for idx in predicted_nodes:
    print(f"Node {idx} = cell {idx_to_cell[idx]}")

6.4 Lineage Tree Traversal

def get_lineage_tree(idx_to_cell, edge_src, edge_dst, edge_t):
    """Extract parent-daughter relationships (edge_t is accepted but not needed)."""
    lineage_graph = {}

    for src, dst in zip(edge_src, edge_dst):
        src_name = idx_to_cell[src]
        dst_name = idx_to_cell[dst]

        # Identify lineage edges: parent name is the daughter name minus its last letter
        if src_name == dst_name[:-1]:
            # A set absorbs the repeats from edges stored once per timepoint
            lineage_graph.setdefault(src_name, set()).add(dst_name)

    # Sorted lists give a stable, readable result
    return {parent: sorted(daughters) for parent, daughters in lineage_graph.items()}

lineage = get_lineage_tree(idx_to_cell, edge_src, edge_dst, edge_t)
print("AB daughters:", lineage["AB"])       # ['ABa', 'ABp']
print("ABa daughters:", lineage["ABa"])     # ['ABal', 'ABar']

7. File Manifest & Provenance

7.1 manifest.txt

Lists all 260 processed embryo filenames:

CD011505_end1red_bright.npz
CD011605_5a_bright.npz
CD030906_dyf7red.npz
...
CD20091127_pgp-2_5_L2.npz

Use to:

Verify all 260 embryos preprocessed
Loop over all outputs programmatically
Track which raw files were processed

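from pathlib import Path

manifest_path = Path("dataset/processed/by_embryo/manifest.txt")
# Skip blank lines so a trailing newline does not count as an entry
embryo_files = [line.strip() for line in manifest_path.read_text().splitlines() if line.strip()]
print(f"Total embryos: {len(embryo_files)}")  # 260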

7.2 Provenance Tracking

Each NPZ embeds metadata for traceability:

npz = np.load("CD011605_5a_bright.npz", allow_pickle=True)
print("Original file:", npz["source_file"])  # "CD011605_5a_bright.csv"
print("Time range:", int(npz["t0"]), "to", int(npz["t0"]) + int(npz["T"]) - 1)

8. Quality Control & Validation

8.1 Sanity Checks

def validate_npz(npz):
    """Verify tensor integrity."""
    X = npz["X"]
    alive = npz["alive_mask"]
    edge_src = npz["edge_src"]
    edge_dst = npz["edge_dst"]
    edge_t = npz["edge_t"]
    idx_to_cell = npz["idx_to_cell"]
    
    N, d, T = X.shape
    
    # Check shape consistency
    assert alive.shape == (N, T), f"alive_mask shape mismatch"
    assert d == 5, f"Feature dimension must be 5, got {d}"
    
    # Check index validity
    assert np.max(edge_src) < N and np.min(edge_src) >= 0
    assert np.max(edge_dst) < N and np.min(edge_dst) >= 0
    assert np.max(edge_t) < T and np.min(edge_t) >= 0
    
    # Check alive_mask consistency with X
    for t in range(T):
        unborn_mask = ~alive[:, t]
        X_unborn = X[unborn_mask, :, t]
        assert np.allclose(X_unborn, 0), f"Unborn cells should have zero features at t={t}"
    
    # Check cell names
    assert len(idx_to_cell) == N, f"idx_to_cell length mismatch"
    assert len(np.unique(idx_to_cell)) == N, f"Duplicate cell names"
    
    print("✓ All checks passed")

validate_npz(npz)
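
A further check that may be useful is edge symmetry: spatial edges are stored in both directions, so any edge without a mirrored partner at the same timepoint should be a lineage edge. The helper below is a hypothetical sketch, not part of validate_npz.

```python
import numpy as np


def edges_without_reverse(edge_src, edge_dst, edge_t):
    """Boolean mask of edges (s, d, t) for which (d, s, t) is absent."""
    triples = list(zip(edge_src.tolist(), edge_dst.tolist(), edge_t.tolist()))
    present = set(triples)
    return np.array([(d, s, t) not in present for s, d, t in triples], dtype=bool)


# Toy data: 0<->1 is a mirrored spatial pair, 2->3 is one-way (lineage-like).
edge_src = np.array([0, 1, 2], dtype=np.int32)
edge_dst = np.array([1, 0, 3], dtype=np.int32)
edge_t = np.array([0, 0, 0], dtype=np.int32)

one_way = edges_without_reverse(edge_src, edge_dst, edge_t)
print(one_way)
```

Comparing the one-way edges against the naming rule (parent name equals daughter name minus its last letter) would flag any spatial edge that lost its mirror.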

8.2 Per-Embryo Statistics Template

from pathlib import Path

import numpy as np

def summarize_embryo(npz_path):
    """Print detailed summary of one embryo."""
    npz = np.load(npz_path, allow_pickle=True)
    X = npz["X"]
    alive = npz["alive_mask"]
    edge_src = npz["edge_src"]
    edge_dst = npz["edge_dst"]
    names = npz["idx_to_cell"]
    
    N, d, T = X.shape
    
    # Lineage edges link a parent to a daughter whose name adds one letter
    n_lineage = sum(names[s] == names[c][:-1] for s, c in zip(edge_src, edge_dst))
    
    print(f"\n{Path(npz_path).stem}")
    print("=" * 50)
    print(f"Cells (N):                {N}")
    print(f"Features (d):             {d}")
    print(f"Timepoints (T):           {T}")
    print(f"Total edges:              {len(edge_src)}")
    print(f"Lineage edges:            {n_lineage}")
    print(f"Spatial edges:            {len(edge_src) - n_lineage}")
    print("Birth times:")
    
    # Birth index for each cell (first True along the time axis)
    births = alive.argmax(axis=1)
    print(f"  Earliest: t={births.min()}")
    print(f"  Latest:   t={births.max()}")
    
    # Lifespan stats (number of timepoints at which each cell is observed)
    lifespans = alive.sum(axis=1)
    print(f"  Mean lifespan: {lifespans[lifespans > 0].mean():.1f} timepoints")
    print(f"  Max lifespan:  {lifespans.max()}")

summarize_embryo("dataset/processed/by_embryo/CD011605_5a_bright.npz")

9. Common Operations & Recipes

9.1 Batch Loading

import numpy as np
from pathlib import Path

def load_all_embryos(base_dir="dataset/processed/by_embryo", max_embryos=None):
    """Load all processed embryos with consistent filtering."""
    npz_files = sorted(Path(base_dir).glob("*.npz"))
    if max_embryos:
        npz_files = npz_files[:max_embryos]
    
    embryos = []
    for fp in npz_files:
        npz = np.load(fp, allow_pickle=True)
        embryos.append({
            "name": fp.stem,
            "X": npz["X"],
            "alive": npz["alive_mask"],
            "edges": (npz["edge_src"], npz["edge_dst"], npz["edge_t"]),
            "idx_to_cell": npz["idx_to_cell"],
        })
    return embryos

embryos = load_all_embryos(max_embryos=10)
print(f"Loaded {len(embryos)} embryos")

9.2 Time-Averaged Graphs

def get_time_averaged_graph(npz, start_t, end_t):
    """Fraction of timepoints in [start_t, end_t) at which each edge is present."""
    edge_src = npz["edge_src"]
    edge_dst = npz["edge_dst"]
    edge_t = npz["edge_t"]
    N = npz["X"].shape[0]
    
    mask = (edge_t >= start_t) & (edge_t < end_t)
    A = np.zeros((N, N))
    # np.add.at accumulates repeated (src, dst) pairs; plain indexing would count each once
    np.add.at(A, (edge_src[mask], edge_dst[mask]), 1)
    
    return A / (end_t - start_t)

A_cleavage = get_time_averaged_graph(npz, start_t=0, end_t=100)
A_migration = get_time_averaged_graph(npz, start_t=100, end_t=210)

print("Cleavage graph density:", np.count_nonzero(A_cleavage) / A_cleavage.size)
print("Migration graph density:", np.count_nonzero(A_migration) / A_migration.size)

9.3 Feature Histograms

import matplotlib.pyplot as plt

def plot_feature_stats(npz):
    """Visualize feature distributions (alive cells only)."""
    X = npz["X"]
    alive = npz["alive_mask"]
    feature_names = ["x (px)", "y (px)", "z (μm)", "size (AU)", "blot (AU)"]
    
    fig, axes = plt.subplots(1, 5, figsize=(15, 3))
    for d in range(5):
        # Select by alive_mask only; valid measurements may be zero or negative (e.g. blot)
        X_alive = X[:, d, :][alive]
        axes[d].hist(X_alive, bins=50, alpha=0.7)
        axes[d].set_title(feature_names[d])
        axes[d].set_xlabel("Value")
        axes[d].set_ylabel("Count")
    plt.tight_layout()
    plt.savefig("feature_stats.png")
    plt.show()

plot_feature_stats(npz)

10. Summary: Data Pipeline

┌─────────────────────────────────────────────────────────┐
│                    RAW EPIC CSVs                         │
│  dataset/raw/*.csv (260 files)                           │
│  Format: cell, time, x, y, z, size, blot, ...           │
│  N×T sparse observations per embryo                     │
└────────────────────┬────────────────────────────────────┘
                     │
                     │ scripts/preprocess_dataset.py
                     │ + src/epic_preprocess.py
                     │
        ┌────────────▼────────────┐
        │  Per-Embryo Conversion  │
        │                         │
        │  • build_local_index    │
        │  • populate_X,alive_mk  │
        │  • spatial_edges        │
        │  • lineage_edges        │
        │                         │
        └────────────┬────────────┘
                     │
┌────────────────────▼────────────────────────────────────┐
│          PROCESSED NPZ TENSORS                           │
│  dataset/processed/by_embryo/*.npz (260 files)          │
│                                                          │
│  Per-embryo (local):                                    │
│  • X[N, d=5, T]: node features (sparse via mask)       │
│  • alive_mask[N, T]: cell birth/lifespan tracking       │
│  • edge_src/dst/t: sparse timepoint graphs              │
│  • idx_to_cell[N]: cell name mapping                    │
│  • Metadata: t0, T, source_file                         │
│                                                          │
│  Example embryo: 688 cells, ~57k edges                  │
│  Edge lists: ~168 MiB uncompressed (all embryos)        │
└────────────────────┬────────────────────────────────────┘
                     │
                     │ Model training / analysis
                     │
        ┌────────────▼────────────┐
        │  Graph Neural Network   │
        │  Training & Inference   │
        │                         │
        │  • Spatio-temporal GNN  │
        │  • Supervised learning  │
        │  • Cell prediction      │
        │  • Edge attention       │
        │                         │
        └─────────────────────────┘

11. Troubleshooting

Issue: Missing cells in early times

Cause: Cells not yet born (before division)
Solution: Check alive_mask before using X data

Issue: Disconnected spatial graph

Cause: distance_threshold too small
Solution: Increase threshold (e.g., 30 instead of 20)
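
To see whether the graph at a given timepoint is actually disconnected, the connected components can be counted with a small union-find. This is a sketch with a hypothetical helper name; edges whose endpoints are not in the alive set (such as lineage edges from an already divided parent) are ignored.

```python
import numpy as np


def count_components(alive_nodes, src, dst):
    """Count connected components among alive_nodes using the given edges."""
    parent = {int(n): int(n) for n in alive_nodes}

    def find(n):
        # Path halving keeps the trees shallow.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for s, d in zip(src.tolist(), dst.tolist()):
        if s in parent and d in parent:
            parent[find(s)] = find(d)

    return len({find(n) for n in parent})


# Toy timepoint: nodes 0-1 are linked, 2-3 are linked, nothing joins the two groups.
alive_nodes = np.array([0, 1, 2, 3])
src = np.array([0, 1, 2], dtype=np.int32)
dst = np.array([1, 0, 3], dtype=np.int32)
n_components = count_components(alive_nodes, src, dst)
print(n_components)
```

Running this per timepoint for a few candidate thresholds gives a direct way to pick a value that keeps the graph connected.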

Issue: Memory error loading all 260 embryos

Cause: RAM limit (need ~20 GB for full batch)
Solution: Load per-embryo or use data generator
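
A generator is one way to load per-embryo without holding the whole dataset in memory. The sketch below is an assumption about how such a loader could look, not existing pipeline code; it is demonstrated on toy archives written to a temporary directory.

```python
import tempfile
from pathlib import Path

import numpy as np


def iter_embryos(base_dir):
    """Yield (name, arrays) one embryo at a time, closing each archive after reading."""
    for fp in sorted(Path(base_dir).glob("*.npz")):
        with np.load(fp, allow_pickle=True) as npz:
            arrays = {key: npz[key] for key in npz.files}
        yield fp.stem, arrays


# Demo on two toy archives (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b_embryo", "a_embryo"):
        np.savez_compressed(Path(tmp) / f"{name}.npz", X=np.zeros((2, 5, 3), dtype=np.float32))
    seen = [(name, arrays["X"].shape) for name, arrays in iter_embryos(tmp)]

print(seen)
```

Only one embryo's arrays are alive at a time as long as the caller does not keep references to earlier ones.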

Issue: Cell names don’t match lineage

Cause: Malformed cell names in raw CSV
Solution: Normalize cell names before indexing (strip whitespace); do not change case, because names such as ABal are case-sensitive

References

EPIC Consortium: http://epic.gs.washington.edu/
C. elegans anatomy: http://www.wormatlas.org/
Cell nomenclature standard: Sulston et al. (1983). The embryonic cell lineage of the nematode Caenorhabditis elegans. Developmental Biology 100(1): 64–119

Document Version: 1.0
Last Updated: April 2026
Pipeline Code: src/epic_preprocess.py
