EPIC Database Schema Reference
Dataset Specification
Project: Spatio-Temporal Evolution (C. elegans EPIC embryos)
Format Version: 1.0
Schema Date: April 2026
Overview
Raw Data (260 CSV files)
↓
Per-embryo processing
↓
Processed Data (260 NPZ archives)
1. Input Schema: Raw EPIC CSV
File Format
- Location:
dataset/raw/*.csv - Format: CSV (comma-delimited)
- Count: 260 files (one per embryo)
- Encoding: UTF-8
Column Specification
| Column | Type | Unit | Range | Required | Used? | Notes |
|---|---|---|---|---|---|---|
cellTime |
string | — | — | ✓ | ✗ | Row ID (e.g., “AB:1”); duplicate of cell:time |
cell |
string | — | — | ✓ | ✓ | C. elegans cell name (e.g., “AB”, “ABal”, “Zrp1aaa”) |
time |
integer | frame | 1–210 | ✓ | ✓ | Absolute timepoint (1-indexed) |
none |
integer | — | — | ✓ | ✗ | Unknown field; ignored |
global |
integer | — | — | ✓ | ✗ | Global offset; unused |
local |
integer | — | — | ✓ | ✗ | Local offset; unused |
blot |
float | AU | 100–10M | ✓ | ✓ | Primary identifier: fluorescence intensity |
cross |
float | — | — | ✓ | ✗ | Cross-correlation; unused |
z |
float | μm | 0–200 | ✓ | ✓ | Depth (focal plane offset) |
x |
float | pixel | 0–512 | ✓ | ✓ | Horizontal position (X-axis) |
y |
float | pixel | 0–512 | ✓ | ✓ | Vertical position (Y-axis) |
size |
float | AU | 10–5000 | ✓ | ✓ | Cell volume / morphological size |
gweight |
float | — | — | ✓ | ✗ | Image intensity weight; unused |
Constraints
- Unique key: (cell, time) — only one measurement per cell per timepoint
- Cell naming: C. elegans nomenclature compliant (alphanumeric + letters encode lineage)
- Time ordering: Rows typically in chronological order but not guaranteed
- No missing values: All columns present in every row (though some unused)
- Sparse data: Only living cells appear; births create new rows
Example Record
cellTime,cell,time,none,global,local,blot,cross,z,x,y,size,gweight
ABal:25,ABal,25,22722,-2278,10,815432,-2278,19.2,166,257,74,815432
2. Processing Schema: Transformation Rules
Input Transformation
Cell Name Normalization
Input: arbitrary string from CSV
ProcessProcessing:
1. Strip whitespace
2. Ensure alphabetic characters only (or underscores for legacy names)
3. Preserve case (C. elegans uses ABa vs ABp, not aba vs abp)
Output: cell_to_idx mapping (sorted alphabetically)
Time Indexing
Input: time (1-indexed, 1 to T_max)
Processing:
1. t0 = min(time_in_csv)
2. T = max(time_in_csv) - t0 + 1
3. t_idx = time - t0 (convert to 0-indexed)
Output: t_idx in range [0, T-1]
Feature Extraction
Input: x, y, z, size, blot columns
Processing:
- Convert all to float32
- Verify non-negative (except z can be ~0)
- Stack into features array [x, y, z, size, blot]
Output: X[c_idx, :, t_idx] = [x, y, z, size, blot]
Alive Masking
Input: presence of (cell, time) row in CSV
Processing:
- If row exists: alive_mask[c_idx, t_idx] = True
- If no row: alive_mask[c_idx, t_idx] = False (never populated)
Output: alive_mask[N, T] boolean array
Edge Construction
Spatial edges (undirected proximity):
Input: (x, y, z) coordinates of all alive cells at time t
Processing:
1. Compute pairwise Euclidean distance
2. For each pair (i, j) with distance < threshold (20 μm):
- Add edge i → j
- Add edge j → i (undirected)
3. Store as (edge_src, edge_dst, edge_t) triplets
Output: ~45,000 edges per embryo (mostly spatial)
Lineage edges (directed ancestry):
Input: cell names at time t
Processing:
1. For each cell alive at time t:
- If last character is alphabetic:
parent = cell[:-1] (remove last character)
if parent exists and is alive:
add edge: parent → cell
2. Store as (edge_src, edge_dst, edge_t) triplets
Output: ~5,000 edges per embryo (genealogy)
Caveat: Only applies when last char is alpha (not for "P1", cells with underscores, etc.)
3. Output Schema: Processed NPZ Archive
Archive Format
- Type: NumPy
.npz(compressed NumPy archive, ZIP format) - Compression: DEFLATE (automatic via
np.savez_compressed()) - Compression ratio: ~3–5× typical for this data
Archive Contents
X: Node Feature Tensor
Path in NPZ: X
Type: np.ndarray
DType: float32
Shape: (N, d, T) where:
- N = number of cells (varies per embryo, typically 600–750)
- d = 5 (fixed feature dimension)
- T = number of timepoints (varies per embryo, typically 150–250)
Semantics:
X[c_idx, f_idx, t_idx] = value
where:
c_idx ∈ [0, N-1] — cell index
f_idx ∈ {0,1,2,3,4} — feature: 0=x, 1=y, 2=z, 3=size, 4=blot
t_idx ∈ [0, T-1] — timepoint (0-indexed)
value ∈ ℝ or 0 — feature value or 0 if unborn
Constraints:
if alive_mask[c_idx, t_idx] == False:
X[c_idx, :, t_idx] == [0, 0, 0, 0, 0] (all zeros)
if alive_mask[c_idx, t_idx] == True:
X[c_idx, 0, t_idx] ∈ [0, ~512] (x, pixels)
X[c_idx, 1, t_idx] ∈ [0, ~512] (y, pixels)
X[c_idx, 2, t_idx] ∈ [0, ~200] (z, micrometers)
X[c_idx, 3, t_idx] ∈ [10, ~5000] (size, arbitrary units)
X[c_idx, 4, t_idx] ∈ [100, 10M] (blot, arbitrary units)
Feature Definitions:
| Index | Name | Unit | Physical Meaning |
|---|---|---|---|
| 0 | x | pixels | Horizontal position |
| 1 | y | pixels | Vertical position |
| 2 | z | micrometers | Depth from focal plane |
| 3 | size | AU | Cell volume / morphology |
| 4 | blot | AU | Fluorescence intensity (identity marker) |
alive_mask: Birth/Alive Tracking
Path in NPZ: alive_mask
Type: np.ndarray
DType: bool
Shape: (N, T)
Semantics:
alive_mask[c_idx, t_idx] = True if cell is observed at this timepoint
alive_mask[c_idx, t_idx] = False if cell not yet born or unobserved
Constraints:
Monotonicity: Once False, always False before first True (no resurrection)
if alive_mask[c_idx, t] == True and t' < t:
then ∃ t0 such that:
alive_mask[c_idx, 0:t0] == False
alive_mask[c_idx, t0:] == True
Use Cases:
- Identify cell birth time:
t_birth = argmax(alive_mask[c_idx]) - Extract valid features:
X_valid = X[alive_mask] - Compute lifespan:
lifespan = alive_mask.sum(axis=1)
edge_src, edge_dst, edge_t: Sparse Edge Lists
Paths in NPZ:
edge_src— source node indicesedge_dst— destination node indicesedge_t— timepoint indices
Type: np.ndarray
DType: int32
Shape: (E,) where E ≈ 56,000 per embryo
Semantics:
An edge exists between node edge_src[k] → edge_dst[k] at time edge_t[k]
if and only if k ∈ [0, E-1]
Constraints:
edge_src[k] ∈ [0, N-1] for all k
edge_dst[k] ∈ [0, N-1] for all k
edge_t[k] ∈ [0, T-1] for all k
Assumption: No duplicate edges (same src, dst, t)
Edge Type Classification:
To distinguish spatial from lineage edges:
def classify_edge(src_idx, dst_idx, idx_to_cell):
src_name = idx_to_cell[src_idx]
dst_name = idx_to_cell[dst_idx]
# Lineage: daughter is parent + one letter
if (len(dst_name) == len(src_name) + 1 and
dst_name.startswith(src_name)):
return "lineage"
else:
return "spatial"
idx_to_cell: Cell Name Mapping
Path in NPZ: idx_to_cell
Type: np.ndarray
DType: object (Python strings)
Shape: (N,)
Semantics:
idx_to_cell[c_idx] = cell_name (string)
Example:
idx_to_cell[0] = "AB"
idx_to_cell[1] = "ABa"
idx_to_cell[2] = "ABal"
...
idx_to_cell[687] = "Zrp1aaa"
Constraints:
All unique (no duplicate cell names)
All alphabetic characters (and hyphens for legacy names)
Sorted alphabetically (for reproducibility)
len(idx_to_cell) == N
Inverse Mapping (derived):
cell_to_idx = {cell: idx for idx, cell in enumerate(idx_to_cell)}
t0, T: Time Metadata
Paths in NPZ:
t0— first absolute timepointT— total number of timepoints
Type: np.ndarray (scalar or 0-d array)
DType: int32
Shape: () (scalar)
Semantics:
t0 = t_min from CSV (typically 1)
T = t_max - t_min + 1
Conversion:
t_absolute = t0 + t_index
t_index = t_absolute - t0
Example:
Raw CSV times: 1, 2, 3, ..., 210
t0 = 1
T = 210
t_index = 24 → t_absolute = 1 + 24 = 25
source_file: Provenance
Path in NPZ: source_file
Type: np.ndarray
DType: object (Python string)
Shape: () (scalar)
Value: Original CSV filename (e.g., “CD011605_5a_bright.csv”)
Use: Trace back to raw data for reprocessing or verification
NPZ Summary Table
| Field | Shape | DType | Size (MB) | Purpose |
|---|---|---|---|---|
| X | (N, 5, T) | float32 | ~13 | Node features |
| alive_mask | (N, T) | bool | ~0.2 | Birth tracking |
| edge_src | (E,) | int32 | ~0.2 | Edge list |
| edge_dst | (E,) | int32 | ~0.2 | Edge list |
| edge_t | (E,) | int32 | ~0.2 | Edge list |
| idx_to_cell | (N,) | object | ~0.02 | Name mapping |
| t0 | () | int32 | ~0.00001 | Metadata |
| T | () | int32 | ~0.00001 | Metadata |
| source_file | () | object | ~0.00003 | Provenance |
| Total (uncompressed) | — | — | ~14 | — |
| Total (compressed) | — | — | ~0.7 | ~20× compression |
4. Relationships & Constraints
Dimension Consistency
X.shape[0] == alive_mask.shape[0] == len(idx_to_cell) == N
X.shape[1] == 5 (features)
X.shape[2] == alive_mask.shape[1] == T
Index Validity
∀ k ∈ [0, E-1]:
edge_src[k] < N
edge_dst[k] < N
edge_t[k] < T
Data Integrity
∀ c ∈ [0, N-1], t ∈ [0, T-1]:
if alive_mask[c, t] == False:
X[c, :, t] == [0, 0, 0, 0, 0]
Uniqueness
len(unique(idx_to_cell)) == N (no duplicate names)
∀ (src, dst, t): unique triplets (no multi-edges)
Temporal Ordering
t0 ≤ t0 + 1 ≤ ... ≤ t0 + T - 1 (times monotonic)
Lineage Consistency
∀ lineage edge (parent_idx, daughter_idx):
parent_name = idx_to_cell[parent_idx]
daughter_name = idx_to_cell[daughter_idx]
daughter_name == parent_name + single_letter
5. Serialization Details
NPZ Format (ZIP-based)
file.npz (actually a ZIP archive)
├── X.npy
├── alive_mask.npy
├── edge_src.npy
├── edge_dst.npy
├── edge_t.npy
├── idx_to_cell.npy
├── t0.npy
├── T.npy
└── source_file.npy
Loading
import numpy as np
# Load
npz = np.load("file.npz", allow_pickle=True)
# Access (two equivalent ways)
X = npz["X"] # Dictionary-like
X = npz.files["X"] # Via file listing
# Close (optional but recommended)
npz.close()
Saving
import numpy as np
np.savez_compressed(
"file.npz",
X=X,
alive_mask=alive_mask,
edge_src=edge_src,
edge_dst=edge_dst,
edge_t=edge_t,
idx_to_cell=idx_to_cell,
t0=np.int32(t0),
T=np.int32(T),
source_file=source_file
)
6. Data Quality Standards
Validation Checklist
Shape consistency (must pass)
assert X.shape[0] == alive.shape[0] == len(idx_to_cell)
assert X.shape[1] == 5
assert X.shape[2] == alive.shape[1]
assert len(edge_src) == len(edge_dst) == len(edge_t)
Index validity (must pass)
assert 0 <= edge_src.min() and edge_src.max() < N
assert 0 <= edge_dst.min() and edge_dst.max() < N
assert 0 <= edge_t.min() and edge_t.max() < T
Masking consistency (must pass)
assert (X[~alive] == 0).all() # Unborn cells all zero
Name uniqueness (must pass)
assert len(np.unique(idx_to_cell)) == N
Reasonable ranges (should pass)
X_alive = X[alive]
assert 0 <= X_alive[:, :3].max() < 1000 # Coordinates
assert 0 < X_alive[:, 3:].min() # Size, blot positive
7. Version History
| Version | Date | Changes |
|---|---|---|
| 1.0 | April 2026 | Initial schema definition; per-embryo tensors, sparse edges |
8. References
Raw EPIC data source: http://epic.gs.washington.edu/
C. elegans nomenclature: Sulston et al. (1983)
Distance threshold: 20 micrometers (standard contact distance)
Schema Document Version: 1.0
Last Updated: April 2026