EPIC Database Documentation Index

Quick Navigation

📖 For First-Time Users

  1. Start here: QUICK_REFERENCE.md (5-minute overview)
  2. Then read: DATABASE_DOCUMENTATION.md, Sections 1–3 (input, processing, output)
  3. Try this: Run python scripts/usage_examples.py to see working code

🏗️ For System Design

  • ARCHITECTURE.md: data flow diagrams, tensor construction, edge building logic

💻 For Implementation

🔍 For Debugging & Validation

📊 For Data Analysis


File Manifest

docs/
├── README.md (this file)
├── DATABASE_DOCUMENTATION.md      (Comprehensive: 11 sections, ~1000 lines)
├── QUICK_REFERENCE.md             (Quick-lookup: 10 sections, ~300 lines)
├── ARCHITECTURE.md                (Visual/technical: data flow, queries)
│
scripts/
├── preprocess_dataset.py           (Main preprocessing script)
├── usage_examples.py               (7 runnable examples)
├── make_figures.py                 (Visualization utils)
│
src/
├── epic_preprocess.py              (Core preprocessing functions)
│   ├── build_local_index()
│   ├── preprocess_epic_file_sparse()
│   └── helper functions
│
dataset/
├── raw/                            (260 *.csv files: raw EPIC input)
│   ├── CD011505_end1red_bright.csv
│   ├── CD011605_5a_bright.csv
│   ├── ... (258 more)
│
└── processed/by_embryo/            (260 *.npz files: processed output)
    ├── CD011505_end1red_bright.npz
    ├── CD011605_5a_bright.npz
    ├── ... (258 more)
    └── manifest.txt

Document Overview

📄 DATABASE_DOCUMENTATION.md (Comprehensive Reference)

Length: ~1200 lines | Audience: Everyone (but especially developers)

| Section | Content | Use Case |
|---------|---------|----------|
| 1 | Raw EPIC CSV format & columns | Understanding input data |
| 2 | Preprocessing pipeline details | How data is transformed |
| 3 | Output NPZ tensor specifications | Data format reference |
| 4 | Biological interpretation | Understanding C. elegans |
| 5 | Memory & performance | System requirements |
| 6 | Usage patterns | How to load/query data |
| 7 | File manifest & provenance | Tracking & validation |
| 8 | QC & validation checks | Data quality assurance |
| 9 | Common operations & recipes | Code examples |
| 10 | Summary pipeline diagram | High-level overview |
| 11 | Troubleshooting | Error handling |

📋 QUICK_REFERENCE.md (Quick Lookup)

Length: ~300 lines | Audience: Experienced users needing fast reference

| Section | Content |
|---------|---------|
| Structure | 5-second file layout |
| Dimensions | Tensor shapes at a glance |
| MWE | Minimal working example |
| Features | Column definitions table |
| Masking | Handling unborn cells |
| Mapping | Cell name ↔ index conversions |
| Edge Lists | Sparse → dense example |
| Contact vs. Lineage | Separating edge types |
| Time Windows | Temporal slicing |
| Batch Loading | Multi-embryo patterns |
| QC Checklist | Validation tests |
| Performance | Tips & tricks |
| Metadata | Tracking info |
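
The "Time Windows" and "Edge Lists" entries can be pictured with a small sketch. It assumes the tensor layout X[N, d, T] and the edge-list triple (edge_src, edge_dst, edge_t) named in the glossary of this index. The helper name slice_time_window and the toy arrays are hypothetical; the actual recipes live in QUICK_REFERENCE.md.

```python
import numpy as np

def slice_time_window(X, edge_src, edge_dst, edge_t, t0, t1):
    """Restrict features and edges to timepoints t0 <= t < t1.

    Hypothetical helper: assumes X has shape (N, d, T) and that edge_t
    holds the timepoint index of each edge.
    """
    keep = (edge_t >= t0) & (edge_t < t1)
    # Shift edge times so they index into the sliced tensor
    return X[:, :, t0:t1], edge_src[keep], edge_dst[keep], edge_t[keep] - t0

# Toy data standing in for one embryo: 3 cells, 5 features, 4 timepoints
X = np.arange(3 * 5 * 4, dtype=float).reshape(3, 5, 4)
edge_src = np.array([0, 1, 0, 2])
edge_dst = np.array([1, 2, 2, 1])
edge_t = np.array([0, 1, 2, 3])

X_win, src_win, dst_win, t_win = slice_time_window(X, edge_src, edge_dst, edge_t, 1, 3)
print(X_win.shape)  # (3, 5, 2)
print(t_win)        # [0 1]
```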

🏛️ ARCHITECTURE.md (System Design & Data Flow)

Length: ~400 lines | Audience: System designers, advanced users

| Section | Content |
|---------|---------|
| System Diagram | Full pipeline ASCII art |
| Tensor Relationships | Dimension mapping & constraints |
| CSV → Tensor | Example cell trajectory |
| Spatial Edge Logic | Proximity algorithm |
| Lineage Edge Logic | Ancestry naming reconstruction |
| Alive Masking | Birth tracking data flow |
| NPZ Structure | File contents breakdown |
| Sparse Compression | Memory savings analysis |
| Query Patterns | Example SQL-like queries |
| Reproducibility | Determinism guarantees |
| Performance | Benchmarks & scalability |
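
As a rough illustration of the "Spatial Edge Logic" entry, the sketch below builds proximity edges for a single timepoint by thresholding pairwise Euclidean distances. It is a sketch under stated assumptions, not the code in src/epic_preprocess.py: the function name spatial_edges, the strict less-than comparison, and the toy coordinates are hypothetical. The default threshold of 20 mirrors the --distance_threshold value used elsewhere in this index.

```python
import numpy as np

def spatial_edges(xyz, alive, threshold=20.0):
    """Return (src, dst) index arrays for pairs of alive cells closer than threshold.

    xyz:   (N, 3) positions at one timepoint
    alive: (N,) boolean mask of cells present at that timepoint
    Each undirected pair is reported once, with src < dst.
    """
    diff = xyz[:, None, :] - xyz[None, :, :]      # (N, N, 3) pairwise offsets
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # (N, N) Euclidean distances
    close = dist < threshold
    close &= alive[:, None] & alive[None, :]      # ignore cells that are not present
    # Upper triangle only: drops self-loops and duplicate (j, i) pairs
    src, dst = np.nonzero(np.triu(close, k=1))
    return src, dst

xyz = np.array([[0.0, 0.0, 0.0],
                [10.0, 0.0, 0.0],
                [100.0, 0.0, 0.0],
                [5.0, 5.0, 0.0]])
alive = np.array([True, True, True, False])

src, dst = spatial_edges(xyz, alive)
print(list(zip(src.tolist(), dst.tolist())))  # [(0, 1)]
```

The all-pairs distance matrix costs O(N²) memory per timepoint, which is acceptable for a few hundred cells; a spatial index would be the usual alternative for larger N.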

Usage Patterns by Role

🧪 Data Scientist / ML Researcher

Goal: Train models on cell dynamics

Read:

  1. QUICK_REFERENCE.md (Section “Minimal Working Example”)
  2. DATABASE_DOCUMENTATION.md (Sections 3, 6, 9)
  3. scripts/usage_examples.py (Examples 1, 2, 6)

Code pattern:

# Load batch of embryos
from pathlib import Path
import numpy as np

for npz_path in sorted(Path("dataset/processed/by_embryo").glob("*.npz"))[:10]:
    X = np.load(npz_path)["X"]  # (N, 5, T)
    # Train your model...

🔬 Biologist / Domain Expert

Goal: Understand cell behavior, verify data quality

Read:

  1. DATABASE_DOCUMENTATION.md (Sections 1, 4, 8)
  2. QUICK_REFERENCE.md (Sections “Feature Definitions”, “Cell Name ↔ Index”)
  3. scripts/usage_examples.py (Examples 3, 5)

Code pattern:

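# Track a cell's developmental history
# Assumes X (N, 5, T) and alive_mask (N, T) are loaded from one embryo's NPZ,
# and cell_idx is the row index of cell_name (see QUICK_REF, "Cell Name ↔ Index")
cell_name = "ABal"
T = X.shape[2]
trajectory = [X[cell_idx, :, t] for t in range(T) if alive_mask[cell_idx, t]]
# Analyze migration, division timing, etc.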
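
One way to quantify migration from such a trajectory is the total path length over the frames in which the cell is alive. The sketch below assumes that the first three feature rows are x, y, z (the order listed in the glossary) and uses synthetic arrays in place of a loaded embryo; path_length is a hypothetical helper name.

```python
import numpy as np

def path_length(X, alive_mask, cell_idx):
    """Sum of step distances for one cell across its alive timepoints.

    Assumes X has shape (N, d, T) with x, y, z in feature rows 0..2.
    Steps are taken between consecutive alive frames.
    """
    alive_t = np.nonzero(alive_mask[cell_idx])[0]   # timepoints where the cell exists
    xyz = X[cell_idx, :3, :][:, alive_t].T          # (K, 3) positions at those timepoints
    steps = np.diff(xyz, axis=0)                    # (K-1, 3) displacement per step
    return float(np.linalg.norm(steps, axis=1).sum())

# Synthetic embryo: 2 cells, 5 features, 4 timepoints; cell 0 is born at t=1
X = np.zeros((2, 5, 4))
X[0, :3, 1] = [0.0, 0.0, 0.0]
X[0, :3, 2] = [3.0, 4.0, 0.0]
X[0, :3, 3] = [3.0, 4.0, 12.0]
alive_mask = np.zeros((2, 4), dtype=bool)
alive_mask[0, 1:] = True

print(path_length(X, alive_mask, 0))  # 17.0
```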

🛠️ Software Engineer / DevOps

Goal: Process data, manage pipelines, ensure data quality

Read:

  1. DATABASE_DOCUMENTATION.md (Sections 2, 5, 7, 8, 11)
  2. ARCHITECTURE.md (Sections “Data Pipeline”, “Validation”)
  3. scripts/preprocess_dataset.py (runnable script)

Code pattern:

# Run full preprocessing pipeline
python scripts/preprocess_dataset.py \
    --raw_dir dataset/raw \
    --out dataset/processed/by_embryo \
    --distance_threshold 20

# Verify output
python scripts/usage_examples.py  # Run all QC checks
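
A check of the kind a QC pass might include can be sketched in Python as follows. It relies only on two properties stated in this index: alive_mask has shape [N, T], and cells that are not yet born have all-zero features. The function name check_embryo and the synthetic arrays are hypothetical; the project's own checks are described in DB_DOC Section 8 and scripts/usage_examples.py.

```python
import numpy as np

def check_embryo(X, alive_mask):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    if X.ndim != 3:
        return [f"X should be 3-D (N, d, T), got shape {X.shape}"]
    N, d, T = X.shape
    if alive_mask.shape != (N, T):
        return [f"alive_mask shape {alive_mask.shape} != {(N, T)}"]
    problems = []
    if not np.isfinite(X).all():
        problems.append("X contains NaN or inf")
    # Features of masked (unborn) cells are expected to be exactly zero
    masked = ~alive_mask.astype(bool)                 # (N, T)
    if np.any(X.transpose(0, 2, 1)[masked] != 0):     # rows of shape (d,) per masked (cell, t)
        problems.append("non-zero features where alive_mask is False")
    return problems

# Synthetic embryo: cell 0 is unborn at t=0 and correctly zeroed there
X = np.ones((2, 5, 3))
X[0, :, 0] = 0.0
alive_mask = np.array([[False, True, True],
                       [True, True, True]])
print(check_embryo(X, alive_mask))  # []
```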

📚 Student / New Contributor

Goal: Understand the system from scratch

Path:

  1. Week 1: QUICK_REFERENCE.md (all sections)
  2. Week 1: Run python scripts/usage_examples.py
  3. Week 2: Read DATABASE_DOCUMENTATION.md (sections 1–6)
  4. Week 2: Study ARCHITECTURE.md diagrams
  5. Week 3: Dive into src/epic_preprocess.py code

FAQ: Where Do I Find…?

| Question | Answer |
|----------|--------|
| “What are the feature columns?” | DB_DOC §1.2, QUICK_REF |
| “How do I load an embryo?” | QUICK_REF, Example 1 |
| “What’s the shape of X?” | DB_DOC §3.3, QUICK_REF § Dimensions |
| “Why are some cells all zeros?” | DB_DOC §3.3, QUICK_REF § Masking |
| “How are edges encoded?” | DB_DOC §3.3, ARCH |
| “What do lineage edges mean?” | DB_DOC §4.2, ARCH § Lineage |
| “How do I find a cell’s daughters?” | QUICK_REF, Example 3 |
| “How much memory do I need?” | DB_DOC §5.2, ARCH § Scalability |
| “How do I run preprocessing?” | DB_DOC §2.3, README.md |
| “How do I validate my data?” | DB_DOC §8, Example 7 |

Key Concepts Glossary

Core Terms

| Term | Definition | See Also |
|------|------------|----------|
| EPIC | Expression Patterns in Caenorhabditis: fluorescence microscopy dataset of C. elegans development | DB_DOC §1.1 |
| Embryo | One complete developmental recording; stored as one NPZ file | All docs |
| Cell | Individual nucleus tracked through development; identified by C. elegans nomenclature (e.g., “ABal”) | DB_DOC §4.1 |
| Timepoint | One imaging frame; the raw CSV has multiple rows per timepoint (one per cell) | DB_DOC §1.2 |
| Node | Synonym for cell when represented as a graph node; indexed 0..N-1 | All docs |
| Feature | Measured property of a cell: x, y, z, size, blot (d=5 dimensions) | QUICK_REF § Features |
| Edge | Connection between two cells; either spatial (proximity) or lineage (ancestry) | ARCH § Edge Construction |
| Spatial edge | Undirected edge connecting physically close cells (< 20 μm) | DB_DOC §2.2 |
| Lineage edge | Directed edge from parent to daughter cell (inferred from naming) | DB_DOC §2.2 |
| Tensor | Multi-dimensional NumPy array; e.g., X[N, d, T] | DB_DOC §3 |
| Sparse graph | Edge list representation (edge_src, edge_dst, edge_t), as opposed to a dense adjacency A[N,N,T] | DB_DOC §5.1 |
| alive_mask | Boolean tensor [N, T] indicating the timepoints at which each cell has been born and is observed | DB_DOC §3.3 |
| Masked cell | Cell not yet born; has zero features and alive_mask[cell,t]=False | QUICK_REF |
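
To make the sparse and dense representations concrete, the sketch below expands an edge list (edge_src, edge_dst, edge_t) into a dense adjacency tensor A[N, N, T]. The array names follow the glossary. Whether the stored list holds one or both directions of each spatial edge is not stated in this index, so the symmetric flag is an assumption, and edges_to_dense is a hypothetical helper.

```python
import numpy as np

def edges_to_dense(edge_src, edge_dst, edge_t, N, T, symmetric=True):
    """Expand a timestamped edge list into a boolean adjacency tensor of shape (N, N, T)."""
    A = np.zeros((N, N, T), dtype=bool)
    A[edge_src, edge_dst, edge_t] = True
    if symmetric:
        # Mirror each edge so undirected (spatial) edges appear in both directions
        A[edge_dst, edge_src, edge_t] = True
    return A

# Toy edge list: edges 0-1 and 1-2 at t=0, edge 0-1 again at t=1
edge_src = np.array([0, 1, 0])
edge_dst = np.array([1, 2, 1])
edge_t = np.array([0, 0, 1])

A = edges_to_dense(edge_src, edge_dst, edge_t, N=3, T=2)
print(A[:, :, 0].astype(int))
print(int(A.sum()))  # 6
```

The dense tensor holds N·N·T entries regardless of how many edges exist, while the edge list holds three integers per edge, which is the motivation for storing the sparse form. For directed lineage edges, symmetric=False keeps the parent-to-daughter direction.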

Biological Terms

| Term | Definition | See Also |
|------|------------|----------|
| C. elegans | Caenorhabditis elegans: nematode worm; standard model organism | DB_DOC §4 |
| Embryo | Developmental stage from zygote (1 cell) to ~700 cells | DB_DOC §4.3 |
| Cell division | Mitotic division: one mother cell → two daughters; tracked via lineage tree | DB_DOC §4.1 |
| Lineage | Ancestry tree; graph of cell divisions from fertilized egg to final cells | ARCH § Lineage Edges |
| Cell naming | Standard nomenclature encoding lineage: AB→ABa→ABal (each appended letter records one division) | DB_DOC §4.1 |
| Fluorescence | “blot” feature; intensity of fluorescent marker for cell tracking | DB_DOC §1.2 |
| Migration | Cell movement tracked via (x, y, z) coordinates | Example 5 |
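
The naming rule can be turned into a parent lookup, which is the idea behind inferring lineage edges from names. The sketch below is an illustration, not the project's implementation: parent_of and ancestry are hypothetical helpers, and the early founder cells, whose names do not follow the append-one-letter rule, are handled with an explicit table based on the standard C. elegans lineage.

```python
# Founder cells whose mother cannot be obtained by dropping the last letter
FOUNDER_PARENT = {
    "AB": "P0", "P1": "P0",
    "EMS": "P1", "P2": "P1",
    "MS": "EMS", "E": "EMS",
    "C": "P2", "P3": "P2",
    "D": "P3", "P4": "P3",
    "Z2": "P4", "Z3": "P4",
}

def parent_of(name):
    """Return the mother cell's name, or None for the zygote P0 or an unparseable name."""
    if name in FOUNDER_PARENT:
        return FOUNDER_PARENT[name]
    if name == "P0" or len(name) < 2:
        return None
    # Regular names: each appended letter (a/p, l/r, d/v) records one division
    return name[:-1]

def ancestry(name):
    """Return the chain of cell names from the earliest known ancestor down to name."""
    chain = [name]
    while (mother := parent_of(chain[-1])) is not None:
        chain.append(mother)
    return chain[::-1]

print(ancestry("ABal"))  # ['P0', 'AB', 'ABa', 'ABal']
print(ancestry("Ea"))    # ['P0', 'P1', 'EMS', 'E', 'Ea']
```

A directed lineage edge then runs from parent_of(name) to name whenever both names are present in the embryo's cell index.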

Troubleshooting Guide

Problem: “FileNotFoundError: No such file or directory”

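Likely cause: Preprocessing has not been run yet, or the command was run from outside the repository root (the dataset paths are relative)

Solution (replace the example path with the location of your local checkout):

cd d:\Github\Spatio-Temporal-Evolution
python scripts/preprocess_dataset.py --raw_dir dataset/raw --out dataset/processed/by_embryo

See DB_DOC §2.3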


Problem: “KeyError: ‘X’”

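Likely cause: The loaded archive has no array named X, typically because the path points to a file other than a processed embryo NPZ

Solution:

import numpy as np
# Correct: load from processed/by_embryo/
npz = np.load("dataset/processed/by_embryo/CD011605_5a_bright.npz", allow_pickle=True)
print(npz.files)  # List the arrays actually stored in the archive
X = npz["X"]      # Works when "X" appears in the list above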

Problem: Shape mismatch or “index out of range”

Likely cause: Different embryos have different N and T

Solution:

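# Always check shape per embryo
from pathlib import Path
import numpy as np

paths = sorted(Path("dataset/processed/by_embryo").glob("*.npz"))
for npz_path in paths:
    npz = np.load(npz_path)
    N, d, T = npz["X"].shape
    print(f"{npz_path.name}: N={N}, T={T}")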

See Example 6
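
When a model needs a rectangular batch, one option is to zero-pad each embryo to the largest N and T in the batch and keep a mask of the real entries. This is a sketch with synthetic arrays, not a recipe taken from Example 6: pad_batch is a hypothetical helper, and zero-padding is chosen to match the all-zero convention for masked cells.

```python
import numpy as np

def pad_batch(X_list):
    """Stack arrays of shape (N_i, d, T_i) into (B, N_max, d, T_max) plus a validity mask."""
    d = X_list[0].shape[1]
    N_max = max(X.shape[0] for X in X_list)
    T_max = max(X.shape[2] for X in X_list)
    batch = np.zeros((len(X_list), N_max, d, T_max), dtype=X_list[0].dtype)
    valid = np.zeros((len(X_list), N_max, T_max), dtype=bool)
    for b, X in enumerate(X_list):
        N, _, T = X.shape
        batch[b, :N, :, :T] = X
        valid[b, :N, :T] = True   # marks entries that came from real data, not padding
    return batch, valid

# Two synthetic embryos with different N and T
X_list = [np.ones((3, 5, 4)), np.ones((2, 5, 6))]
batch, valid = pad_batch(X_list)
print(batch.shape)       # (2, 3, 5, 6)
print(int(valid.sum()))  # 24
```

The validity mask only separates data from padding; it does not replace each embryo's alive_mask, which would need to be padded the same way.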


Problem: Memory error loading all 260 embryos

Likely cause: Not enough RAM for all data simultaneously

Solution:

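# Load batch-wise
from pathlib import Path
import numpy as np

paths = sorted(Path("dataset/processed/by_embryo").glob("*.npz"))
BATCH_SIZE = 10
for i in range(0, len(paths), BATCH_SIZE):
    # Read only X for the embryos in this batch
    batch = [np.load(p)["X"] for p in paths[i:i + BATCH_SIZE]]
    # Process batch
    del batch  # Free memory before loading the next batch

See DB_DOC §5.2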


Contact & Support


Citation

If you use this preprocessed database, please cite:

@article{sulston1983lineage,
  title={The embryonic cell lineage of the nematode {C}aenorhabditis elegans},
  author={Sulston, JE and Schierenberg, E and White, JG and Thomson, JN},
  journal={Developmental Biology},
  volume={100},
  number={1},
  pages={64--119},
  year={1983}
}

Documentation Index Version: 1.0
Last Updated: April 2026
Total Docs: 4 files (~2000 lines + code examples)