EPIC Database Documentation Index
Quick Navigation
For First-Time Users
- Start here: QUICK_REFERENCE.md (5-minute overview)
- Then read: DATABASE_DOCUMENTATION.md, Sections 1–3 (input, processing, output)
- Try this: Run `python scripts/usage_examples.py` to see working code
For System Design
- ARCHITECTURE.md: Data flow diagrams, tensor construction, edge-building logic
For Implementation
- DATABASE_DOCUMENTATION.md: Code recipes and patterns
- scripts/usage_examples.py: 7 runnable examples
- src/epic_preprocess.py: Core preprocessing functions
For Debugging & Validation
- DATABASE_DOCUMENTATION.md: QC checks and error handling
- QUICK_REFERENCE.md: Common errors & fixes
For Data Analysis
- DATABASE_DOCUMENTATION.md: Manifest structure, batch loading
- scripts/usage_examples.py: Example 6 (batch statistics)
File Manifest
docs/
├── README.md (this file)
├── DATABASE_DOCUMENTATION.md (Comprehensive: 11 sections, ~1000 lines)
├── QUICK_REFERENCE.md (Quick-lookup: 10 sections, ~300 lines)
└── ARCHITECTURE.md (Visual/technical: data flow, queries)

scripts/
├── preprocess_dataset.py (Main preprocessing script)
├── usage_examples.py (7 runnable examples)
└── make_figures.py (Visualization utils)

src/
└── epic_preprocess.py (Core preprocessing functions)
    ├── build_local_index()
    ├── preprocess_epic_file_sparse()
    └── helper functions

dataset/
├── raw/ (260 *.csv files: raw EPIC input)
│   ├── CD011505_end1red_bright.csv
│   ├── CD011605_5a_bright.csv
│   └── ... (258 more)
│
└── processed/by_embryo/ (260 *.npz files: processed output)
    ├── CD011505_end1red_bright.npz
    ├── CD011605_5a_bright.npz
    ├── ... (258 more)
    └── manifest.txt
Document Overview
DATABASE_DOCUMENTATION.md (Comprehensive Reference)
Length: ~1200 lines | Audience: Everyone (but especially developers)
| Section | Content | Use Case |
|---|---|---|
| 1 | Raw EPIC CSV format & columns | Understanding input data |
| 2 | Preprocessing pipeline details | How data is transformed |
| 3 | Output NPZ tensors specifications | Data format reference |
| 4 | Biological interpretation | Understanding C. elegans |
| 5 | Memory & performance | System requirements |
| 6 | Usage patterns | How to load/query data |
| 7 | File manifest & provenance | Tracking & validation |
| 8 | QC & validation checks | Data quality assurance |
| 9 | Common operations & recipes | Code examples |
| 10 | Summary pipeline diagram | High-level overview |
| 11 | Troubleshooting | Error handling |
QUICK_REFERENCE.md (Quick Lookup)
Length: ~300 lines | Audience: Experienced users needing fast reference
| Section | Content |
|---|---|
| Structure | 5-second file layout |
| Dimensions | Tensor shapes at a glance |
| MWE | Minimal working example |
| Features | Column definitions table |
| Masking | Handling unborn cells |
| Mapping | Cell name ↔ index conversions |
| Edge Lists | Sparse → dense example |
| Contact vs. Lineage | Separating edge types |
| Time Windows | Temporal slicing |
| Batch Loading | Multi-embryo patterns |
| QC Checklist | Validation tests |
| Performance | Tips & tricks |
| Metadata | Tracking info |
ARCHITECTURE.md (System Design & Data Flow)
Length: ~400 lines | Audience: System designers, advanced users
| Section | Content |
|---|---|
| System Diagram | Full pipeline ASCII art |
| Tensor Relationships | Dimension mapping & constraints |
| CSV → Tensor | Example cell trajectory |
| Spatial Edge Logic | Proximity algorithm |
| Lineage Edge Logic | Ancestry naming reconstruction |
| Alive Masking | Birth tracking data flow |
| NPZ Structure | File contents breakdown |
| Sparse Compression | Memory savings analysis |
| Query Patterns | Example SQL-like queries |
| Reproducibility | Determinism guarantees |
| Performance | Benchmarks & scalability |
Usage Patterns by Role
Data Scientist / ML Researcher
Goal: Train models on cell dynamics
Read:
- QUICK_REFERENCE.md (Section “Minimal Working Example”)
- DATABASE_DOCUMENTATION.md (Sections 3, 6, 9)
- scripts/usage_examples.py (Examples 1, 2, 6)
Code pattern:
# Load a batch of embryos (here: the first 10 in sorted filename order)
from pathlib import Path
import numpy as np

for npz_path in sorted(Path("dataset/processed/by_embryo").glob("*.npz"))[:10]:
    X = np.load(npz_path)["X"]  # (N, 5, T); N and T differ between embryos
    # Train your model...
Biologist / Domain Expert
Goal: Understand cell behavior, verify data quality
Read:
- DATABASE_DOCUMENTATION.md (Sections 1, 4, 8)
- QUICK_REFERENCE.md (Sections “Feature Definitions”, “Cell Name ↔ Index”)
- scripts/usage_examples.py (Examples 3, 5)
Code pattern:
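# Track a cell's developmental history.
# Assumes X (N, 5, T) and alive_mask (N, T) are already loaded, T = X.shape[2],
# and cell_idx is the node index of cell_name (see the name/index mapping in QUICK_REFERENCE.md).
cell_name = "ABal"
trajectory = [X[cell_idx, :, t] for t in range(T) if alive_mask[cell_idx, t]]
# Analyze migration, division timing, etc.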
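The pattern above can be tried without the real dataset. The following is a minimal sketch on synthetic data, assuming the (N, 5, T) layout and the feature order x, y, z, size, blot listed in this index; the helper name `alive_trajectory` and the toy coordinates are illustrative assumptions, not part of the repository.

```python
import numpy as np

def alive_trajectory(X, alive_mask, cell_idx):
    """Return the (T_alive, d) feature rows of one cell, restricted to alive timepoints."""
    alive_t = np.flatnonzero(alive_mask[cell_idx])  # timepoints where the cell exists
    return X[cell_idx][:, alive_t].T                # (d, T_alive) -> (T_alive, d)

# Synthetic embryo: N=2 cells, d=5 features (assumed order: x, y, z, size, blot), T=4.
N, d, T = 2, 5, 4
X = np.zeros((N, d, T))
alive_mask = np.zeros((N, T), dtype=bool)

# Cell 1 is born at t=1 and then moves 3 units in x and 4 units in y per frame.
for t in range(1, T):
    X[1, :3, t] = [3.0 * t, 4.0 * t, 0.0]
    alive_mask[1, t] = True

traj = alive_trajectory(X, alive_mask, cell_idx=1)

# Per-frame displacement from the first three feature columns (x, y, z).
steps = np.linalg.norm(np.diff(traj[:, :3], axis=0), axis=1)
path_length = steps.sum()
print(traj.shape, path_length)
```

Restricting to alive timepoints first matters because masked entries are stored as zeros, which would otherwise show up as a spurious jump from the origin.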
Software Engineer / DevOps
Goal: Process data, manage pipelines, ensure data quality
Read:
- DATABASE_DOCUMENTATION.md (Sections 2, 5, 7, 8, 11)
- ARCHITECTURE.md (Sections “Data Pipeline”, “Validation”)
- scripts/preprocess_dataset.py (runnable script)
Code pattern:
# Run full preprocessing pipeline
python scripts/preprocess_dataset.py \
    --raw_dir dataset/raw \
    --out dataset/processed/by_embryo \
    --distance_threshold 20
# Verify output
python scripts/usage_examples.py  # Run all QC checks
Student / New Contributor
Goal: Understand the system from scratch
Path:
- Week 1: QUICK_REFERENCE.md (all sections)
- Week 1: Run `python scripts/usage_examples.py`
- Week 2: Read DATABASE_DOCUMENTATION.md (Sections 1–6)
- Week 2: Study ARCHITECTURE.md diagrams
- Week 3: Dive into src/epic_preprocess.py code
FAQ: Where Do I Find…?
Abbreviations: DB_DOC = DATABASE_DOCUMENTATION.md, QUICK_REF = QUICK_REFERENCE.md, ARCH = ARCHITECTURE.md; "Example N" refers to scripts/usage_examples.py.
| Question | Answer |
|---|---|
| “What are the feature columns?” | DB_DOC §1.2, QUICK_REF |
| “How do I load an embryo?” | QUICK_REF, Example 1 |
| “What’s the shape of X?” | DB_DOC §3.3, QUICK_REF § Dimensions |
| “Why are some cells all zeros?” | DB_DOC §3.3, QUICK_REF § Masking |
| “How are edges encoded?” | DB_DOC §3.3, ARCH |
| “What do lineage edges mean?” | DB_DOC §4.2, ARCH § Lineage |
| “How do I find a cell’s daughters?” | QUICK_REF, Example 3 |
| “How much memory do I need?” | DB_DOC §5.2, ARCH § Scalability |
| “How do I run preprocessing?” | DB_DOC §2.3, README.md |
| “How do I validate my data?” | DB_DOC §8, Example 7 |
Key Concepts Glossary
Core Terms
| Term | Definition | See Also |
|---|---|---|
| EPIC | Expression Patterns in Caenorhabditis: fluorescence microscopy dataset of C. elegans development | DB_DOC §1.1 |
| Embryo | One complete developmental recording; stored as one NPZ file | All docs |
| Cell | Individual nucleus tracked through development; identified by C. elegans nomenclature (e.g., “ABal”) | DB_DOC §4.1 |
| Timepoint | One frame of video; multiple rows per timepoint (one per cell) | DB_DOC §1.2 |
| Node | Synonym for cell when represented as graph node; indexed 0..N-1 | All docs |
| Feature | Measured property of a cell: x, y, z, size, blot (d=5 dimensions) | QUICK_REF § Features |
| Edge | Connection between two cells; either spatial (proximity) or lineage (ancestry) | ARCH § Edge Construction |
| Spatial edge | Undirected edge connecting physically close cells (< 20 μm) | DB_DOC §2.2 |
| Lineage edge | Directed edge from parent to daughter cell (inferred from naming) | DB_DOC §2.2 |
| Tensor | Multi-dimensional NumPy array; e.g., X[N, d, T] | DB_DOC §3 |
| Sparse graph | Edge list representation: (edge_src, edge_dst, edge_t) vs. dense adjacency A[N,N,T] | DB_DOC §5.1 |
| alive_mask | Boolean tensor [N, T] indicating when each cell is born and observed | DB_DOC §3.3 |
| Masked cell | Cell not yet born; has zero features and alive_mask[cell,t]=False | QUICK_REF |
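To make the spatial edge, sparse graph, and alive_mask entries concrete, here is a minimal sketch that builds a sparse spatial edge list from positions and expands it to a dense adjacency tensor. It assumes that x, y, z occupy the first three feature rows of X and are in the same unit as the threshold; the function names `spatial_edges` and `to_dense` are illustrative, and the actual logic in src/epic_preprocess.py may differ.

```python
import numpy as np

def spatial_edges(X, alive_mask, threshold=20.0):
    """Sparse spatial edges: ordered pairs of alive cells closer than `threshold` at each t."""
    src, dst, ts = [], [], []
    T = X.shape[2]
    for t in range(T):
        alive = np.flatnonzero(alive_mask[:, t])      # indices of cells present at t
        pos = X[alive, :3, t]                         # (n_alive, 3) coordinates
        dist = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
        close = (dist < threshold) & ~np.eye(len(alive), dtype=bool)  # drop self-pairs
        i, j = np.nonzero(close)
        src.extend(alive[i])
        dst.extend(alive[j])
        ts.extend([t] * len(i))
    return np.array(src, dtype=int), np.array(dst, dtype=int), np.array(ts, dtype=int)

def to_dense(edge_src, edge_dst, edge_t, N, T):
    """Expand an edge list into a boolean adjacency tensor A[N, N, T]."""
    A = np.zeros((N, N, T), dtype=bool)
    A[edge_src, edge_dst, edge_t] = True
    return A

# Toy embryo: 3 cells, 2 timepoints. Cell 2 is not yet born at t=0.
N, T = 3, 2
X = np.zeros((N, 5, T))
alive_mask = np.array([[True, True], [True, True], [False, True]])
X[1, 0, 0] = 10.0    # t=0: cells 0 and 1 are 10 units apart (edge)
X[1, 0, 1] = 30.0    # t=1: cells 0 and 1 are 30 units apart (no edge)
X[2, 0, 1] = 100.0   # t=1: cell 2 is far from both (no edge)

edge_src, edge_dst, edge_t = spatial_edges(X, alive_mask, threshold=20.0)
A = to_dense(edge_src, edge_dst, edge_t, N, T)
print(list(zip(edge_src, edge_dst, edge_t)))
```

Each undirected edge is stored in both directions here, which keeps the dense tensor symmetric. Unborn cells are excluded before the distance computation; otherwise their zeroed coordinates would all sit at the origin and appear to be in contact.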
Biological Terms
| Term | Definition | See Also |
|---|---|---|
| C. elegans | Caenorhabditis elegans: nematode worm; standard model organism | DB_DOC §4 |
| Embryo | Developmental stage from zygote (1 cell) to ~700 cells | DB_DOC §4.3 |
| Cell division | Mitosis: one mother → two daughters; tracked via lineage tree | DB_DOC §4.1 |
| Lineage | Ancestry tree; graph of cell divisions from fertilized egg to final cells | ARCH § Lineage Edges |
| Cell naming | Standard nomenclature encoding lineage: AB→ABa→ABal (each appended letter records one division, e.g. a/p = anterior/posterior, l/r = left/right) | DB_DOC §4.1 |
| Fluorescence | “blot” feature; intensity of fluorescent marker for cell tracking | DB_DOC §1.2 |
| Migration | Cell movement tracked via (x, y, z) coordinates | Example 5 |
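Because the naming scheme encodes ancestry, parent-to-daughter edges can be recovered from names alone. The sketch below illustrates the idea under the standard Sulston nomenclature: a daughter's name is its mother's name plus one letter, except for the early founder cells, which are listed explicitly. The helper names `parent_name` and `lineage_edges` are hypothetical, and the repository's own reconstruction (see ARCH § Lineage Edges) may handle more cases.

```python
# Founder cells do not follow the "append one letter" rule, so map them explicitly.
FOUNDER_PARENT = {
    "AB": "P0", "P1": "P0",
    "EMS": "P1", "P2": "P1",
    "MS": "EMS", "E": "EMS",
    "C": "P2", "P3": "P2",
    "D": "P3", "P4": "P3",
    "Z2": "P4", "Z3": "P4",
}

def parent_name(name):
    """Return the mother's name, or None for the zygote P0."""
    if name == "P0":
        return None
    if name in FOUNDER_PARENT:
        return FOUNDER_PARENT[name]
    return name[:-1]  # e.g. "ABal" -> "ABa"

def lineage_edges(names):
    """Directed (parent_idx, daughter_idx) pairs for every cell whose mother is in `names`."""
    index = {name: i for i, name in enumerate(names)}
    edges = []
    for name, child_idx in index.items():
        parent = parent_name(name)
        if parent in index:
            edges.append((index[parent], child_idx))
    return edges

names = ["P0", "AB", "P1", "ABa", "ABp", "ABal", "EMS"]
edges = lineage_edges(names)
print(edges)
```

Cells whose mother was never tracked simply get no incoming edge, so the result is a forest rather than a single tree when the recording starts after the first division.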
Troubleshooting Guide
Problem: “FileNotFoundError: No such file or directory”
Likely cause: You haven’t run preprocessing yet
Solution:
cd d:\Github\Spatio-Temporal-Evolution
python scripts/preprocess_dataset.py --raw_dir dataset/raw --out dataset/processed/by_embryo
See DB_DOC §2.3
Problem: “KeyError: ‘X’”
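Likely cause: The file you loaded is not a processed embryo archive (for example, an .npz written by another tool), so it contains no array named X
Solution:
import numpy as np
# Correct: load from processed/by_embryo/
npz = np.load("dataset/processed/by_embryo/CD011605_5a_bright.npz", allow_pickle=True)
print(npz.files)  # Lists the arrays stored in the archive; "X" should be among them
X = npz["X"]  # Now works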
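A small pre-flight check can turn a bare KeyError into a readable report. This is a sketch only: the helper `missing_keys` is hypothetical, and apart from "X" the key names used in the demo (alive_mask, edge_src) are assumptions taken from the glossary in this index, not a confirmed list of what each NPZ file contains.

```python
import os
import tempfile
import numpy as np

def missing_keys(npz_path, expected=("X",)):
    """Return the expected array names that are absent from an NPZ archive."""
    with np.load(npz_path, allow_pickle=True) as npz:
        return [key for key in expected if key not in npz.files]

# Demo on a throwaway archive so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_embryo.npz")
    np.savez(demo_path, X=np.zeros((2, 5, 3)), alive_mask=np.ones((2, 3), dtype=bool))
    ok = missing_keys(demo_path, expected=("X", "alive_mask"))
    bad = missing_keys(demo_path, expected=("X", "edge_src"))

print(ok, bad)
```

An empty list means every expected array is present; anything else names exactly what is missing before any indexing is attempted.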
Problem: Shape mismatch or “index out of range”
Likely cause: Different embryos have different N and T
Solution:
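# Always check the shape per embryo
from pathlib import Path
import numpy as np

paths = sorted(Path("dataset/processed/by_embryo").glob("*.npz"))
for npz_path in paths:
    npz = np.load(npz_path)
    N, d, T = npz["X"].shape
    print(f"{npz_path.name}: N={N}, T={T}")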
See Example 6
Problem: Memory error loading all 260 embryos
Likely cause: Not enough RAM for all data simultaneously
Solution:
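# Load batch-wise
# load_embryos(start, stop) is a placeholder for your own loader, not a function shown in this index
BATCH_SIZE = 10
for i in range(0, 260, BATCH_SIZE):
    batch = load_embryos(i, min(i + BATCH_SIZE, 260))
    # Process batch
    del batch  # Free memory before loading the next batch
See DB_DOC §5.2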
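One way to fill in the loader used above is a generator that reads only one batch of files at a time. The following is a sketch under the assumption that each archive stores its feature tensor under the key "X"; the name `iter_embryo_batches` and the temporary demo files are illustrative and not part of the repository.

```python
import os
import tempfile
import numpy as np

def iter_embryo_batches(paths, batch_size=10):
    """Yield lists of (filename, X) pairs, holding at most `batch_size` embryos in memory."""
    for start in range(0, len(paths), batch_size):
        batch = []
        for path in paths[start:start + batch_size]:
            with np.load(path) as npz:          # file handle is closed after reading X
                batch.append((os.path.basename(path), npz["X"]))
        yield batch

# Demo with five tiny synthetic archives of different sizes (N = 1..5 cells).
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for k in range(5):
        p = os.path.join(tmp, f"embryo_{k}.npz")
        np.savez(p, X=np.zeros((k + 1, 5, 4)))
        paths.append(p)

    batch_sizes = []
    total_cells = 0
    for batch in iter_embryo_batches(sorted(paths), batch_size=2):
        batch_sizes.append(len(batch))
        total_cells += sum(X.shape[0] for _, X in batch)

print(batch_sizes, total_cells)
```

Because the generator rebinds `batch` on every iteration, the previous batch becomes eligible for garbage collection as long as the caller does not keep its own references to the arrays.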
Contact & Support
- Code issues: Check DB_DOC §11
- Data quality: See DB_DOC §8 for validation checklist
- Biological questions: Refer to DB_DOC §4
Citation
If you use this preprocessed database, please cite:
@article{sulston1983lineage,
  title={The embryonic cell lineage of the nematode {C}aenorhabditis elegans},
  author={Sulston, JE and Schierenberg, E and White, JG and Thomson, JN},
  journal={Developmental Biology},
  volume={100},
  number={1},
  pages={64--119},
  year={1983}
}
Documentation Index Version: 1.0
Last Updated: April 2026
Total Docs: 4 files (~2000 lines + code examples)