{ "cells": [ { "cell_type": "markdown", "id": "18621e33", "metadata": {}, "source": [ "# ZEDprofiler Walkthrough\n", "\n", "ZEDprofiler is a CPU-first toolkit for extracting morphological features from 3D fluorescence microscopy images.\n", "It is designed for high-content and high-throughput workflows where images are volumetric z-stacks with multiple spectral channels and segmentation labels.\n", "\n", "## What this walkthrough covers\n", "\n", "By the end of this notebook you will have:\n", "\n", "1. **Generated synthetic 3D image data**: two channels (a compact nuclear stain and a diffuse marker with per-object colocalization) and a ground-truth label mask\n", "2. **Explored the data interactively**: browsed the z-stack volume with a slice-by-slice viewer\n", "3. **Loaded image sets and objects**: configured `ImageSetLoader`, `ObjectLoader`, and `TwoObjectLoader` to feed data into the pipeline\n", "4. **Extracted six feature classes**: colocalization, granularity, intensity, neighbors, texture, and volume/size/shape\n", "5. **Merged all features** into a single profile DataFrame with one row per object, following the ZEDprofiler naming convention\n", "6. **Normalized and feature selected** the profile using [`Pycytominer`](https://pycytominer.readthedocs.io): z-scoring features and removing low-variance and redundant columns\n", "7. **Visualized pairwise object similarity**: computed and plotted a Pearson correlation heatmap across all objects\n", "\n", "Each section includes a brief description of what is being computed and what the key parameters control." ] }, { "cell_type": "markdown", "id": "59b30dd9", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 1, "id": "22324587", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "annotation=NoneType required=False default_factory=list\n" ] } ], "source": "import warnings\nimport logging\nimport pathlib\n\nimport numpy as np\nimport pandas as pd\nimport tifffile\nfrom itables import show\n\nimport zedprofiler\nfrom zedprofiler.IO import loading_classes\n\nlogging.basicConfig(level=logging.WARNING)\nlogging.getLogger(\"matplotlib\").setLevel(logging.WARNING)\nwarnings.filterwarnings(\"ignore\", category=SyntaxWarning, module=\"mahotas\")" }, { "cell_type": "markdown", "id": "1aec073d", "metadata": {}, "source": [ "## Define Paths and Parameters\n", "\n", "This walkthrough uses synthetically generated 3D arrays so it runs anywhere without real microscopy data.\n", "Two random intensity channels and a label mask are generated from spherical objects placed at known positions.\n", "\n", "The `channel_mapping` dictionary maps logical names to substrings found in your image filenames: ZEDprofiler uses these keys to identify which file corresponds to which channel or segmentation label.\n", "\n", "*Expand the cells below to see how paths are configured and how the synthetic data is generated.*" ] }, { "cell_type": "code", "execution_count": 2, "id": "8a34d8cf", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "notebook_root = pathlib.Path.cwd().resolve()\n", "image_set_path = (notebook_root / \"test_data\" / \"dummy_image_set\").resolve()\n", "label_set_path = (notebook_root / \"test_data\" / \"dummy_label_set\").resolve()\n", "\n", "channel_mapping = {\n", " \"Channel1\": \"channel1\",\n", " \"Channel2\": \"channel2\",\n", " \"Nuclei\": \"nuclei_labels\",\n", "}\n", "\n", "CHANNEL1 = \"Channel1\"\n", "CHANNEL2 = \"Channel2\"\n", "COMPARTMENT = \"Nuclei\"" ] }, { "cell_type": "code", "execution_count": 3, "id": "3aad616f", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generated volume: (80, 80, 30), objects: 6\n", "Channel 1: nuclear stain (compact)\n", "Channel 2: diffuse stain (per-object colocalization strength varies)\n" ] } ], "source": [ "# --- synthetic data parameters ---\n", "VOLUME_SHAPE = (80, 80, 30) # (x, y, z)\n", "N_OBJECTS = 6\n", "MIN_RADIUS, MAX_RADIUS = 6, 12\n", "rng = np.random.default_rng(seed=42)\n", "\n", "# --- build label mask and intensity channels ---\n", "# Channel 1: nuclear stain: compact, bright signal confined to each nucleus\n", "# Channel 2: diffuse stain: partially overlaps each nucleus to a varying degree,\n", "# simulating a cytoplasmic/organelle marker with per-object colocalization\n", "nuclei_labels = np.zeros(VOLUME_SHAPE, dtype=np.uint16)\n", "channel1_array = np.zeros(VOLUME_SHAPE, dtype=np.float32)\n", "channel2_array = np.zeros(VOLUME_SHAPE, dtype=np.float32)\n", "\n", "xs = np.arange(VOLUME_SHAPE[0])\n", "ys = np.arange(VOLUME_SHAPE[1])\n", "zs = np.arange(VOLUME_SHAPE[2])\n", "grid = np.stack(np.meshgrid(xs, ys, zs, indexing=\"ij\"), axis=-1)\n", "\n", "centers, radii = [], []\n", "for obj_id in range(1, N_OBJECTS + 1):\n", " for _ in range(50):\n", " r = rng.integers(MIN_RADIUS, MAX_RADIUS)\n", " cx = rng.integers(r + 2, VOLUME_SHAPE[0] - r - 2)\n", " cy = rng.integers(r + 2, VOLUME_SHAPE[1] - r - 2)\n", " cz = rng.integers(r + 2, VOLUME_SHAPE[2] - r - 2)\n", " center = np.array([cx, cy, cz])\n", " if all(\n", " np.linalg.norm(center - c) > r + rv + 4 for c, rv in zip(centers, radii)\n", " ):\n", " break\n", " centers.append(center)\n", " radii.append(r)\n", "\n", " dist = np.linalg.norm(grid - center, axis=-1)\n", " nucleus_mask = dist <= r\n", " nuclei_labels[nucleus_mask] = obj_id\n", "\n", " # Channel 1: compact nuclear signal with fine texture\n", " ch1_base = rng.uniform(0.7, 1.0)\n", " channel1_array[nucleus_mask] = ch1_base + rng.normal(\n", " 0, 0.06, size=nucleus_mask.sum()\n", " )\n", "\n", " # Channel 2: diffuse stain with per-object colocalization strength\n", " # coloc_strength controls how much signal sits inside vs outside the nucleus\n", " coloc_strength = rng.uniform(0.15, 0.95)\n", " diffuse_r = r * rng.uniform(1.3, 2.0)\n", " diffuse_mask = (dist <= diffuse_r) & ~nucleus_mask\n", "\n", " ch2_base = rng.uniform(0.4, 0.9)\n", " channel2_array[nucleus_mask] += coloc_strength * ch2_base + rng.normal(\n", " 0, 0.08, size=nucleus_mask.sum()\n", " )\n", " if diffuse_mask.any():\n", " channel2_array[diffuse_mask] += (\n", " 1 - coloc_strength\n", " ) * ch2_base * 0.6 + rng.normal(0, 0.06, size=diffuse_mask.sum())\n", "\n", "# clip negatives and scale to uint8\n", "channel1_array = np.clip(channel1_array / channel1_array.max() * 255, 0, 255).astype(\n", " np.uint8\n", ")\n", "channel2_array = np.clip(channel2_array / channel2_array.max() * 255, 0, 255).astype(\n", " np.uint8\n", ")\n", "\n", "# save to disk\n", "image_set_path.mkdir(parents=True, exist_ok=True)\n", "label_set_path.mkdir(parents=True, exist_ok=True)\n", "tifffile.imwrite(image_set_path / \"channel1.tif\", channel1_array)\n", "tifffile.imwrite(image_set_path / \"channel2.tif\", channel2_array)\n", "tifffile.imwrite(label_set_path / \"nuclei_labels.tif\", nuclei_labels)\n", "\n", "print(f\"Generated volume: {VOLUME_SHAPE}, objects: {N_OBJECTS}\")\n", "print(\"Channel 1: nuclear stain (compact)\")\n", "print(\"Channel 2: diffuse stain (per-object colocalization strength varies)\")" ] }, { "cell_type": "markdown", "id": "preview-md", "metadata": {}, "source": [ "## Preview Synthetic Data\n", "\n", "The interactive viewer below shows a z-slice browser for all three arrays. Use the slider to step through the volume and see how the objects and their labels change across z-planes.\n", "\n", "*Expand the cell below to see how the interactive viewer is built.*" ] }, { "cell_type": "code", "execution_count": 4, "id": "preview-code", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import plotly.graph_objects as go\n", "from IPython.display import HTML\n", "from plotly.subplots import make_subplots\n", "\n", "n_slices = channel1_array.shape[2]\n", "\n", "fig = make_subplots(\n", " rows=1,\n", " cols=3,\n", " subplot_titles=(\"Channel 1\", \"Channel 2\", \"Labels\"),\n", " horizontal_spacing=0.05,\n", ")\n", "\n", "frames = []\n", "for z in range(n_slices):\n", " frames.append(\n", " go.Frame(\n", " data=[\n", " go.Heatmap(\n", " z=channel1_array[:, :, z].T,\n", " colorscale=\"gray\",\n", " zmin=0,\n", " zmax=255,\n", " showscale=False,\n", " ),\n", " go.Heatmap(\n", " z=channel2_array[:, :, z].T,\n", " colorscale=\"gray\",\n", " zmin=0,\n", " zmax=255,\n", " showscale=False,\n", " ),\n", " go.Heatmap(\n", " z=nuclei_labels[:, :, z].T,\n", " colorscale=\"turbo\",\n", " zmin=0,\n", " zmax=N_OBJECTS,\n", " showscale=False,\n", " ),\n", " ],\n", " traces=[0, 1, 2],\n", " name=str(z),\n", " )\n", " )\n", "\n", "mid = n_slices // 2\n", "fig.add_trace(\n", " go.Heatmap(\n", " z=channel1_array[:, :, mid].T,\n", " colorscale=\"gray\",\n", " zmin=0,\n", " zmax=255,\n", " showscale=False,\n", " ),\n", " row=1,\n", " col=1,\n", ")\n", "fig.add_trace(\n", " go.Heatmap(\n", " z=channel2_array[:, :, mid].T,\n", " colorscale=\"gray\",\n", " zmin=0,\n", " zmax=255,\n", " showscale=False,\n", " ),\n", " row=1,\n", " col=2,\n", ")\n", "fig.add_trace(\n", " go.Heatmap(\n", " z=nuclei_labels[:, :, mid].T,\n", " colorscale=\"turbo\",\n", " zmin=0,\n", " zmax=N_OBJECTS,\n", " showscale=False,\n", " ),\n", " row=1,\n", " col=3,\n", ")\n", "\n", "fig.frames = frames\n", "\n", "fig.update_layout(\n", " title=f\"Synthetic 3D volume: z-slice browser ({n_slices} planes)\",\n", " height=380,\n", " sliders=[\n", " {\n", " \"active\": mid,\n", " \"steps\": [\n", " {\n", " \"args\": [[str(z)], {\"frame\": {\"duration\": 0}, \"mode\": \"immediate\"}],\n", " \"label\": str(z),\n", " \"method\": \"animate\",\n", " }\n", " for z in range(n_slices)\n", " ],\n", " \"x\": 0.1,\n", " \"len\": 0.8,\n", " \"y\": -0.05,\n", " \"currentvalue\": {\"prefix\": \"z = \", \"visible\": True, \"xanchor\": \"center\"},\n", " }\n", " ],\n", " updatemenus=[\n", " {\n", " \"type\": \"buttons\",\n", " \"showactive\": False,\n", " \"y\": -0.15,\n", " \"x\": 0.5,\n", " \"xanchor\": \"center\",\n", " \"buttons\": [\n", " {\n", " \"label\": \"▶ Play\",\n", " \"method\": \"animate\",\n", " \"args\": [None, {\"frame\": {\"duration\": 80}, \"fromcurrent\": True}],\n", " },\n", " {\n", " \"label\": \"⏸ Pause\",\n", " \"method\": \"animate\",\n", " \"args\": [[None], {\"frame\": {\"duration\": 0}, \"mode\": \"immediate\"}],\n", " },\n", " ],\n", " }\n", " ],\n", " margin={\"t\": 60, \"b\": 120},\n", ")\n", "fig.update_yaxes(scaleanchor=None)\n", "HTML(fig.to_html(include_plotlyjs=\"cdn\", full_html=False))" ] }, { "cell_type": "markdown", "id": "ebc65e08", "metadata": {}, "source": [ "## Loading Objects and Image Sets\n", "\n", "ZEDprofiler uses three loader classes to organize image data before featurization:\n", "\n", "- **`ImageSetConfig`**: declares which keys are raw intensity channels (`raw_image_key_name`) and which are segmentation labels (`label_key_name`). This config is passed to the loader so it knows how to partition the channel mapping.\n", "- **`ImageSetLoader`**: loads the full image set from disk (or from in-memory arrays) and stores each channel and label as a numpy array. The `anisotropy_spacing` parameter describes the physical voxel size in `[z, y, x]` order and is used by downstream features (e.g. neighbors, surface area) that require real-world distances.\n", "- **`ObjectLoader`**: a lightweight view over one channel + one compartment within a loaded image set. Most featurization functions accept an `ObjectLoader`.\n", "- **`TwoObjectLoader`**: a paired view over two channels within one compartment, required by colocalization which compares signal across two channels simultaneously." ] }, { "cell_type": "code", "execution_count": 5, "id": "1ac85a30", "metadata": {}, "outputs": [], "source": [ "isc = loading_classes.ImageSetConfig(\n", " image_set_name=\"test_image_set\",\n", " raw_image_key_name=[CHANNEL1, CHANNEL2],\n", " label_key_name=[COMPARTMENT],\n", ")\n", "\n", "image_set_loader = loading_classes.ImageSetLoader(\n", " anisotropy_spacing=[1, 0.1, 0.1],\n", " channel_mapping=channel_mapping,\n", " image_set_path=image_set_path,\n", " label_set_path=label_set_path,\n", " config=isc,\n", " # we can also directly pass arrays.\n", " # This is what ZEDprofiler expects when the images are multichannel.\n", " # Load the image arrays in and pass them in directly instead\n", " # of loading from path.\n", " image_set_array=None,\n", " label_set_array=None,\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "id": "2e6bc51d", "metadata": {}, "outputs": [], "source": [ "ol = loading_classes.ObjectLoader(\n", " image_set_loader=image_set_loader,\n", " channel_name=CHANNEL1,\n", " compartment_name=COMPARTMENT,\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "id": "7484fea2", "metadata": {}, "outputs": [], "source": [ "coloc_loader = loading_classes.TwoObjectLoader(\n", " image_set_loader=image_set_loader,\n", " compartment=COMPARTMENT,\n", " channel1=CHANNEL1,\n", " channel2=CHANNEL2,\n", ")" ] }, { "cell_type": "markdown", "id": "e426cff2", "metadata": {}, "source": [ "## Colocalization\n", "\n", "![Colocalization](../media/3D_Colocalization_Features.png)\n", "\n", "\n", "Colocalization measures the degree to which two fluorescent channels occupy the same spatial locations within a segmented compartment.\n", "This is useful for quantifying co-expression or co-localization of proteins in 3D cell images.\n", "\n", "ZEDprofiler computes several colocalization coefficients per object pair:\n", "\n", "- **Pearson correlation**: linear correlation of pixel intensities between the two channels\n", "- **Manders coefficients (M1, M2)**: fraction of channel 1 (or 2) signal that overlaps with channel 2 (or 1)\n", "- **Costes thresholded Manders**: Manders coefficients computed above a statistically determined threshold\n", "- **Overlap coefficient**: combined spatial and intensity overlap\n", "- **Rank-weighted colocalization (RWC)**: robust to intensity differences, weights overlap by rank\n", "\n", "Key parameters:\n", "- `thr`: minimum pixel intensity threshold below which pixels are excluded\n", "- `fast_costes`: `\"Accurate\"` uses the full bisection algorithm for the Costes threshold; `\"Fast\"` uses a linear approximation" ] }, { "cell_type": "code", "execution_count": 8, "id": "ac5b5732", "metadata": {}, "outputs": [], "source": [ "colocalization_features = zedprofiler.colocalization.compute_colocalization(\n", " two_object_loader=coloc_loader,\n", " thr=15,\n", " fast_costes=\"Accurate\",\n", " channel1=CHANNEL1,\n", " channel2=CHANNEL2,\n", ")" ] }, { "cell_type": "markdown", "id": "5e956a61", "metadata": {}, "source": [ "## Granularity\n", "\n", "![Granularity](../media/3D_Granularity_Features.png)\n", "\n", "\n", "Granularity characterizes the texture of an object at multiple spatial scales by repeatedly applying morphological erosions and measuring how much signal is removed at each scale.\n", "The result is a spectrum of values: one per scale: that captures whether an object's intensity is concentrated in fine, coarse, or mixed-scale structures.\n", "This is analogous to CellProfiler's granularity module and is particularly useful for distinguishing organelle morphology.\n", "\n", "Key parameters:\n", "- `granular_spectrum_length`: number of erosion scales to compute (default 16); higher values capture coarser structure\n", "- `radius`: radius of the structuring element used for background subtraction before the spectrum is computed\n", "- `subsample_size`: fraction of the image used during spectrum computation (speeds up processing on large volumes)\n", "- `image_sample_size`: fraction used for background estimation\n", "- `mask_threshold`: threshold applied after interpolation to define the object mask" ] }, { "cell_type": "code", "execution_count": 9, "id": "d1df478b", "metadata": {}, "outputs": [], "source": [ "granularity_features = zedprofiler.granularity.compute_granularity(\n", " object_loader=ol,\n", " radius=10, # radius of the structuring element for background\n", " # removal (CellProfiler default)\n", " granular_spectrum_length=16, # range of the granular spectrum\n", " subsample_size=0.25, # subsample the image for faster processing\n", " image_sample_size=0.25, # further subsample for background removal\n", " mask_threshold=0.9, # threshold for determining mask after interpolation\n", " verbose=False, # print out intermediate steps and values for debugging\n", ")" ] }, { "cell_type": "markdown", "id": "58b5ebe9", "metadata": {}, "source": [ "## Intensity\n", "\n", "![Intensity](../media/3D_Intensity_Features.png)\n", "\n", "\n", "Intensity features summarize the distribution of pixel intensities within each segmented object.\n", "They provide a compact description of how bright an object is, how variable its signal is, and how that signal is spatially distributed relative to the object's edges.\n", "\n", "Features computed per object include:\n", "- **Mean, median, standard deviation, max, min** of voxel intensities\n", "- **Integrated intensity**: sum of all voxel intensities (correlated with object volume)\n", "- **Mass displacement**: distance between the object's geometric centroid and its intensity-weighted centroid, indicating asymmetric signal distribution\n", "- **Lower/upper quartile intensities**: robust measures of the intensity distribution" ] }, { "cell_type": "code", "execution_count": 10, "id": "6159fc35", "metadata": {}, "outputs": [], "source": [ "intensity_features = zedprofiler.intensity.compute_intensity(\n", " object_loader=ol,\n", ")" ] }, { "cell_type": "markdown", "id": "45b0dbbd", "metadata": {}, "source": [ "## Neighbors\n", "\n", "![Neighbors](../media/3D_Neighbors_Features.png)\n", "\n", "\n", "Neighbor features describe the spatial relationships between objects within a compartment.\n", "For each object, ZEDprofiler identifies all other objects whose centroids fall within a given distance and computes summary statistics over that neighborhood.\n", "\n", "Features include counts of neighbors, mean and standard deviation of distances to neighbors, and the angle between each object and its closest neighbor.\n", "\n", "Key parameters:\n", "- `distance_threshold`: maximum centroid-to-centroid distance (in voxels) for two objects to be considered neighbors\n", "- `anisotropy_factor`: the ratio of z-spacing to xy-spacing (`anisotropy_spacing[0] / anisotropy_spacing[1]`); used to rescale z-distances to physical units before computing Euclidean distances, ensuring accurate neighbor relationships in anisotropic volumes" ] }, { "cell_type": "code", "execution_count": 11, "id": "b9096f92", "metadata": {}, "outputs": [], "source": [ "neighbors_features = zedprofiler.neighbors.compute_neighbors(\n", " object_loader=ol,\n", " distance_threshold=10,\n", " anisotropy_factor=image_set_loader.anisotropy_spacing[0]\n", " / image_set_loader.anisotropy_spacing[\n", " 1\n", " ], # z to xy spacing ratio for distance calculation\n", ")" ] }, { "cell_type": "markdown", "id": "fd7bde2a", "metadata": {}, "source": [ "## Texture\n", "\n", "![Texture](../media/3D_Texture_Features.png)\n", "\n", "\n", "Texture features capture the spatial arrangement of intensities within an object using the gray-level co-occurrence matrix (GLCM).\n", "The GLCM records how often pairs of pixel intensities occur at a given spatial offset, and summary statistics computed from it describe properties like smoothness, contrast, and periodicity.\n", "\n", "ZEDprofiler computes GLCM-based features (e.g. angular second moment, contrast, correlation, entropy, homogeneity) at multiple offsets and directions in 3D, then aggregates across directions.\n", "\n", "Key parameters:\n", "- `distance`: the pixel offset at which co-occurrence is measured; larger values capture coarser texture patterns\n", "- `grayscale`: the number of intensity bins used to quantize the image before computing the GLCM; fewer bins are faster but lose intensity resolution" ] }, { "cell_type": "code", "execution_count": 12, "id": "8ba3d9e5", "metadata": {}, "outputs": [], "source": [ "texture_features = zedprofiler.texture.compute_texture(\n", " object_loader=ol,\n", " distance=3,\n", " grayscale=256,\n", ")" ] }, { "cell_type": "markdown", "id": "9d26cb54", "metadata": {}, "source": [ "## Volume, Size, and Shape\n", "\n", "![Volume, Size, and Shape](../media/3D_Volume_Shape_Features.png)\n", "\n", "\n", "Volume, size, and shape features characterize the 3D geometry of each segmented object.\n", "They are computed using `skimage.measure.regionprops` on the label image, with surface area calculated via marching cubes.\n", "\n", "Features per object include:\n", "- **Volume**: total number of voxels in the object\n", "- **Bounding box**: min/max extents in x, y, z\n", "- **BboxVolume**: volume of the axis-aligned bounding box\n", "- **Centroid**: geometric center in x, y, z\n", "- **Extent**: ratio of object volume to bounding box volume (compactness)\n", "- **Euler number**: topological measure of holes/tunnels in the object\n", "- **Equivalent diameter**: diameter of a sphere with the same volume\n", "- **Surface area**: estimated via marching cubes on the object boundary" ] }, { "cell_type": "code", "execution_count": 13, "id": "742427df", "metadata": {}, "outputs": [], "source": [ "volume_size_shape_features = zedprofiler.volumesizeshape.compute_volume_size_shape(\n", " image_set_loader=image_set_loader,\n", " object_loader=ol,\n", ")" ] }, { "cell_type": "markdown", "id": "d266218a", "metadata": {}, "source": [ "## Merging All Features into a Single Profile\n", "\n", "Each featurization function returns a dictionary that can be converted to a DataFrame.\n", "All DataFrames share two index columns: `Metadata_Experiment_ImageSet` and `Metadata_Object_ObjectID`: which are used as merge keys.\n", "\n", "The final merged DataFrame contains one row per object and one column per feature, following the ZEDprofiler naming convention:\n", "`{Compartment}_{Channel}_{FeatureType}_{Measurement}`\n", "\n", "See the [Feature Schema](../features/Feature_Naming_Convention.md) for the full specification of how columns are named and structured.\n", "\n", "The cell below also estimates the total expected feature count when scaling to a full experiment with multiple compartments and channels, giving a sense of the profile dimensionality at scale." ] }, { "cell_type": "code", "execution_count": 14, "id": "9c482ad6", "metadata": {}, "outputs": [], "source": [ "coloc_df = pd.DataFrame(colocalization_features)\n", "granularity_df = pd.DataFrame(granularity_features)\n", "intensity_df = pd.DataFrame(intensity_features)\n", "neighbors_df = pd.DataFrame(neighbors_features)\n", "texture_df = pd.DataFrame(texture_features)\n", "volume_size_shape_df = pd.DataFrame(volume_size_shape_features)\n", "# subtract 2 from each df shape to account for image_set and object_id\n", "# columns that will be used for merging\n", "size_of_all_dfs = [\n", " df.shape[1] - 2\n", " for df in [\n", " coloc_df,\n", " granularity_df,\n", " intensity_df,\n", " neighbors_df,\n", " texture_df,\n", " volume_size_shape_df,\n", " ]\n", "]\n", "# add the 2 back to account for the image_set and object_id columns\n", "# that will be retained in the final merged DataFrame\n", "expected_num_columns = sum(size_of_all_dfs) + 2" ] }, { "cell_type": "code", "execution_count": 15, "id": "17246a24", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape: 6 objects x 233 features\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", " Loading ITables v2.8.0 from the internet...\n", " (need help?)\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "columns_to_merge = [\"Metadata_Experiment_ImageSet\", \"Metadata_Object_ObjectID\"]\n", "\n", "for df in [\n", " coloc_df,\n", " granularity_df,\n", " intensity_df,\n", " neighbors_df,\n", " texture_df,\n", " volume_size_shape_df,\n", "]:\n", " if not all(col in df.columns for col in columns_to_merge):\n", " raise ValueError(\n", " f\"DataFrame is missing required columns for merging: {columns_to_merge}\"\n", " )\n", "# merge all features into a single DataFrame\n", "all_features_df = pd.merge(\n", " coloc_df,\n", " pd.merge(\n", " granularity_df,\n", " pd.merge(\n", " intensity_df,\n", " pd.merge(\n", " neighbors_df,\n", " pd.merge(\n", " texture_df, volume_size_shape_df, on=columns_to_merge, how=\"inner\"\n", " ),\n", " on=columns_to_merge,\n", " how=\"inner\",\n", " ),\n", " on=columns_to_merge,\n", " how=\"inner\",\n", " ),\n", " on=columns_to_merge,\n", " how=\"inner\",\n", " ),\n", " on=columns_to_merge,\n", " how=\"inner\",\n", ")\n", "if all_features_df.shape[1] != expected_num_columns:\n", " raise ValueError(\n", " f\"\"\"\n", " Merged DataFrame has {all_features_df.shape[1]} columns\n", " but expected {expected_num_columns} based on individual DataFrames\"\"\"\n", " )\n", "print(\n", " f\"Shape: {all_features_df.shape[0]} objects x {all_features_df.shape[1]} features\"\n", ")\n", "show(all_features_df)" ] }, { "cell_type": "code", "execution_count": 16, "id": "e1d50487", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3698" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expected_num_compartments = 4\n", "expected_num_of_channels = 4\n", "expected_num_of_total_features = (\n", " expected_num_columns - 2\n", ") # subtract 2 for image_set and object_id columns\n", "expected_num_of_total_features = (\n", " expected_num_of_total_features\n", " * expected_num_compartments\n", " * expected_num_of_channels\n", " + 2\n", ")\n", "expected_num_of_total_features" ] }, { "cell_type": "code", "execution_count": 17, "id": "5c3e19fb", "metadata": {}, "outputs": [], "source": [ "if all_features_df.shape[1] != expected_num_columns:\n", " raise ValueError(\n", " f\"Expected {expected_num_columns} columns but got {all_features_df.shape[1]}\"\n", " )" ] }, { "cell_type": "markdown", "id": "pycytominer-md", "metadata": {}, "source": [ "## Normalize and Feature Select\n", "\n", "With all features merged into a single profile DataFrame, we use [`Pycytominer`](https://pycytominer.readthedocs.io) to:\n", "\n", "1. **Normalize**: z-score each feature across objects so all features are on the same scale\n", "2. **Feature select**: remove low-variance and highly correlated features that carry redundant information\n", "\n", "[`Pycytominer`](https://pycytominer.readthedocs.io) infers which columns are features (prefixed `Nuclei_`) and which are metadata (prefixed `Metadata_`) automatically." ] }, { "cell_type": "code", "execution_count": 18, "id": "pycytominer-code", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features after selection: 19 (from 231 original)\n" ] } ], "source": [ "from pycytominer import feature_select, normalize\n", "\n", "# z-score normalize across all objects\n", "normalized_df = normalize(\n", " profiles=all_features_df,\n", " method=\"standardize\",\n", " output_file=None,\n", ")\n", "\n", "# remove low-variance and highly correlated features\n", "selected_df = feature_select(\n", " profiles=normalized_df,\n", " operation=[\"variance_threshold\", \"correlation_threshold\"],\n", " output_file=None,\n", ")\n", "\n", "meta_cols = [c for c in selected_df.columns if c.startswith(\"Metadata_\")]\n", "feature_cols = [c for c in selected_df.columns if c not in meta_cols]\n", "print(\n", " f\"Features after selection: {len(feature_cols)} (from {all_features_df.shape[1] - len(meta_cols)} original)\"\n", ")" ] }, { "cell_type": "markdown", "id": "corr-md", "metadata": {}, "source": [ "## Pairwise Object Correlation\n", "\n", "The heatmap below shows the Pearson correlation between every pair of objects across all selected features. Objects with similar morphological profiles appear as warm (high correlation) clusters; dissimilar objects appear cool.\n", "\n", "Each row and column is labelled by object ID." ] }, { "cell_type": "code", "execution_count": null, "id": "corr-code", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "ename": "SyntaxError", "evalue": "invalid syntax. Maybe you meant '==' or ':=' instead of '='? (142425945.py, line 20)", "output_type": "error", "traceback": [ " \u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[19]\u001b[39m\u001b[32m, line 20\u001b[39m\n\u001b[31m \u001b[39m\u001b[31mcolorbar={title=\"Pearson r\"},\u001b[39m\n ^\n\u001b[31mSyntaxError\u001b[39m\u001b[31m:\u001b[39m invalid syntax. Maybe you meant '==' or ':=' instead of '='?\n" ] } ], "source": [ "import plotly.graph_objects as go\n", "from IPython.display import HTML\n", "\n", "# compute pairwise correlation between objects (rows = objects)\n", "feature_matrix = selected_df[feature_cols].astype(float)\n", "corr_matrix = feature_matrix.T.corr()\n", "\n", "object_ids = selected_df[\"Metadata_Object_ObjectID\"].astype(str).tolist()\n", "labels = [f\"Object {oid}\" for oid in object_ids]\n", "\n", "fig = go.Figure(\n", " go.Heatmap(\n", " z=corr_matrix.values,\n", " x=labels,\n", " y=labels,\n", " colorscale=\"RdBu\",\n", " zmid=0,\n", " zmin=-1,\n", " zmax=1,\n", " colorbar={\"title\": \"Pearson r\"},\n", " text=corr_matrix.round(2).values,\n", " texttemplate=\"%{text}\",\n", " )\n", ")\n", "\n", "fig.update_layout(\n", " title=\"Pairwise object correlation (normalized, selected features)\",\n", " height=500,\n", " xaxis={\"tickangle\": 45},\n", " margin={\"t\": 60, \"b\": 120, \"l\": 120},\n", ")\n", "\n", "HTML(fig.to_html(include_plotlyjs=\"cdn\", full_html=False))" ] } ], "metadata": { "kernelspec": { "display_name": "zedprofiler (3.13.1)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }