PBMCpedia

PBMCpedia Tutorial

Overview

PBMCpedia is a comprehensive, large-scale single-cell reference atlas of human peripheral blood mononuclear cells (PBMCs), created to facilitate deep exploration of immune cell diversity across age, sex, and disease states. This resource integrates 4.3M high-quality PBMC profiles from 25 publicly available single-cell RNA-sequencing (scRNA-seq) studies, harmonized and annotated using standardized and reproducible pipelines.

🎯 Main Functions

PBMCpedia provides several key analytical tools:

  • 📊 Dataset Overview - Explore sample metadata and study information
  • 🧬 DEGs & Pathways - Identify differentially expressed genes and enriched pathways
  • 🔬 Cell Type Overview - Visualize cell type proportions and gene expression patterns
  • 🎯 Marker Comparison - Compare dataset-derived markers with reference databases
  • 🔍 Gene Exploration - Explore gene expression across cell types and conditions
  • 🧬 TCR/BCR Analysis - Analyze T-cell and B-cell receptor repertoires
  • 🔬 Surface Proteins - Explore protein expression patterns across diseases
📋 Data Processing Pipeline

We collected raw gene expression data and metadata from 25 studies available in the Sequence Read Archive (SRA), which were mapped against the human reference genome using CellRanger. Each sample was quality-controlled to remove poor-quality cells and low-complexity samples using CellBender. Scrublet was applied for doublet detection.

All data were processed through a uniform scRNA-seq pipeline using Scanpy and AnnData objects, consisting of:

  1. Filtering cells by mitochondrial content and gene count
  2. Normalization and log-transformation
  3. Highly variable gene selection
  4. Scaling and PCA
  5. Batch correction using Harmony
  6. Clustering and UMAP for visualization
  7. Cell type annotation using both manual expert curation and automated classification with CellTypist (AIFI_L1 to L3)
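
Steps 1 to 6 above can be sketched with Scanpy roughly as follows. This is an illustrative outline, not the exact PBMCpedia pipeline: the thresholds, the number of highly variable genes, the batch column name ("study"), and the choice of Leiden clustering are all assumptions, and the function name is hypothetical. It expects an AnnData object holding raw counts and needs scanpy, harmonypy, and leidenalg installed.

```python
def run_pipeline(adata, batch_key="study", max_pct_mt=10.0, min_genes=200):
    """Sketch of steps 1-6 on an AnnData object with raw counts (assumed defaults)."""
    # Imported inside the function so the module loads without scanpy installed.
    import scanpy as sc

    # 1. Filter cells by mitochondrial content and gene count
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
    sc.pp.filter_cells(adata, min_genes=min_genes)
    adata = adata[adata.obs["pct_counts_mt"] < max_pct_mt].copy()

    # 2. Normalization and log-transformation
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # 3. Highly variable gene selection
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)

    # 4. Scaling and PCA
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata, n_comps=50)

    # 5. Batch correction with Harmony (writes adata.obsm["X_pca_harmony"])
    sc.external.pp.harmony_integrate(adata, key=batch_key)

    # 6. Clustering and UMAP on the Harmony-corrected embedding
    sc.pp.neighbors(adata, use_rep="X_pca_harmony")
    sc.tl.leiden(adata)
    sc.tl.umap(adata)
    return adata
```

Step 7 (annotation with CellTypist and manual curation) is left out of the sketch because the model files and curation rules are specific to the project.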
📸 Graphical Abstract
PBMCpedia Graphical Abstract
🏠 Homepage Example Cases

The homepage provides quick access to common analysis scenarios through example links:

Homepage Example Cases
Getting Started

Start by exploring the Dataset Overview to understand the available data, then use the example links on the homepage to quickly access specific analyses. Each section includes detailed instructions and interactive visualizations.

Dataset Overview & Metadata

The Dataset Overview page provides a comprehensive summary of all samples and studies included in PBMCpedia. This is your starting point for understanding the available data and planning your analyses.

📊 Disease Distribution Bar Chart

The first plot displays all diseases found in the metadata on the x-axis, with bars representing the number of associated samples per disease. It's an interactive bar chart: when you hover over a bar, a tooltip appears showing detailed summary information for the corresponding study, including:

  • BioProject accession - Links to NCBI BioProject database
  • Disease - The specific condition or healthy control
  • Total sample count - Number of samples in the study
  • Sex breakdown - Distribution of male/female/unknown donors
  • Details - Short description of the study
  • DOI - Link to the original publication
Additional Metadata Visualizations
📋 Sample Metadata Table

Below the bar chart, you'll find a comprehensive metadata table with the following columns:

  • Project - Project identifiers (e.g., P10, P13)
  • BioSample - SAMN accession numbers (unique biosample identifiers)
  • BioProject - PRJNA accessions (NCBI BioProject study IDs)
  • Age - Donor age as displayed for the sample
  • Sex - Male, Female, or Unknown
  • Disease - Donor condition (e.g., COVID-19, Healthy, Alzheimer's disease)
🔍 Interactive Features
  • Search - Filter by typing keywords (e.g., COVID, female)
  • Sort - Click column headers to sort by any field
  • Pagination - Navigate through large datasets
  • Export - Download filtered results as CSV
Metadata Table with Search and Sorting
Pro Tip

Use the metadata table to identify specific samples or studies for your analysis. The search function is particularly useful for finding samples from specific diseases or demographic groups.
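
The same search and sort can be reproduced offline on an exported CSV. The sketch below assumes the export uses the column names listed above as headers; the rows are made-up placeholders, not real PBMCpedia records, and the helper name is illustrative.

```python
import csv
import io

# Placeholder rows standing in for an exported file; all accessions are made up.
EXPORT = """Project,BioSample,BioProject,Age,Sex,Disease
P10,SAMN00000001,PRJNA0000001,34,Female,COVID-19
P10,SAMN00000002,PRJNA0000001,61,Male,Healthy
P13,SAMN00000003,PRJNA0000002,72,Female,Healthy
"""

def search_rows(rows, keyword):
    """Keep rows where any field contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [r for r in rows if any(needle in str(v).lower() for v in r.values())]

rows = list(csv.DictReader(io.StringIO(EXPORT)))
hits = search_rows(rows, "female")
hits_sorted = sorted(hits, key=lambda r: r["Disease"])
for row in hits_sorted:
    print(row["BioSample"], row["Sex"], row["Disease"])
```

In this sketch a substring search for "male" also matches "Female", so an exact comparison on the Sex column is the safer choice when filtering by sex.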

DEGs & Pathways Analysis

The DEGs and Pathways page allows you to identify differentially expressed genes and enriched biological pathways across different subsets of the PBMC dataset. This is one of the most powerful analytical tools in PBMCpedia for understanding disease mechanisms and immune responses.

🎛️ Filter Selection

To begin your analysis, define your subset of interest by selecting the following filters:

  • Cell Type & Resolution: Choose between broad (major cell types) or fine (detailed subtypes) resolution
  • Disease: Focus on specific conditions (e.g., COVID-19, Alzheimer's disease) or compare healthy vs diseased states
  • Sex: Filter by male or female donors, or include all donors
  • Age Group: Select age brackets - Young (0–24 yrs), Adult (25–64 yrs), Elderly (≥65 yrs), or All
  • Study: Focus on specific studies or analyze across all studies
DEG and Pathways Filter Selection Interface
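
When preparing your own donor metadata for comparison, the age brackets can be written as a small function. This is a minimal sketch, assuming boundaries in whole years and that a 65-year-old donor counts as Elderly; the function name is illustrative.

```python
def age_group(age_years):
    """Map a donor age in years to the Young / Adult / Elderly brackets."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years < 25:
        return "Young"    # 0-24 yrs
    if age_years < 65:
        return "Adult"    # 25-64 yrs
    return "Elderly"      # 65 yrs and older (assumed boundary)

print([age_group(a) for a in (3, 24, 25, 64, 65, 80)])
```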

After setting your filters, click Run Analysis to compute DEGs and associated pathways. The analysis typically takes a few seconds to complete.

🧬 Differentially Expressed Genes (DEGs)

The analysis results begin with an interactive volcano plot that visualizes the significance and magnitude of gene expression changes:

  • X-axis: Log2 fold change (magnitude of expression difference)
  • Y-axis: -log10(p-value) (statistical significance)
  • Color coding: Red points indicate significantly upregulated genes, blue points show downregulated genes
  • Interactive features: Hover over points to see gene names and statistics
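
To make the axes concrete, the sketch below computes the plotted coordinates for one gene and assigns a color class. The cutoffs (|log2FC| ≥ 1, p < 0.05) are assumptions chosen for illustration; the thresholds used by the PBMCpedia plot may differ.

```python
import math

def volcano_point(log2fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) for one gene on a volcano plot."""
    # Clamp p-values of exactly 0 so that log10 stays defined.
    y = -math.log10(max(p_value, 1e-300))
    significant = p_value < p_cutoff
    if significant and log2fc >= fc_cutoff:
        label = "up"      # drawn in red
    elif significant and log2fc <= -fc_cutoff:
        label = "down"    # drawn in blue
    else:
        label = "ns"      # not significant
    return log2fc, y, label

print(volcano_point(2.0, 0.001))
print(volcano_point(-1.5, 0.2))
```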

You can manually highlight up to 10 specific genes by entering their names as a comma-separated list in the "Highlight Genes" field.

Interactive Volcano Plot with Gene Selection
📊 DEG Results Table

Below the volcano plot, a detailed DEG table provides comprehensive information for each differentially expressed gene:

  • Gene - Gene symbol and name
  • log2FC - Log2 fold change (positive = upregulated, negative = downregulated)
  • p-value - Statistical significance (raw p-value)
  • Cell Type - The cell type where the gene is differentially expressed
  • Age - Age group analyzed
  • Sex - Sex group analyzed
  • Disease - Disease condition compared
  • Resolution - Cell type annotation resolution used
  • Study - Study identifier
DEG Results Table
📊 Pathway Enrichment Analysis

Following the DEGs, a bar plot displays the top enriched pathways, ranked by statistical significance.

A detailed pathway table is also provided, including:

  • Pathway - Pathway name
  • Description - Detailed pathway description
  • Database - Source database (e.g., BP)
  • p-value - Statistical significance
  • Score - Enrichment score (calculated using GSEA)
  • Cell Type - Cell type analyzed
  • Age - Age group analyzed
  • Sex - Sex group analyzed
  • Disease - Disease condition compared
  • Resolution - Cell type annotation resolution used
  • Study - Study identifier
Pathway Enrichment Results

Cell Type Overview

The Cell Type Overview page provides an interactive UMAP visualization of all cell types in the dataset, allowing you to explore cell type distributions, proportions, and gene expression patterns.

🗺️ Interactive UMAP Visualization

The main feature is an interactive UMAP plot showing all cell types:

  • Resolution Selection: Toggle between broad (major cell types) and fine (detailed subtypes) resolution
  • Interactive Dots: Each dot represents a cell type, sized by the number of cells
  • Click to Select: Click on any dot to view detailed information about that cell type
  • Cell Type Information: Selected cell types show cell count and proportion of the dataset
Interactive UMAP Cell Type Visualization
📊 Cell Type Details Panel

When you select a cell type, the right panel displays:

  • Cell Count: Number of cells of this type in the dataset
  • Proportion: Percentage of total cells
  • Marker Genes Link: Direct access to marker gene analysis for this cell type
  • Dropdown Selection: Alternative way to select cell types
Cell Type Details Panel
🧬 Gene Expression Dot Plot

The page also includes a gene expression dot plot feature:

  • Gene Input: Enter up to 10 genes (comma-separated)
  • Resolution Selection: Choose broad or fine resolution for analysis
  • Interactive Plot: Shows gene expression across cell types
  • Error Handling: Provides suggestions for gene names not found
Gene Expression Dot Plot
πŸ’‘ Use Cases
  • Cell Type Exploration: Discover the diversity of immune cell types in the dataset
  • Proportion Analysis: Understand the relative abundance of different cell types
  • Gene Expression Validation: Verify expected gene expression patterns in specific cell types
  • Marker Gene Discovery: Use the dot plot to identify cell type-specific markers
Pro Tip

Start with the broad resolution to get an overview of major cell types, then switch to fine resolution to explore detailed subtypes. Use the example links on the homepage to quickly access specific cell types like Memory CD8 T cells.

Marker Comparison

The Cell Type Markers section allows you to explore both known and dataset-derived marker genes for specific immune cell types.

🎛️ Selection Options

To begin, choose the following from the sidebar:

  • Cell Type Resolution: Broad or Fine
  • Cell Type: Select the specific population of interest
  • Genes: (Optional) Enter a comma-separated list of genes to filter by

Click "Filter Marker Tables" to update the results based on your selection.

Cell Type Markers Filter Selection
📊 Marker Genes in Dataset

The first output is a marker gene table derived from our PBMC dataset. This table displays:

  • Gene: The marker gene
  • Cell Type: The target immune cell type
  • logFC: Log₂ fold change compared to other cell types
  • Adjusted p-value: FDR-corrected significance value

These represent high-confidence markers specific to the selected cell type.
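
For readers unfamiliar with FDR correction, the sketch below shows the Benjamini-Hochberg procedure, one common way to obtain adjusted p-values. The tutorial only states that the values are FDR-corrected, so treating the method as Benjamini-Hochberg is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values grow.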

☁️ Cell Type WordCloud

Below the marker table, a WordCloud visualizes cell types that are frequently and strongly associated with the selected genes. Larger font size indicates stronger or more frequent association.

📘 Gene × Database Summary

The second table provides a summary matrix showing whether each gene appears in major cell marker databases:

  • ✅ = Gene is listed in the database
  • ❌ = Gene is not listed

Columns include:

  • Gene
  • CellMarker
  • Human Protein Atlas (HPA)
  • MSigDB Immune
  • PanglaoDB

This table enables cross-referencing of dataset-derived markers with curated resources.

📚 Database Annotations

The third and final table shows detailed database annotations for your selected genes:

  • Gene
  • Database: e.g., MSigDB Immune, CellMarker, etc.
  • Indicated Cell Type: The associated immune cell type from the database

This helps validate candidate markers or identify discrepancies between datasets and public annotations.

Cell Type Markers Table and Word Cloud

Gene Exploration

The Gene Exploration page allows you to investigate the expression patterns of specific genes across different cell types and biological conditions. This tool is perfect for validating gene expression patterns, exploring cell type-specific markers, or investigating genes of interest.

🎛️ Gene Selection

To begin your analysis:

  1. Enter Gene Names: Type gene symbols (e.g., CD3D, CD4, IL2) in the text area, one per line or comma-separated (maximum 10 genes)
  2. Select Splitting Options: Choose how to split the data by checking "Disease" and/or "Sex"
  3. Click Generate Plot: The system will create interactive visualizations
Gene Exploration Interface
📊 Expression Visualization

The analysis generates an interactive heatmap visualization:

🔥 Heatmap
  • X-axis: Cell types (broad resolution)
  • Y-axis: Selected genes
  • Color intensity: Gene expression level (red = high, blue = low)
  • Interactive features: Hover to see exact expression values
  • Data splitting: Can be split by disease and/or sex based on your selection
πŸ’‘ Use Cases
  • Marker Validation: Verify that known cell type markers are expressed in expected cell types
  • Gene Discovery: Explore expression patterns of novel genes
  • Comparative Studies: Compare expression of related genes across cell types
Pro Tip

Use the example links on the homepage to quickly explore common immune genes like CD3D, CD4, or IL2. These examples demonstrate typical expression patterns across PBMC cell types.

TCR/BCR (V(D)J) Analysis

The TCR/BCR (V(D)J) Dashboard provides comprehensive analysis of T-cell and B-cell receptor repertoires, including clonotype diversity, V/J gene usage, and top clones. This tool is essential for understanding adaptive immune responses and clonal expansion patterns.

🎛️ Sample Selection

The dashboard allows you to:

  • Select Specific Samples: Choose individual samples from the dropdown to analyze their TCR/BCR repertoires
  • View All Samples: Select "All samples" to see aggregated statistics across the entire dataset
  • Sample Metadata: When a specific sample is selected, metadata information is displayed including study, project, disease, sex, and age
VDJ Dashboard Sample Selection
📊 Key Metrics & Visualizations
📈 Key Performance Indicators (KPIs)
  • Total Cells: Number of cells with TCR/BCR data
  • Total Samples: Number of samples analyzed
🧬 V Gene Usage
  • Bar Chart: Shows the frequency of V gene usage across the selected sample(s)
  • Locus Filtering: Filter by specific loci (TRA, TRB, IGH, IGK, IGL)
  • Top 10 Display: Focuses on the most frequently used V genes
🧬 J Gene Usage
  • Bar Chart: Displays J gene usage patterns
  • Locus Filtering: Same filtering options as V gene usage
  • Comparative Analysis: Compare J gene usage between samples
📏 CDR3 Length Distribution
  • Histogram: Shows the distribution of CDR3 sequence lengths
  • Sample Comparison: Compare CDR3 length patterns between samples
  • Biological Insights: CDR3 length can indicate T-cell maturation and selection
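
The histogram boils down to counting sequences per length. A minimal sketch, using illustrative placeholder sequences rather than PBMCpedia data:

```python
from collections import Counter

def cdr3_length_distribution(cdr3_sequences):
    """Count CDR3 amino acid sequences per length, skipping empty entries."""
    lengths = Counter(len(seq) for seq in cdr3_sequences if seq)
    return dict(sorted(lengths.items()))

# Placeholder sequences for illustration only.
example = ["CASSLGQETQYF", "CASRLGQETQYF", "CASSPGTGYEQYF", ""]
print(cdr3_length_distribution(example))
```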
🔗 Chain Pairing
  • Pairing Matrix: Shows the frequency of αβ chain pairings in T cells
  • Diversity Assessment: Helps understand TCR diversity and clonal expansion
V and J Gene Usage Bar Charts CDR3 Length Distribution and Chain Pairing
🎯 Top Clonotypes

The Top Clonotypes table provides detailed information about the most frequent TCR/BCR clones:

  • Clone ID: Unique identifier for each clonotype
  • Cell Count: Number of cells with this specific clonotype
  • CDR3 Sequence: The actual CDR3 amino acid sequence
  • View Chains: Click to see detailed chain information for the clonotype
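
A table like this can be derived from per-cell clonotype assignments by counting cells per clone. The sketch below assumes one clone identifier per cell, with None for cells lacking an assignment; the identifiers are placeholders.

```python
from collections import Counter

def top_clonotypes(cell_clone_ids, n=10):
    """Return the n most frequent clone IDs as (clone_id, cell_count) pairs."""
    counts = Counter(cid for cid in cell_clone_ids if cid is not None)
    return counts.most_common(n)

# One entry per cell; placeholder clone IDs.
cells = ["c1", "c2", "c1", None, "c3", "c1", "c2"]
print(top_clonotypes(cells, n=2))
```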

When a specific sample is selected, the table is headed "What are the top TCR clones in sample [sample_name]?" so that it is clear which sample's data you are viewing.

Top Clonotypes Table with Chain Details
πŸ’‘ Use Cases
  • Immune Response Analysis: Study clonal expansion during infection or vaccination
  • Disease Research: Compare TCR/BCR repertoires between healthy and diseased states
  • Sample Quality Control: Assess the quality and diversity of TCR/BCR data
  • Comparative Studies: Compare receptor usage patterns across different samples or conditions
Pro Tip

Use the example link on the homepage to quickly explore TCR clones in a specific sample. The sample metadata box provides important context about the biological sample you're analyzing.

Surface Proteins Analysis

The Surface Proteins page allows you to explore protein expression patterns across different diseases and cell types. This tool is particularly useful for understanding cell surface marker expression and identifying potential therapeutic targets.

🎛️ Protein Selection

To begin your analysis:

  1. Select Protein: Choose a surface protein from the dropdown menu (e.g., CD56, CD3, CD4)
  2. Select Cell Type: Choose a specific cell type to focus on, or select "All" for overall expression
  3. View Results: The plot automatically updates to show expression patterns
Surface Proteins Interface
📊 Expression Visualization

The analysis generates an interactive bar chart showing:

  • X-axis: Disease conditions (e.g., COVID-19, Healthy, Alzheimer's disease)
  • Y-axis: Mean protein expression level
  • Bars: Ordered by expression level (highest to lowest)
  • Interactive features: Hover over bars to see exact expression values
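
The chart's content corresponds to a group-wise mean followed by a descending sort. A minimal sketch with made-up values, assuming one (disease, expression) pair per measurement:

```python
from collections import defaultdict
from statistics import mean

def mean_by_disease(records):
    """Average expression per disease, ordered from highest to lowest mean."""
    groups = defaultdict(list)
    for disease, value in records:
        groups[disease].append(value)
    means = {disease: mean(values) for disease, values in groups.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)

# Made-up values for illustration only.
records = [("Healthy", 1.0), ("COVID-19", 3.0), ("Healthy", 2.0)]
print(mean_by_disease(records))
```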
📋 Available Proteins

The dataset includes a comprehensive set of surface proteins commonly used in immunology research:

  • CD Markers: CD3, CD4, CD8, CD56, CD19, CD14, etc.
  • Activation Markers: CD25, CD69, CD38, etc.
  • Co-stimulatory Molecules: CD28, CD80, CD86, etc.
  • Adhesion Molecules: CD11a, CD11b, CD11c, etc.
πŸ’‘ Use Cases
  • Biomarker Discovery: Identify proteins that are differentially expressed in specific diseases
  • Therapeutic Target Identification: Find surface proteins that could be targeted for therapy
  • Cell Type Characterization: Understand the surface protein profile of different cell types
  • Disease Comparison: Compare protein expression patterns across different disease conditions
Pro Tip

Use the example link on the homepage to quickly explore CD56 expression in NK cells. This demonstrates how surface protein expression can vary across different cell types and disease conditions.

Cell–Cell Communication

This section will outline how to interpret ligand–receptor interaction results, including summary plots and per-pair tables.

Coming soon: example analyses and visuals.

Downloads

In the Downloads section, you can access all major data outputs and resources generated from the PBMCpedia project. These files support further analysis, benchmarking, or model development in your own research environment.

🧬 1. Whole Dataset (.h5ad)

Download the complete single-cell PBMC dataset in AnnData (.h5ad) format (~160.8 GB), including:

  • Quality-controlled and filtered cells
  • Gene expression matrix (normalized, log-transformed, and scaled)
  • Cell type annotations (multiple resolutions)
  • Sample and donor metadata
  • Harmony-integrated embeddings for batch correction

This dataset is ready for downstream analysis in Python tools like Scanpy or scvi-tools.
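
Because the file is far larger than typical workstation memory, it may help to open it in backed mode so that the expression matrix stays on disk. The sketch below uses anndata's read_h5ad with backed="r"; the file path is whatever you saved the download as, and the function name is illustrative.

```python
def open_atlas(path):
    """Open the .h5ad file read-only without loading the matrix into memory."""
    # Imported inside the function so the module loads without anndata installed.
    import anndata as ad

    return ad.read_h5ad(path, backed="r")

# Usage (path is a placeholder for your local copy):
# adata = open_atlas("path/to/downloaded_file.h5ad")
# print(adata.shape)           # (n_cells, n_genes)
# print(adata.obs.columns)     # available cell-level annotations
```

Cell-level annotations in adata.obs are held in memory, so they can be inspected and used to select a subset before loading any expression values.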

🧪 2. Pseudobulks (.h5ad)

Download pseudo-bulk versions of the dataset where gene expression is aggregated per cell type:

  • Broad resolution (~245 MB) - Aggregated by broad cell type categories
  • Fine resolution (~287.3 MB) - Aggregated by fine-grained cell type annotations

This format is useful for bulk-like differential expression, pathway enrichment, or machine learning on summarized profiles.

📊 3. Analysis Results (.csv)

Download pre-computed analysis results:

  • DEG Results (~20.1 KB) - Differential gene expression results comparing disease vs. control across cell types
  • Pathway Enrichment (~906.5 KB) - Pathway enrichment summary across cell types/contrasts with core stats (NES/FDR)

Ideal for downstream filtering, visualization, or integration with gene set databases.

🧬 4. VDJ Sequencing Data

Download TCR/BCR repertoire data:

  • VDJ AIRR Data - TCR/BCR sequencing data in AIRR format (parquet format for efficient storage)
  • VDJ Cell Metadata - VDJ cell metadata and annotations (compressed CSV format)
🧪 5. Protein/ADT Data (.csv)

Download protein expression data:

  • Protein Means by Disease - Mean protein expression by disease condition
Downloads Page Interface

All resources are made available to support transparent, reproducible, and scalable research on immune cells in health and disease.

API

PBMCpedia provides a RESTful API for programmatic access to its data and analysis results. This section outlines the available endpoints and how to use them.

🔗 API Endpoints

In production, all endpoints are served under the base path /pbmcpedia. Main endpoints:

  • /pbmcpedia/api/v1/degs/ - Differential gene expression
  • /pbmcpedia/api/v1/pathways/ - Pathway enrichment
  • /pbmcpedia/api/v1/metadata/ - Sample metadata
  • /pbmcpedia/api/v1/gene_expr_celltype/ - Gene expression per cell type
  • /pbmcpedia/api/v1/marker_table_ds/ - Dataset-specific marker genes
  • /pbmcpedia/api/v1/protein_list/ - Available proteins/ADT markers
  • /pbmcpedia/api/v1/adt_summary_api/ - ADT summary statistics
  • /pbmcpedia/api/v1/chains/by_clone - TCR/BCR chain info for clones
  • /pbmcpedia/api/v1/ping/ - API health check
🛠️ Example Requests

Below are example requests using curl and Python (using the requests library).

🔗 curl Examples

To get differential gene expression results:

curl -G 'https://web.ccb.uni-saarland.de/pbmcpedia/api/v1/degs/' \
  --data-urlencode 'cell_type=CD4+ T cell' \
  --data-urlencode 'disease=COVID-19' \
  --data-urlencode 'limit=10' \
  --data-urlencode 'offset=0'
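
The --data-urlencode flag matters here because the cell type contains a "+" and spaces, which would otherwise be misread in a query string. As a small illustration using only the Python standard library, the same query can be encoded like this:

```python
from urllib.parse import urlencode

BASE = "https://web.ccb.uni-saarland.de/pbmcpedia/api/v1/degs/"
params = {
    "cell_type": "CD4+ T cell",  # "+" must be sent as %2B, not as a literal plus
    "disease": "COVID-19",
    "limit": 10,
    "offset": 0,
}
url = f"{BASE}?{urlencode(params)}"
print(url)
```

A literal "+" in a query string is decoded as a space, so an unencoded request would ask for "CD4  T cell" instead of "CD4+ T cell".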

To get pathway enrichment results:

curl -G 'https://web.ccb.uni-saarland.de/pbmcpedia/api/v1/pathways/' \
  --data-urlencode 'cell_type=CD4+ T cell' \
  --data-urlencode 'resolution=broad' \
  --data-urlencode 'limit=10'

To get sample metadata:

curl -G 'https://web.ccb.uni-saarland.de/pbmcpedia/api/v1/metadata/' \
  --data-urlencode 'study=PRJNA1040889' \
  --data-urlencode 'limit=10'

Health check:

curl 'https://web.ccb.uni-saarland.de/pbmcpedia/api/ping/'
🐍 Python Example

Using the requests library:

import requests

base_url = "https://web.ccb.uni-saarland.de/pbmcpedia"

# DEGs
resp = requests.get(f"{base_url}/api/v1/degs/", params={
    "cell_type": "CD4+ T cell",
    "disease": "COVID-19",
    "limit": 10,
})
print(resp.json())

# Pathways
resp = requests.get(f"{base_url}/api/v1/pathways/", params={
    "cell_type": "CD4+ T cell",
    "resolution": "broad",
    "limit": 10,
})
print(resp.json())

# Metadata
resp = requests.get(f"{base_url}/api/v1/metadata/", params={
    "study": "PRJNA1040889",
    "limit": 10,
})
print(resp.json())
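
To retrieve more than one page, the limit and offset parameters shown in the curl example can be combined in a loop. The helper below is a hypothetical sketch: it assumes each call returns a list of records and that a page shorter than limit marks the end. The fetch_page callable is supplied by you, which keeps the paging logic independent of the actual response shape.

```python
def iter_records(fetch_page, limit=100, max_pages=1000):
    """Yield records page by page until a short or empty page is returned."""
    offset = 0
    for _ in range(max_pages):  # hard cap as a safety net against endless loops
        page = fetch_page(limit=limit, offset=offset)
        if not page:
            return
        yield from page
        if len(page) < limit:
            return
        offset += limit

# Stand-in for an API call, used here so the sketch runs offline.
fake_data = list(range(25))

def fake_fetch(limit, offset):
    return fake_data[offset:offset + limit]

print(len(list(iter_records(fake_fetch, limit=10))))
```

With requests, fetch_page could wrap a call to the degs endpoint that passes limit and offset in params and returns resp.json(). If the records are nested under a key in the response, unwrap them inside fetch_page; whether endpoints other than degs accept offset is not stated here and should be checked in the API docs.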
💡 Tips
  • Most endpoints support query parameters for filtering (e.g., cell_type, disease, study)
  • Use the /api/ping/ endpoint to check if the API is available
  • For large datasets, consider using the download endpoints instead of API calls
  • All API responses are in JSON format
  • API docs: https://web.ccb.uni-saarland.de/pbmcpedia/api/docs/