PBMCpedia Tutorial
Overview
PBMCpedia is a comprehensive, large-scale single-cell reference atlas of human peripheral blood mononuclear cells (PBMCs), created to facilitate deep exploration of immune cell diversity across age, sex, and disease states. This resource integrates 4.3M high-quality PBMC profiles from 25 publicly available single-cell RNA-sequencing (scRNA-seq) studies, harmonized and annotated using standardized and reproducible pipelines.
Main Functions
PBMCpedia provides several key analytical tools:
- Dataset Overview - Explore sample metadata and study information
- DEGs & Pathways - Identify differentially expressed genes and enriched pathways
- Cell Type Overview - Visualize cell type proportions and gene expression patterns
- Marker Comparison - Compare dataset-derived markers with reference databases
- Gene Exploration - Explore gene expression across cell types and conditions
- TCR/BCR Analysis - Analyze T-cell and B-cell receptor repertoires
- Surface Proteins - Explore protein expression patterns across diseases
Data Processing Pipeline
We collected raw gene expression data and metadata from 25 studies available in the Sequence Read Archive (SRA) and mapped the reads against the human reference genome using Cell Ranger. Each sample was quality-controlled using CellBender to remove poor-quality cells and low-complexity samples. Scrublet was applied for doublet detection.
All data were processed through a uniform scRNA-seq pipeline using Scanpy and AnnData objects, consisting of:
- Filtering cells by mitochondrial content and gene count
- Normalization and log-transformation
- Highly variable gene selection
- Scaling and PCA
- Batch correction using Harmony
- Clustering and UMAP for visualization
- Cell type annotation using both manual expert curation and automated classification with CellTypist (AIFI_L1 to L3)
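The steps listed above can be sketched with Scanpy roughly as follows. This is a minimal illustration, not the pipeline PBMCpedia actually ran: the thresholds in `QCThresholds`, the `batch_key` column name `"study"`, and the choice of Leiden clustering are all assumptions, and the CellTypist annotation step is left out. Running it requires `scanpy`, `harmonypy`, and a Leiden backend to be installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QCThresholds:
    """Illustrative cut-offs; the values used for PBMCpedia are not stated here."""
    min_genes: int = 200       # assumed minimum number of detected genes per cell
    max_pct_mt: float = 20.0   # assumed maximum mitochondrial percentage
    n_top_genes: int = 2000    # assumed number of highly variable genes


def run_pipeline(adata, batch_key="study", qc=QCThresholds()):
    """Apply the listed processing steps to an AnnData object and return it."""
    import scanpy as sc  # imported lazily so the sketch can be read without Scanpy

    # Filter cells by gene count and mitochondrial content
    sc.pp.filter_cells(adata, min_genes=qc.min_genes)
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(
        adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
    )
    adata = adata[adata.obs["pct_counts_mt"] < qc.max_pct_mt].copy()

    # Normalization and log-transformation
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Highly variable gene selection
    sc.pp.highly_variable_genes(adata, n_top_genes=qc.n_top_genes)
    adata = adata[:, adata.var["highly_variable"]].copy()

    # Scaling and PCA
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata)

    # Batch correction with Harmony (writes adata.obsm["X_pca_harmony"])
    sc.external.pp.harmony_integrate(adata, key=batch_key)

    # Clustering (Leiden is an assumption) and UMAP on the corrected embedding
    sc.pp.neighbors(adata, use_rep="X_pca_harmony")
    sc.tl.leiden(adata)
    sc.tl.umap(adata)
    return adata
```

Keeping the thresholds in one small dataclass makes it easy to see which numbers are placeholders and to swap in other values.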
Graphical Abstract
[Figure: PBMCpedia graphical abstract]
Homepage Example Cases
The homepage provides quick access to common analysis scenarios through example links.
Getting Started
Start by exploring the Dataset Overview to understand the available data, then use the example links on the homepage to quickly access specific analyses. Each section includes detailed instructions and interactive visualizations.
Dataset Overview & Metadata
The Dataset Overview page provides a comprehensive summary of all samples and studies included in PBMCpedia. This is your starting point for understanding the available data and planning your analyses.
Disease Distribution Bar Chart
The first plot displays all diseases found in the metadata on the x-axis, with bars representing the number of associated samples per disease. It's an interactive bar chart: when you hover over a bar, a tooltip appears showing detailed summary information for the corresponding study, including:
- BioProject accession - Links to NCBI BioProject database
- Disease - The specific condition or healthy control
- Total sample count - Number of samples in the study
- Sex breakdown - Distribution of male/female/unknown donors
- Details - Short description of the study
- DOI - Link to the original publication
Sample Metadata Table
Below the bar chart, you'll find a comprehensive metadata table with the following columns:
- Project - Project identifiers (e.g., P10, P13)
- BioSample - SAMN accession numbers (unique biosample identifiers)
- BioProject - PRJNA accessions (NCBI BioProject study IDs)
- Age - Age display information
- Sex - Male, Female, or Unknown
- Disease - Donor condition (e.g., COVID-19, Healthy, Alzheimer's disease)
Interactive Features
- Search - Filter by typing keywords (e.g., COVID, female)
- Sort - Click column headers to sort by any field
- Pagination - Navigate through large datasets
- Export - Download filtered results as CSV
Pro Tip
Use the metadata table to identify specific samples or studies for your analysis. The search function is particularly useful for finding samples from specific diseases or demographic groups.
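Once the table has been exported as CSV, the same filtering can be repeated offline. The sketch below uses only the Python standard library and assumes the exported header row matches the column names listed above (Project, BioSample, BioProject, Age, Sex, Disease). The rows are placeholders for illustration, not real PBMCpedia samples.

```python
import csv
import io


def filter_samples(csv_text, **criteria):
    """Return rows whose columns match every criterion (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    wanted = {col: val.lower() for col, val in criteria.items()}
    return [
        row for row in reader
        if all((row.get(col) or "").lower() == val for col, val in wanted.items())
    ]


# Placeholder rows; accession numbers and ages are made up.
example_csv = """Project,BioSample,BioProject,Age,Sex,Disease
P10,SAMN00000001,PRJNA0000001,34,Female,COVID-19
P10,SAMN00000002,PRJNA0000001,61,Male,COVID-19
P13,SAMN00000003,PRJNA0000002,29,Female,Healthy
"""

matches = filter_samples(example_csv, Disease="COVID-19", Sex="female")
print([row["BioSample"] for row in matches])
```

For a real export, read the file contents with `open(path, newline="").read()` and pass them in place of `example_csv`.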
DEGs & Pathways Analysis
The DEGs and Pathways page allows you to identify differentially expressed genes and enriched biological pathways across different subsets of the PBMC dataset. This is one of the most powerful analytical tools in PBMCpedia for understanding disease mechanisms and immune responses.
Filter Selection
To begin your analysis, define your subset of interest by selecting the following filters:
- Cell Type & Resolution: Choose between broad (major cell types) or fine (detailed subtypes) resolution
- Disease: Focus on specific conditions (e.g., COVID-19, Alzheimer's disease) or compare healthy vs diseased states
- Sex: Filter by male, female, or all donors
- Age Group: Select age brackets - Young (0–24 yrs), Adult (25–64 yrs), Elderly (≥65 yrs), or All
- Study: Focus on specific studies or analyze across all studies
After setting your filters, click Run Analysis to compute DEGs and associated pathways. The analysis typically takes a few seconds to complete.
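If you want to apply the same age brackets to your own donor table, for example to match the Age Group filter when working with downloaded data, a small helper is enough. This is a sketch that assumes ages are given in whole years and that a 65-year-old donor belongs to the Elderly bracket; the function name is hypothetical.

```python
def age_group(age_years):
    """Map a donor age in years to the Young / Adult / Elderly brackets."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years < 25:
        return "Young"    # 0-24 yrs
    if age_years < 65:
        return "Adult"    # 25-64 yrs
    return "Elderly"      # assumed to start at 65 yrs


print([age_group(a) for a in (3, 24, 25, 64, 65, 80)])
```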
Differentially Expressed Genes (DEGs)
The analysis results begin with an interactive volcano plot that visualizes the significance and magnitude of gene expression changes:
- X-axis: Log2 fold change (magnitude of expression difference)
- Y-axis: -log10(p-value) (statistical significance)
- Color coding: Red points indicate significantly upregulated genes, blue points show downregulated genes
- Interactive features: Hover over points to see gene names and statistics
You can manually highlight up to 10 specific genes by entering their names as a comma-separated list in the "Highlight Genes" field.
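To make the volcano plot axes concrete, the sketch below converts a gene's log2 fold change and p-value into plot coordinates and a direction label. The cut-offs (|log2FC| ≥ 1, p < 0.05) are assumptions chosen for illustration; the thresholds PBMCpedia uses for coloring are not stated here.

```python
import math


def volcano_point(log2fc, p_value, lfc_cutoff=1.0, p_cutoff=0.05, p_floor=1e-300):
    """Return (x, y, label) for one gene on a volcano plot."""
    # Clamp the p-value so that p == 0 does not break the logarithm
    y = -math.log10(max(p_value, p_floor))
    if p_value < p_cutoff and log2fc >= lfc_cutoff:
        label = "up"      # drawn in red on the plot
    elif p_value < p_cutoff and log2fc <= -lfc_cutoff:
        label = "down"    # drawn in blue on the plot
    else:
        label = "ns"      # not significant under the assumed cut-offs
    return log2fc, y, label


print(volcano_point(2.0, 0.001))
print(volcano_point(-1.5, 0.2))
```

A gene with a large fold change but a weak p-value stays unlabelled, which is why both axes matter when reading the plot.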
DEG Results Table
Below the volcano plot, a detailed DEG table provides comprehensive information for each differentially expressed gene:
- Gene - Gene symbol and name
- log2FC - Log2 fold change (positive = upregulated, negative = downregulated)
- p-value - Statistical significance (raw p-value)
- Cell Type - The cell type where the gene is differentially expressed
- Age - Age group analyzed
- Sex - Sex group analyzed
- Disease - Disease condition compared
- Resolution - Cell type annotation resolution used
- Study - Study identifier
Pathway Enrichment Analysis
Following the DEGs, a bar plot displays the top enriched pathways, ranked by statistical significance.
A detailed pathway table is also provided, including:
- Pathway - Pathway name
- Description - Detailed pathway description
- Database - Source database (e.g., BP)
- p-value - Statistical significance
- Score - Enrichment score (calculated using GSEA)
- Cell Type - Cell type analyzed
- Age - Age group analyzed
- Sex - Sex group analyzed
- Disease - Disease condition compared
- Resolution - Cell type annotation resolution used
- Study - Study identifier
Cell Type Overview
The Cell Type Overview page provides an interactive UMAP visualization of all cell types in the dataset, allowing you to explore cell type distributions, proportions, and gene expression patterns.
Interactive UMAP Visualization
The main feature is an interactive UMAP plot showing all cell types:
- Resolution Selection: Toggle between broad (major cell types) and fine (detailed subtypes) resolution
- Interactive Dots: Each dot represents a cell type, sized by the number of cells
- Click to Select: Click on any dot to view detailed information about that cell type
- Cell Type Information: Selected cell types show cell count and proportion of the dataset
Cell Type Details Panel
When you select a cell type, the right panel displays:
- Cell Count: Number of cells of this type in the dataset
- Proportion: Percentage of total cells
- Marker Genes Link: Direct access to marker gene analysis for this cell type
- Dropdown Selection: Alternative way to select cell types
Gene Expression Dot Plot
The page also includes a gene expression dot plot feature:
- Gene Input: Enter up to 10 genes (comma-separated)
- Resolution Selection: Choose broad or fine resolution for analysis
- Interactive Plot: Shows gene expression across cell types
- Error Handling: Provides suggestions for gene names not found
Use Cases
- Cell Type Exploration: Discover the diversity of immune cell types in the dataset
- Proportion Analysis: Understand the relative abundance of different cell types
- Gene Expression Validation: Verify expected gene expression patterns in specific cell types
- Marker Gene Discovery: Use the dot plot to identify cell type-specific markers
Pro Tip
Start with the broad resolution to get an overview of major cell types, then switch to fine resolution to explore detailed subtypes. Use the example links on the homepage to quickly access specific cell types like Memory CD8 T cells.
Marker Comparison
The Cell Type Markers section allows you to explore both known and dataset-derived marker genes for specific immune cell types.
Selection Options
To begin, choose the following from the sidebar:
- Cell Type Resolution: Broad or Fine
- Cell Type: Select the specific population of interest
- Genes: (Optional) Enter a comma-separated list of genes to filter by
Click "Filter Marker Tables" to update the results based on your selection.
Marker Genes in Dataset
The first output is a marker gene table derived from our PBMC dataset. This table displays:
- Gene: The marker gene
- Cell Type: The target immune cell type
- logFC: Log2 fold change compared to other cell types
- Adjusted p-value: FDR-corrected significance value
These represent high-confidence markers specific to the selected cell type.
Cell Type WordCloud
Below the marker table, a WordCloud visualizes cell types that are frequently and strongly associated with the selected genes. Larger font size indicates stronger or more frequent association.
Gene × Database Summary
The second table provides a summary matrix showing whether each gene appears in major cell marker databases:
- ✓ = Gene is listed in the database
- ✗ = Gene is not listed
Columns include:
- Gene
- CellMarker
- Human Protein Atlas (HPA)
- MSigDB Immune
- PanglaoDB
This table enables cross-referencing of dataset-derived markers with curated resources.
Database Annotations
The third and final table shows detailed database annotations for your selected genes:
- Gene
- Database: e.g., MSigDB Immune, CellMarker, etc.
- Indicated Cell Type: The associated immune cell type from the database
This helps validate candidate markers or identify discrepancies between datasets and public annotations.
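The relationship between the annotation table and the gene × database summary can be sketched as follows: each annotation row (gene, database, cell type) marks one cell of the matrix as "listed". The gene names and rows below are hypothetical placeholders, and the function is an illustration rather than PBMCpedia's own implementation.

```python
DATABASES = ("CellMarker", "Human Protein Atlas (HPA)", "MSigDB Immune", "PanglaoDB")


def presence_matrix(annotations, databases=DATABASES):
    """Build {gene: {database: bool}} from (gene, database, cell_type) rows."""
    seen = {}
    for gene, database, _cell_type in annotations:
        seen.setdefault(gene, set()).add(database)
    return {
        gene: {db: db in listed for db in databases}
        for gene, listed in sorted(seen.items())
    }


# Hypothetical annotation rows, for illustration only.
rows = [
    ("GENE_A", "CellMarker", "T cell"),
    ("GENE_A", "PanglaoDB", "T cell"),
    ("GENE_B", "MSigDB Immune", "B cell"),
]

for gene, flags in presence_matrix(rows).items():
    print(gene, ["✓" if flags[db] else "✗" for db in DATABASES])
```

Genes with no annotation row at all do not appear in the result, so a real summary would need to add them with every column set to "not listed".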
Gene Exploration
The Gene Exploration page allows you to investigate the expression patterns of specific genes across different cell types and biological conditions. This tool is perfect for validating gene expression patterns, exploring cell type-specific markers, or investigating genes of interest.
Gene Selection
To begin your analysis:
- Enter Gene Names: Type gene symbols (e.g., CD3D, CD4, IL2) in the text area, one per line or comma-separated (maximum 10 genes)
- Select Splitting Options: Choose how to split the data by checking "Disease" and/or "Sex"
- Click Generate Plot: The system will create interactive visualizations
Expression Visualization
The analysis generates an interactive heatmap visualization:
Heatmap
- X-axis: Cell types (broad resolution)
- Y-axis: Selected genes
- Color intensity: Gene expression level (red = high, blue = low)
- Interactive features: Hover to see exact expression values
- Data splitting: Can be split by disease and/or sex based on your selection
Use Cases
- Marker Validation: Verify that known cell type markers are expressed in expected cell types
- Gene Discovery: Explore expression patterns of novel genes
- Comparative Studies: Compare expression of related genes across cell types
Pro Tip
Use the example links on the homepage to quickly explore common immune genes like CD3D, CD4, or IL2. These examples demonstrate typical expression patterns across PBMC cell types.
TCR/BCR (V(D)J) Analysis
The TCR/BCR (V(D)J) Dashboard provides comprehensive analysis of T-cell and B-cell receptor repertoires, including clonotype diversity, V/J gene usage, and top clones. This tool is essential for understanding adaptive immune responses and clonal expansion patterns.
Sample Selection
The dashboard allows you to:
- Select Specific Samples: Choose individual samples from the dropdown to analyze their TCR/BCR repertoires
- View All Samples: Select "All samples" to see aggregated statistics across the entire dataset
- Sample Metadata: When a specific sample is selected, metadata information is displayed including study, project, disease, sex, and age
Key Metrics & Visualizations
Key Performance Indicators (KPIs)
- Total Cells: Number of cells with TCR/BCR data
- Total Samples: Number of samples analyzed
V Gene Usage
- Bar Chart: Shows the frequency of V gene usage across the selected sample(s)
- Locus Filtering: Filter by specific loci (TRA, TRB, IGH, IGK, IGL)
- Top 10 Display: Focuses on the most frequently used V genes
J Gene Usage
- Bar Chart: Displays J gene usage patterns
- Locus Filtering: Same filtering options as V gene usage
- Comparative Analysis: Compare J gene usage between samples
CDR3 Length Distribution
- Histogram: Shows the distribution of CDR3 sequence lengths
- Sample Comparison: Compare CDR3 length patterns between samples
- Biological Insights: CDR3 length can indicate T-cell maturation and selection
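The histogram described above is simply a count of sequences per CDR3 length. A minimal sketch, using made-up amino acid sequences rather than clones from the dataset:

```python
from collections import Counter


def cdr3_length_distribution(cdr3_sequences):
    """Count how many CDR3 amino acid sequences have each length."""
    lengths = Counter(len(seq) for seq in cdr3_sequences if seq)
    return dict(sorted(lengths.items()))


# Illustrative sequences only; not taken from PBMCpedia.
example = ["CASSLGQGYEQYF", "CASSPGTGGYEQYF", "CAVRDSNYQLIW", "CASSLAGGYEQYF"]
print(cdr3_length_distribution(example))
```

Empty entries are skipped so that cells without a called CDR3 do not show up as a length-zero bar.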
Chain Pairing
- Pairing Matrix: Shows the frequency of αβ chain pairings in T cells
- Diversity Assessment: Helps understand TCR diversity and clonal expansion
Top Clonotypes
The Top Clonotypes table provides detailed information about the most frequent TCR/BCR clones:
- Clone ID: Unique identifier for each clonotype
- Cell Count: Number of cells with this specific clonotype
- CDR3 Sequence: The actual CDR3 amino acid sequence
- View Chains: Click to see detailed chain information for the clonotype
When a specific sample is selected, the table is headed "What are the top TCR clones in sample [sample_name]?", indicating which sample's data you are viewing.
Use Cases
- Immune Response Analysis: Study clonal expansion during infection or vaccination
- Disease Research: Compare TCR/BCR repertoires between healthy and diseased states
- Sample Quality Control: Assess the quality and diversity of TCR/BCR data
- Comparative Studies: Compare receptor usage patterns across different samples or conditions
Pro Tip
Use the example link on the homepage to quickly explore TCR clones in a specific sample. The sample metadata box provides important context about the biological sample you're analyzing.
Surface Proteins Analysis
The Surface Proteins page allows you to explore protein expression patterns across different diseases and cell types. This tool is particularly useful for understanding cell surface marker expression and identifying potential therapeutic targets.
Protein Selection
To begin your analysis:
- Select Protein: Choose a surface protein from the dropdown menu (e.g., CD56, CD3, CD4)
- Select Cell Type: Choose a specific cell type to focus on, or select "All" for overall expression
- View Results: The plot automatically updates to show expression patterns
Expression Visualization
The analysis generates an interactive bar chart showing:
- X-axis: Disease conditions (e.g., COVID-19, Healthy, Alzheimer's disease)
- Y-axis: Mean protein expression level
- Bars: Ordered by expression level (highest to lowest)
- Interactive features: Hover over bars to see exact expression values
Available Proteins
The dataset includes a comprehensive set of surface proteins commonly used in immunology research:
- CD Markers: CD3, CD4, CD8, CD56, CD19, CD14, etc.
- Activation Markers: CD25, CD69, CD38, etc.
- Co-stimulatory Molecules: CD28, CD80, CD86, etc.
- Adhesion Molecules: CD11a, CD11b, CD11c, etc.
Use Cases
- Biomarker Discovery: Identify proteins that are differentially expressed in specific diseases
- Therapeutic Target Identification: Find surface proteins that could be targeted for therapy
- Cell Type Characterization: Understand the surface protein profile of different cell types
- Disease Comparison: Compare protein expression patterns across different disease conditions
Pro Tip
Use the example link on the homepage to quickly explore CD56 expression in NK cells. This demonstrates how surface protein expression can vary across different cell types and disease conditions.
Cell–Cell Communication
This section will outline how to interpret ligand–receptor interaction results, including summary plots and per-pair tables.
Coming soon: example analyses and visuals.
Downloads
In the Downloads section, you can access all major data outputs and resources generated from the PBMCpedia project. These files support further analysis, benchmarking, or model development in your own research environment.
1. Whole Dataset (.h5ad)
Download the complete single-cell PBMC dataset in AnnData (.h5ad) format (~160.8 GB), including:
- Quality-controlled and filtered cells
- Gene expression matrix (normalized, log-transformed, and scaled)
- Cell type annotations (multiple resolutions)
- Sample and donor metadata
- Harmony-integrated embeddings for batch correction
This dataset is ready for downstream analysis in Python tools like Scanpy or scvi-tools.
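Because the full file is far larger than typical workstation memory, one option is to open it in backed mode so that the expression matrix stays on disk. The sketch below uses `anndata.read_h5ad` with `backed="r"`; the file name in the usage comment is hypothetical, and the `obs` columns you will find depend on how the file was written.

```python
from pathlib import Path


def open_atlas(path):
    """Open an .h5ad file read-only without loading the matrix into memory."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No .h5ad file at {path}")
    import anndata as ad  # imported lazily; requires the anndata package
    return ad.read_h5ad(path, backed="r")


def summarize(adata):
    """Collect basic dimensions and the available per-cell metadata columns."""
    return {
        "n_cells": adata.n_obs,
        "n_genes": adata.n_vars,
        "obs_columns": list(adata.obs.columns),
    }


# Usage (hypothetical local file name):
# atlas = open_atlas("pbmcpedia_full.h5ad")
# print(summarize(atlas))
```

Inspecting `obs_columns` first is a cheap way to find the annotation and metadata fields before subsetting to the cells you need.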
2. Pseudobulks (.h5ad)
Download pseudo-bulk versions of the dataset where gene expression is aggregated per cell type:
- Broad resolution (~245 MB) - Aggregated by broad cell type categories
- Fine resolution (~287.3 MB) - Aggregated by fine-grained cell type annotations
This format is useful for bulk-like differential expression, pathway enrichment, or machine learning on summarized profiles.
3. Analysis Results (.csv)
Download pre-computed analysis results:
- DEG Results (~20.1 KB) - Differential gene expression results comparing disease vs. control across cell types
- Pathway Enrichment (~906.5 KB) - Pathway enrichment summary across cell types/contrasts with core stats (NES/FDR)
Ideal for downstream filtering, visualization, or integration with gene set databases.
4. VDJ Sequencing Data
Download TCR/BCR repertoire data:
- VDJ AIRR Data - TCR/BCR sequencing data in AIRR format (parquet format for efficient storage)
- VDJ Cell Metadata - VDJ cell metadata and annotations (compressed CSV format)
5. Protein/ADT Data (.csv)
Download protein expression data:
- Protein Means by Disease - Mean protein expression by disease condition
All resources are made available to support transparent, reproducible, and scalable research on immune cells in health and disease.
API
PBMCpedia provides a RESTful API for programmatic access to its data and analysis results. This section outlines the available endpoints and how to use them.
API Endpoints
In production, all endpoints are served under the base path /pbmcpedia. Main endpoints:
- /pbmcpedia/api/v1/degs/ - Differential gene expression
- /pbmcpedia/api/v1/pathways/ - Pathway enrichment
- /pbmcpedia/api/v1/metadata/ - Sample metadata
- /pbmcpedia/api/v1/gene_expr_celltype/ - Gene expression per cell type
- /pbmcpedia/api/v1/marker_table_ds/ - Dataset-specific marker genes
- /pbmcpedia/api/v1/protein_list/ - Available proteins/ADT markers
- /pbmcpedia/api/v1/adt_summary_api/ - ADT summary statistics
- /pbmcpedia/api/v1/chains/by_clone - TCR/BCR chain info for clones
- /pbmcpedia/api/v1/ping/ - API health check
Example Requests
Below are example requests using curl and Python (using the requests library).
curl Examples
To get differential gene expression results:
curl -G 'https://web.ccb.uni-saarland.de/pbmcpedia/api/v1/degs/' \
--data-urlencode 'cell_type=CD4+ T cell' \
--data-urlencode 'disease=COVID-19' \
--data-urlencode 'limit=10' \
--data-urlencode 'offset=0'
To get pathway enrichment results:
curl -G 'https://web.ccb.uni-saarland.de/pbmcpedia/api/v1/pathways/' \
--data-urlencode 'cell_type=CD4+ T cell' \
--data-urlencode 'resolution=broad' \
--data-urlencode 'limit=10'
To get sample metadata:
curl -G 'https://web.ccb.uni-saarland.de/pbmcpedia/api/v1/metadata/' \
--data-urlencode 'study=PRJNA1040889' \
--data-urlencode 'limit=10'
Health check:
curl 'https://web.ccb.uni-saarland.de/pbmcpedia/api/ping/'
Python Example
Using the requests library:
import requests
base_url = "https://web.ccb.uni-saarland.de/pbmcpedia"
# DEGs
resp = requests.get(f"{base_url}/api/v1/degs/", params={
    "cell_type": "CD4+ T cell",
    "disease": "COVID-19",
    "limit": 10,
})
print(resp.json())
# Pathways
resp = requests.get(f"{base_url}/api/v1/pathways/", params={
    "cell_type": "CD4+ T cell",
    "resolution": "broad",
    "limit": 10,
})
print(resp.json())
# Metadata
resp = requests.get(f"{base_url}/api/v1/metadata/", params={
    "study": "PRJNA1040889",
    "limit": 10,
})
print(resp.json())
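If you build request URLs by hand, note that the `+` in a value such as `CD4+ T cell` has to be percent-encoded, which is what `--data-urlencode` and the `params` argument do for you. The sketch below shows the encoding with the standard library and generates `limit`/`offset` parameters for stepping through pages. The helper names are hypothetical, the URL pattern assumes endpoints that end in a trailing slash, and `offset` is only shown for the DEG endpoint in the examples above, so check the API docs before relying on it elsewhere.

```python
from urllib.parse import urlencode

BASE_URL = "https://web.ccb.uni-saarland.de/pbmcpedia"


def build_url(endpoint, **params):
    """Return a fully encoded URL for an /api/v1/ endpoint."""
    return f"{BASE_URL}/api/v1/{endpoint.strip('/')}/?{urlencode(params)}"


def page_params(limit, n_pages, **filters):
    """Yield one parameter dict per page, advancing offset by limit."""
    for page in range(n_pages):
        yield {**filters, "limit": limit, "offset": page * limit}


url = build_url("degs", cell_type="CD4+ T cell", disease="COVID-19", limit=10, offset=0)
print(url)

for params in page_params(10, 3, cell_type="CD4+ T cell"):
    print(params["offset"])
```

Each dict from `page_params` can be passed directly as `params=` to `requests.get`; adding a `timeout` and calling `resp.raise_for_status()` before `resp.json()` is a sensible precaution in scripts.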
Tips
- Most endpoints support query parameters for filtering (e.g., cell_type, disease, study)
- Use the /api/ping/ endpoint to check if the API is available
- For large datasets, consider using the download endpoints instead of API calls
- All API responses are in JSON format
- API docs: https://web.ccb.uni-saarland.de/pbmcpedia/api/docs/