CellTypist¶
Cell types classify cells based on public and private knowledge from studying transcription, morphology, function & other properties. Established cell types have well-characterized markers and properties; however, cell subtypes and states are continuously being discovered, refined and better understood.
In this notebook, we register the immune cell type vocabulary from CellTypist, a computational tool used for cell type classification in scRNA-seq data.
In the following Standardize metadata on-the-fly notebook, we’ll demonstrate how to curate datasets analyzed with CellTypist enrichment analysis and track the dataset with LaminDB.
# pip install 'lamindb[jupyter,bionty]'
!lamin load use-cases-registries
Entity has to be a laminhub URL or 'artifact' or 'transform'
# filter warnings from celltypist
import warnings
warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")
import lamindb as ln
import bionty as bt
→ connected lamindb: testuser1/use-cases-registries
Access CellTypist records¶
As a first step, we read in CellTypist’s immune cell encyclopedia:
import pandas as pd
description = "CellTypist Pan Immune Atlas v2: basic cell type information"
celltypist_source_v2_url = "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx"
celltypist_df = pd.read_excel(celltypist_source_v2_url)
It provides an ontology_id of the public Cell Ontology for the majority of records.
celltypist_df.head()
| | High-hierarchy cell types | Low-hierarchy cell types | Description | Cell Ontology ID | Curated markers |
|---|---|---|---|---|---|
| 0 | B cells | B cells | B lymphocytes with diverse cell surface immuno... | CL:0000236 | CD79A, MS4A1, CD19 | 
| 1 | B cells | Follicular B cells | resting mature B lymphocytes found in the prim... | CL:0000843 | CXCR5, TNFRSF13B, CD22 | 
| 2 | B cells | Proliferative germinal center B cells | proliferating germinal center B cells | CL:0000844 | MKI67, SUGCT, AICDA | 
| 3 | B cells | Germinal center B cells | proliferating mature B cells that undergo soma... | CL:0000844 | POU2AF1, CD40, SUGCT | 
| 4 | B cells | Memory B cells | long-lived mature B lymphocytes which are form... | CL:0000787 | CR2, CD27, MS4A1 | 
A single “Cell Ontology ID” can be associated with multiple “Low-hierarchy cell types”:
celltypist_df.set_index(["Cell Ontology ID", "Low-hierarchy cell types"]).head(10)
| Cell Ontology ID | Low-hierarchy cell types | High-hierarchy cell types | Description | Curated markers |
|---|---|---|---|---|
| CL:0000236 | B cells | B cells | B lymphocytes with diverse cell surface immuno... | CD79A, MS4A1, CD19 |
| CL:0000843 | Follicular B cells | B cells | resting mature B lymphocytes found in the prim... | CXCR5, TNFRSF13B, CD22 |
| CL:0000844 | Proliferative germinal center B cells | B cells | proliferating germinal center B cells | MKI67, SUGCT, AICDA |
| | Germinal center B cells | B cells | proliferating mature B cells that undergo soma... | POU2AF1, CD40, SUGCT |
| CL:0000787 | Memory B cells | B cells | long-lived mature B lymphocytes which are form... | CR2, CD27, MS4A1 |
| | Age-associated B cells | B cells | CD11c+ T-bet+ memory B cells associated with a... | FCRL2, ITGAX, TBX21 |
| CL:0000788 | Naive B cells | B cells | mature B lymphocytes which express cell-surfac... | IGHM, IGHD, TCL1A |
| CL:0000818 | Transitional B cells | B cells | immature B cell precursors in the bone marrow ... | CD24, MYO1C, MS4A1 |
| CL:0000817 | Large pre-B cells | B-cell lineage | proliferative B lymphocyte precursors derived ... | MME, CD24, MKI67 |
| | Small pre-B cells | B-cell lineage | non-proliferative B lymphocyte precursors deri... | MME, CD24, IGLL5 |
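The one-to-many relationship visible in the table can also be checked programmatically. The following is a minimal sketch in plain Python that uses only the ten rows shown above; the variable names (`rows`, `names_by_id`, `shared`) are illustrative and not part of CellTypist or LaminDB.

```python
from collections import defaultdict

# (Cell Ontology ID, low-hierarchy cell type) pairs copied from the table above
rows = [
    ("CL:0000236", "B cells"),
    ("CL:0000843", "Follicular B cells"),
    ("CL:0000844", "Proliferative germinal center B cells"),
    ("CL:0000844", "Germinal center B cells"),
    ("CL:0000787", "Memory B cells"),
    ("CL:0000787", "Age-associated B cells"),
    ("CL:0000788", "Naive B cells"),
    ("CL:0000818", "Transitional B cells"),
    ("CL:0000817", "Large pre-B cells"),
    ("CL:0000817", "Small pre-B cells"),
]

# collect all CellTypist names that share one ontology ID
names_by_id = defaultdict(list)
for ontology_id, name in rows:
    names_by_id[ontology_id].append(name)

# keep only the IDs that map to more than one CellTypist name
shared = {oid: names for oid, names in names_by_id.items() if len(names) > 1}
for oid, names in shared.items():
    print(oid, names)
```

On the full DataFrame, a pandas group-by such as `celltypist_df.groupby("Cell Ontology ID")["Low-hierarchy cell types"].apply(list)` should give the equivalent grouping.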
Validate CellTypist records¶
For any cell type record that can be validated against the public Cell Ontology, we’d like to ensure that it actually is validated.
This prevents us from referring to the same cell type with different identifiers.
We need a Bionty object for this:
bionty = bt.CellType.public()
bionty
PublicOntology
Entity: CellType
Organism: all
Source: cl, 2024-08-16
#terms: 2959
We can now validate the "Cell Ontology ID" column:
bionty.inspect(celltypist_df["Cell Ontology ID"], bionty.ontology_id);
This looks good!
But when inspecting the names, most of them don’t validate:
bionty.inspect(celltypist_df["Low-hierarchy cell types"], bionty.name);
! 97 unique terms (99.00%) are not validated for name: 'B cells', 'Follicular B cells', 'Proliferative germinal center B cells', 'Germinal center B cells', 'Memory B cells', 'Age-associated B cells', 'Naive B cells', 'Transitional B cells', 'Large pre-B cells', 'Small pre-B cells', ...
   detected 6 unique terms with synonyms: DC1, DC2, ETP, ILC2, ILC3, pDC
→  standardize terms via .standardize()
A search tells us that terms named in the plural in CellTypist occur under a singular name in the Cell Ontology:
celltypist_df["Low-hierarchy cell types"][0]
'B cells'
bionty.search(celltypist_df["Low-hierarchy cell types"][0]).head(2)
| ontology_id | name | definition | synonyms | parents | __agg__ |
|---|---|---|---|---|---|
| CL:0000156 | obsolete antibody secreting cell | Obsolete: A Cell Of The Lymphoid Series That C... | None | [] | obsolete antibody secreting cell | 
| CL:0000432 | reticular cell | A Fibroblast That Synthesizes Collagen And Use... | reticulum cell | [CL:0000057] | reticular cell | 
Let’s strip the trailing "s" and check whether more names now validate. Yes, a few more do:
bionty.inspect(
    [i.rstrip("s") for i in celltypist_df["Low-hierarchy cell types"]],
    bionty.name,
);
! 93 unique terms (94.90%) are not validated for name: 'Follicular B cell', 'Proliferative germinal center B cell', 'Germinal center B cell', 'Memory B cell', 'Age-associated B cell', 'Naive B cell', 'Transitional B cell', 'Large pre-B cell', 'Small pre-B cell', 'Pre-pro-B cell', ...
   detected 35 unique terms with inconsistent casing/synonyms: Follicular B cell, Germinal center B cell, Memory B cell, Naive B cell, Transitional B cell, Small pre-B cell, Pro-B cell, Cycling B cell, Cycling gamma-delta T cell, Cycling monocyte, ...
→  standardize terms via .standardize()
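Note that `str.rstrip("s")` removes every trailing "s", not just one, and that many of the remaining mismatches come from casing. The sketch below shows a slightly stricter variant in plain Python: it removes exactly one trailing "s" with `str.removesuffix` (Python 3.9+) and then matches case-insensitively. The helpers `singularize` and `match_case_insensitive` and the short `reference_names` list are hypothetical and only illustrate the idea; they are not part of Bionty.

```python
def singularize(term: str) -> str:
    """Remove exactly one trailing "s" (requires Python 3.9+)."""
    return term.removesuffix("s")


def match_case_insensitive(term, reference_names):
    """Return the reference name equal to `term` ignoring case, else None."""
    lookup = {name.lower(): name for name in reference_names}
    return lookup.get(term.lower())


# a few standardized names that appear in this notebook
reference_names = ["B cell", "follicular B cell", "memory B cell"]

for raw in ["B cells", "Follicular B cells", "Memory B cells"]:
    print(raw, "->", match_case_insensitive(singularize(raw), reference_names))

# rstrip removes all trailing "s" characters, removesuffix removes only one
print("class".rstrip("s"), "class".removesuffix("s"))
```

In practice, the `.standardize()` method suggested in the output above is the intended way to resolve casing and synonyms; the sketch only makes the string handling explicit.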
Every “low-hierarchy cell type” has an ontology ID, and most “high-hierarchy cell types” also appear as “low-hierarchy cell types” in the CellTypist table. Four, however, don’t, and therefore don’t have an ontology ID.
high_terms = celltypist_df["High-hierarchy cell types"].unique()
low_terms = celltypist_df["Low-hierarchy cell types"].unique()
high_terms_nonval = set(high_terms).difference(low_terms)
high_terms_nonval
{'B-cell lineage', 'Cycling cells', 'Erythroid', 'T cells'}
Register CellTypist records¶
Let’s first add the “High-hierarchy cell types” as a column "parent".
This enables LaminDB to populate the parents and children fields, which will enable you to query for hierarchical relationships.
celltypist_df["parent"] = celltypist_df.pop("High-hierarchy cell types")
# if high and low terms are the same, no parents
celltypist_df.loc[
    (celltypist_df["parent"] == celltypist_df["Low-hierarchy cell types"]), "parent"
] = None
# rename columns, drop markers
celltypist_df.drop(columns=["Curated markers"], inplace=True)
celltypist_df.rename(
    columns={"Low-hierarchy cell types": "ct_name", "Cell Ontology ID": "ontology_id"},
    inplace=True,
)
celltypist_df.columns = celltypist_df.columns.str.lower()
# add the standardized name for each ontology_id
celltypist_df["name"] = bionty.df().loc[celltypist_df["ontology_id"]].name.values
celltypist_df.head(2)
| | ct_name | description | ontology_id | parent | name |
|---|---|---|---|---|---|
| 0 | B cells | B lymphocytes with diverse cell surface immuno... | CL:0000236 | None | B cell | 
| 1 | Follicular B cells | resting mature B lymphocytes found in the prim... | CL:0000843 | B cells | follicular B cell | 
Now, let’s create records from the public ontology:
public_records = bt.CellType.from_values(
    celltypist_df.ontology_id, bt.CellType.ontology_id
)
ln.save(public_records)
Let’s now amend the public ontology records so that they retain the additional annotations that CellTypist provides.
from lamindb.core.exceptions import ValidationError
public_records_dict = {r.ontology_id: r for r in public_records}
for _, row in celltypist_df.iterrows():
    record = public_records_dict[row["ontology_id"]]
    try:
        record.add_synonym(row["ct_name"])
    except ValidationError:  # do nothing if the synonym already exists as a record
        pass
✗ input synonyms ['DC2'] already associated with the following records:
| | _branch_code | space_id | _aux | created_at | created_by_id | run_id | updated_at | source_id | id | uid | name | ontology_id | abbr | synonyms | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | None | 2025-03-10 13:28:13.370000+00:00 | 1 | None | 2025-03-10 13:28:13.370000+00:00 | 32 | 92 | 3JO0EdVd | plasmacytoid dendritic cell | CL:0000784 | None | plasmacytoid monocyte|interferon-producing cel... | A Dendritic Cell Type Of Distinct Morphology, ... | 
✗ input synonyms ['ILC2'] already associated with the following records:
| | _branch_code | space_id | _aux | created_at | created_by_id | run_id | updated_at | source_id | id | uid | name | ontology_id | abbr | synonyms | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | None | 2025-03-10 13:28:13.370000+00:00 | 1 | None | 2025-03-10 13:28:13.370000+00:00 | 32 | 114 | 4ny4oBnr | group 2 innate lymphoid cell | CL:0001069 | None | ILC2|natural helper cell|nuocyte | An Innate Lymphoid Cell That Is Capable Of Pro... | 
✗ input synonyms ['ILC3'] already associated with the following records:
| | _branch_code | space_id | _aux | created_at | created_by_id | run_id | updated_at | source_id | id | uid | name | ontology_id | abbr | synonyms | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | None | 2025-03-10 13:28:13.370000+00:00 | 1 | None | 2025-03-10 13:28:13.370000+00:00 | 32 | 115 | 3tILnbqv | group 3 innate lymphoid cell | CL:0001071 | None | ILC3 | An Innate Lymphoid Cell That Constituitively E... | 
✗ input synonyms ['pDC'] already associated with the following records:
| | _branch_code | space_id | _aux | created_at | created_by_id | run_id | updated_at | source_id | id | uid | name | ontology_id | abbr | synonyms | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | None | 2025-03-10 13:28:13.370000+00:00 | 1 | None | 2025-03-10 13:28:13.370000+00:00 | 32 | 92 | 3JO0EdVd | plasmacytoid dendritic cell | CL:0000784 | None | plasmacytoid monocyte|interferon-producing cel... | A Dendritic Cell Type Of Distinct Morphology, ... | 
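The output above shows that synonyms are stored as one pipe-separated string per record (for example `ILC2|natural helper cell|nuocyte`) and that a synonym already associated with a record is rejected. The following plain-Python sketch mimics that behavior under simplifying assumptions. It is not LaminDB’s implementation: `SynonymConflict`, `add_synonym`, and the `records` dictionary are hypothetical stand-ins.

```python
class SynonymConflict(Exception):
    """Raised when a synonym is already associated with a record (hypothetical)."""


def add_synonym(records, name, synonym):
    """Append `synonym` to the pipe-separated synonyms stored in `records[name]`."""
    for other, syn_string in records.items():
        if synonym in syn_string.split("|"):
            raise SynonymConflict(f"{synonym!r} already associated with {other!r}")
    current = [s for s in records[name].split("|") if s]
    records[name] = "|".join(current + [synonym])


# record name -> pipe-separated synonyms, using values shown in this notebook
records = {
    "group 2 innate lymphoid cell": "ILC2|natural helper cell|nuocyte",
    "B cell": "",
}

# same pattern as in the loop above: try to add, skip on conflict
conflicts = []
for synonym in ["B cells", "ILC2"]:
    try:
        add_synonym(records, "B cell", synonym)
    except SynonymConflict:
        conflicts.append(synonym)

print(records["B cell"], conflicts)
```

Catching the exception and moving on, as the notebook does with `ValidationError`, keeps the loop running while leaving existing associations untouched.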
Add parent-child relationships of the records from CellTypist¶
We still need to add the remaining 4 high-hierarchy terms:
list(high_terms_nonval)
['Erythroid', 'T cells', 'B-cell lineage', 'Cycling cells']
Let’s get the top hits from a search:
for term in list(high_terms_nonval):
    print(f"Term: {term}")
    display(bionty.search(term).head(2))
Term: Erythroid
| ontology_id | name | definition | synonyms | parents | __agg__ |
|---|---|---|---|---|---|
| CL:0002000 | Kit-positive erythroid progenitor cell | An Erythroid Progenitor Cell Is Kit-Positive, ... | c- Kit-positive erythroid progenitor cell | [CL:0001066] | kit-positive erythroid progenitor cell | 
| CL:0000038 | erythroid progenitor cell | A Progenitor Cell Committed To The Erythroid L... | None | [CL:0000839, CL:0000764] | erythroid progenitor cell | 
Term: T cells
| ontology_id | name | definition | synonyms | parents | __agg__ |
|---|---|---|---|---|---|
| CL:0000145 | professional antigen presenting cell | A Cell Capable Of Processing And Presenting Li... | None | [CL:0000738] | professional antigen presenting cell | 
| CL:0000432 | reticular cell | A Fibroblast That Synthesizes Collagen And Use... | reticulum cell | [CL:0000057] | reticular cell | 
Term: B-cell lineage
| ontology_id | name | definition | synonyms | parents | __agg__ |
|---|---|---|---|---|---|
Term: Cycling cells
| ontology_id | name | definition | synonyms | parents | __agg__ |
|---|---|---|---|---|---|
So we decide to:
- Add “T cells” to the synonyms of the public “T cell” record
- Add “Erythroid” to the synonyms of the public “erythroid lineage cell” record
- Create the remaining 2 terms using only their names (we think “B-cell lineage” shouldn’t be identified with “B cell”)
for name in high_terms_nonval:
    if name == "T cells":
        record = bt.CellType.from_source(name="T cell")
        record.add_synonym(name)
        record.save()
    elif name == "Erythroid":
        record = bt.CellType.from_source(name="erythroid lineage cell")
        record.add_synonym(name)
        record.save()
    else:
        record = bt.CellType(name=name)
        record.save()
high_terms_nonval
{'B-cell lineage', 'Cycling cells', 'Erythroid', 'T cells'}
bt.CellType(name="B-cell lineage").save()
→ returning existing CellType record with same name: 'B-cell lineage'
CellType(uid='5gxL2SWr', name='B-cell lineage', space_id=1, created_by_id=1, created_at=2025-03-10 13:28:15 UTC)
Now let’s add the parent records:
celltypist_df["parent"] = bt.CellType.standardize(celltypist_df["parent"])
for _, row in celltypist_df.iterrows():
    record = public_records_dict[row["ontology_id"]]
    if row["parent"] is not None:
        parent_record = bt.CellType.get(name=row["parent"])
        record.parents.add(parent_record)
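To make the resulting hierarchy concrete, here is a minimal plain-Python sketch of the same idea on a handful of rows from the CellTypist table: a term whose high- and low-hierarchy names coincide gets no parent, and all other rows contribute one parent-child link. The names `pairs` and `children_of` are illustrative, and the sketch does not touch the LaminDB registry.

```python
from collections import defaultdict

# (low-hierarchy, high-hierarchy) pairs copied from the CellTypist table above
pairs = [
    ("B cells", "B cells"),
    ("Follicular B cells", "B cells"),
    ("Memory B cells", "B cells"),
    ("Large pre-B cells", "B-cell lineage"),
    ("Small pre-B cells", "B-cell lineage"),
]

# parent name -> list of child names
children_of = defaultdict(list)
for child, parent in pairs:
    if child == parent:
        continue  # a term is not its own parent
    children_of[parent].append(child)

print(dict(children_of))
```

In the registry itself, the same links are what `record.parents.add(...)` stores, which is why parents and children can be queried afterwards.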
Access the registry¶
The previously added CellTypist ontology registry is now available in LaminDB.
To retrieve the full ontology table as a pandas DataFrame, we can use .df():
bt.CellType.df()
| id | uid | name | ontology_id | abbr | synonyms | description | space_id | source_id | run_id | created_at | created_by_id | _aux | _branch_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 139 | 5jshKSVL | Cycling cells | None | None | None | None | 1 | NaN | None | 2025-03-10 13:28:15.246000+00:00 | 1 | None | 1 | 
| 138 | 5gxL2SWr | B-cell lineage | None | None | None | None | 1 | NaN | None | 2025-03-10 13:28:15.242000+00:00 | 1 | None | 1 | 
| 69 | 4bKGljt0 | cell | CL:0000000 | None | None | A Material Entity Of Anatomical Origin (Part O... | 1 | 32.0 | None | 2025-03-10 13:28:13.370000+00:00 | 1 | None | 1 | 
| 70 | 4y4o4m6R | blood cell | CL:0000081 | None | None | A Cell Found Predominately In The Blood. | 1 | 32.0 | None | 2025-03-10 13:28:13.370000+00:00 | 1 | None | 1 | 
| 71 | 6Sq9ZVSG | professional antigen presenting cell | CL:0000145 | None | None | A Cell Capable Of Processing And Presenting Li... | 1 | 32.0 | None | 2025-03-10 13:28:13.370000+00:00 | 1 | None | 1 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 25 | 6YazXirC | thymocyte | CL:0000893 | None | ETP|thymic lymphocyte | An Immature T Cell Located In The Thymus. | 1 | 32.0 | None | 2025-03-10 13:28:12.964000+00:00 | 1 | None | 1 | 
| 26 | zQ4dyjEs | fibroblast | CL:0000057 | None | Fibroblasts | A Connective Tissue Cell Which Secretes An Ext... | 1 | 32.0 | None | 2025-03-10 13:28:12.964000+00:00 | 1 | None | 1 | 
| 27 | bgoqqGYM | granulocyte | CL:0000094 | None | polymorphonuclear leukocyte|granular leucocyte... | A Leukocyte With Abundant Granules In The Cyto... | 1 | 32.0 | None | 2025-03-10 13:28:12.964000+00:00 | 1 | None | 1 | 
| 28 | 6rfrjhvo | neutrophil | CL:0000775 | None | neutrocyte|neutrophilic leucocyte|neutrophil l... | Any Of The Immature Or Mature Forms Of A Granu... | 1 | 32.0 | None | 2025-03-10 13:28:12.964000+00:00 | 1 | None | 1 | 
| 29 | 1HNi1cpn | common myeloid progenitor | CL:0000049 | None | common myeloid precursor|CMP | A Progenitor Cell Committed To Myeloid Lineage... | 1 | 32.0 | None | 2025-03-10 13:28:12.964000+00:00 | 1 | None | 1 | 
100 rows × 13 columns
This enables us to look for cell types by creating a lookup object from our new CellType registry.
db_lookup = bt.CellType.lookup()
db_lookup.memory_b_cell
CellType(uid='2cUPBtY8', name='memory B cell', ontology_id='CL:0000787', synonyms='memory B lymphocyte|Memory B cells|memory B-cell|memory B-lymphocyte|Age-associated B cells', description='A Memory B Cell Is A Mature B Cell That Is Long-Lived, Readily Activated Upon Re-Encounter Of Its Antigenic Determinant, And Has Been Selected For Expression Of Higher Affinity Immunoglobulin. This Cell Type Has The Phenotype Cd19-Positive, Cd20-Positive, Mhc Class Ii-Positive, And Cd138-Negative.', space_id=1, created_by_id=1, source_id=32, created_at=2025-03-10 13:28:12 UTC)
See cell type hierarchy:
db_lookup.memory_b_cell.view_parents()
Access parents of a record:
db_lookup.memory_b_cell.parents.list()
[CellType(uid='ryEtgi1y', name='B cell', ontology_id='CL:0000236', synonyms='B cells|B-cell|B lymphocyte|B-lymphocyte|Cycling B cells', description='A Lymphocyte Of B Lineage That Is Capable Of B Cell Mediated Immunity.', space_id=1, created_by_id=1, source_id=32, created_at=2025-03-10 13:28:12 UTC),
 CellType(uid='71xItrKo', name='mature B cell', ontology_id='CL:0000785', synonyms='mature B-cell|mature B lymphocyte|mature B-lymphocyte', description='A B Cell That Is Mature, Having Left The Bone Marrow. Initially, These Cells Are Igm-Positive And Igd-Positive, And They Can Be Activated By Antigen.', space_id=1, created_by_id=1, source_id=32, created_at=2025-03-10 13:28:13 UTC)]
Move on to the next registry: GO pathways