
Gene Ontology (GO)

Pathways represent interconnected molecular networks of signaling cascades that govern critical cellular processes. They provide insight into the mechanisms of cellular behavior, disease progression, and treatment response. In an R&D organization, managing pathways across different datasets is crucial for identifying potential therapeutic targets and intervention strategies.

In this notebook, we manage a pathway registry based on the “2023 GO Biological Process” ontology. We’ll walk through the steps of registering pathways and linking them to genes.

In the following Standardize metadata on-the-fly notebook, we’ll demonstrate how to perform a pathway enrichment analysis and track the dataset with LaminDB.

# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./use-cases-registries --schema bionty
 initialized lamindb: testuser1/use-cases-registries
import lamindb as ln
import bionty as bt
import gseapy as gp

bt.settings.organism = "human"  # globally set organism
 connected lamindb: testuser1/use-cases-registries

Fetch GO pathways annotated with human genes using Enrichr

First, we fetch the “GO_Biological_Process_2023” pathways for humans using GSEApy, which wraps GSEA and Enrichr.

go_bp = gp.get_library(name="GO_Biological_Process_2023", organism="Human")
print(f"Number of pathways {len(go_bp)}")
Number of pathways 5406
go_bp["ATF6-mediated Unfolded Protein Response (GO:0036500)"]
['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF']

Parse the ontology_id out of each key and convert the library into the format {ontology_id: (name, genes)}:

def parse_ontology_id_from_keys(key):
    """Parse out the ontology id.

    "ATF6-mediated Unfolded Protein Response (GO:0036500)" -> ("GO:0036500", "ATF6-mediated Unfolded Protein Response")
    """
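    # Split on the last " (" so that parentheses inside the name are preserved
    name, ontology_id = key.rsplit(" (", 1)
    ontology_id = ontology_id.rstrip(")")
    return ontology_id, name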
go_bp_parsed = {}

for key, genes in go_bp.items():
    ontology_id, name = parse_ontology_id_from_keys(key)
    go_bp_parsed[ontology_id] = (name, genes)
go_bp_parsed["GO:0036500"]
('ATF6-mediated Unfolded Protein Response',
 ['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF'])
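
Splitting on the last " (" assumes that every key ends with a GO ID in parentheses. A minimal sketch of a stricter variant follows, assuming GO IDs consist of `GO:` followed by seven digits. `parse_go_key` and `GO_KEY_PATTERN` are hypothetical names used for illustration and are not part of GSEApy or LaminDB:

```python
import re

# Assumed key layout: "<name> (GO:<7 digits>)", as in the Enrichr library keys above.
GO_KEY_PATTERN = re.compile(r"^(?P<name>.+) \((?P<ontology_id>GO:\d{7})\)$")


def parse_go_key(key):
    """Return (ontology_id, name), or raise ValueError if the key does not match."""
    match = GO_KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"Unexpected key format: {key!r}")
    return match.group("ontology_id"), match.group("name")


parsed = parse_go_key("ATF6-mediated Unfolded Protein Response (GO:0036500)")
print(parsed)
```

Raising on a malformed key makes a change in the upstream key format visible immediately, instead of letting a wrongly parsed ID reach the registry.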

Register pathway ontology in LaminDB

bionty = bt.Pathway.public()
bionty
PublicOntology
Entity: Pathway
Organism: all
Source: go, 2024-06-17
#terms: 47856

Next, we register all the pathways and genes in LaminDB to finally link pathways to genes.
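
Linking pathways to genes amounts to a many-to-many relation: one pathway contains many genes, and one gene can belong to many pathways. The following is a minimal sketch of that relation in plain Python, using a toy dictionary in the {ontology_id: (name, genes)} format. The second entry (its ID, name, and gene list) is a placeholder made up for illustration, and `genes_to_pathways` is a hypothetical helper, not a LaminDB function:

```python
from collections import defaultdict

# Toy data in the {ontology_id: (name, genes)} format.
# The first entry comes from the parsed library; the second is a made-up placeholder.
toy_parsed = {
    "GO:0036500": (
        "ATF6-mediated Unfolded Protein Response",
        ["MBTPS1", "MBTPS2", "XBP1", "ATF6B", "DDIT3", "CREBZF"],
    ),
    "GO:0000000": ("Placeholder Pathway", ["XBP1", "DDIT3"]),
}


def genes_to_pathways(parsed):
    """Invert {ontology_id: (name, genes)} into {gene_symbol: [ontology_id, ...]}."""
    index = defaultdict(list)
    for ontology_id, (_name, genes) in parsed.items():
        for symbol in genes:
            index[symbol].append(ontology_id)
    return dict(index)


gene_index = genes_to_pathways(toy_parsed)
print(gene_index["XBP1"])    # pathways containing XBP1
print(gene_index["MBTPS1"])  # pathways containing MBTPS1
```

Because the relation runs in both directions, both sides need to exist as records before any link can be stored, which is why pathways and genes are registered first.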

Register pathway terms

To register the pathways, we use .from_values to parse the annotated GO pathway ontology IDs directly into LaminDB.

pathway_records = bt.Pathway.from_values(go_bp_parsed.keys(), bt.Pathway.ontology_id)
ln.save(pathway_records)

Register gene symbols

Similarly, we use .from_values to register all pathway-associated genes with LaminDB.

all_genes = bt.Gene.standardize(list({g for genes in go_bp.values() for g in genes}))
gene_records = bt.Gene.from_values(all_genes, bt.Gene.symbol)
ln.save(gene_records);
! found 56 synonyms in Bionty (output truncated): [np.str_('C10ORF71'), np.str_('C10ORF90'), np.str_('ADGRF2'), np.str_('C1ORF131'), np.str_('C17ORF97'), np.str_('C1ORF112'), np.str_('SLC9A3R2'), np.str_('C15ORF62'), np.str_('DUSP13'), np.str_('C3ORF70'), '...']
  please add corresponding Gene records via (output truncated): `.from_values([np.str_('C10ORF71'), np.str_('C10ORF90'), np.str_('ADGRF2'), np.str_('C1ORF131'), np.str_('C17ORF97'), np.str_('C1ORF112'), np.str_('SLC9A3R2'), np.str_('C15ORF62'), np.str_('DUSP13'), np.str_('C3ORF70'), '...'])`
! ambiguous validation in Bionty for 1104 records: 'MYH11', 'SYNRG', 'LILRA4', 'CLDN23', 'PRB3', 'LGALS7', 'NPBWR2', 'NOBOX', 'TAS2R8', 'HHAT', 'DLL1', 'ABHD16A', 'SLC35A2', 'ZNF598', 'KIR2DL4', 'TNXB', 'IFI27L1', 'OR10J1', 'PCDHB16', 'H3-4', ...
! did not create Gene records for 38 non-validated symbols: 'AFD1', 'AZF1', 'CCL3L1', 'CCL4L1', 'DGS2', 'DUX3', 'DUX5', 'FOXL3-OT1', 'IGL', 'LOC100653049', 'LOC102723475', 'LOC102723996', 'LOC102724159', 'LOC107984156', 'LOC112268384', 'LOC122319436', 'LOC122513141', 'LOC122539214', 'LOC344967', 'MDRV', ...
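
Before creating new records for the non-validated symbols, it can be worth checking whether a similarly spelled symbol is already registered. The following is a rough sketch of such a check using `difflib` from the Python standard library. It is not how LaminDB itself searches for similar records, and the short `registered` list, the `suggest_similar` helper, and the `cutoff` value are assumptions chosen for illustration:

```python
import difflib

# A handful of symbols standing in for an already-registered gene list (illustrative).
registered = ["CCL3", "CCL4", "JAZF1", "TAS2R8", "XBP1"]

# A few of the non-validated symbols reported in the log above.
non_validated = ["CCL3L1", "CCL4L1", "AZF1", "MDRV"]


def suggest_similar(symbol, candidates, cutoff=0.75):
    """Return candidates whose similarity ratio to `symbol` is at least `cutoff`."""
    return difflib.get_close_matches(symbol, candidates, n=3, cutoff=cutoff)


suggestions = {s: suggest_similar(s, registered) for s in non_validated}
for symbol, similar in suggestions.items():
    print(symbol, "->", similar or "no close match")
```

A near-match is only a prompt for manual review: a similar spelling does not mean the two symbols refer to the same gene.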

Manually register the 38 non-validated symbols:

inspect_result = bt.Gene.inspect(all_genes, bt.Gene.symbol)

# Create a new Gene record for each symbol that could not be validated
nonval_genes = [bt.Gene(symbol=g) for g in inspect_result.non_validated]

ln.save(nonval_genes)
! received 14696 unique terms, 1 empty/duplicated term is ignored
! 38 unique terms (0.30%) are not validated for symbol: 'MTRNR2L1', 'MDRV', 'CCL3L1', 'FOXL3-OT1', 'TAS2R36', 'MTRNR2L2', 'LOC107984156', 'LOC122513141', 'LOC344967', 'MTRNR2L6', ...
   couldn't validate 38 terms: 'LOC344967', 'AFD1', 'DUX5', 'MTRNR2L11', 'MTRNR2L8', 'MTRNR2L13', 'TAS2R33', 'MTRNR2L5', 'LOC102723475', 'LOC112268384', 'IGL', 'LOC122513141', 'MTRNR2L7', 'MTRNR2L6', 'MDRV', 'CCL4L1', 'MTRNR2L12', 'TRA', 'LOC100653049', 'MTRNR2L3', ...
→  if you are sure, create new records via Gene() and save to your registry
! records with similar symbols exist! did you mean to load one of them?
uid symbol stable_id ensembl_gene_id ncbi_gene_ids biotype synonyms description space_id source_id organism_id run_id created_at created_by_id _aux _branch_code
id
9732 F36vWNagki41 CCL3 None ENSG00000274221 6348 protein_coding LD78|G0S19-1|SCYA3|LD78ALPHA|SCI|MIP-1-ALPHA C-C motif chemokine ligand 3 1 11 1 None 2025-03-10 13:27:45.203000+00:00 1 None 1
9733 EBoWJxh70guC CCL3 None ENSG00000277632 6348 protein_coding LD78|G0S19-1|SCYA3|LD78ALPHA|SCI|MIP-1-ALPHA C-C motif chemokine ligand 3 1 11 1 None 2025-03-10 13:27:45.203000+00:00 1 None 1
9734 6XmKAksx3Nn4 CCL3 None ENSG00000278567 6348 protein_coding LD78|G0S19-1|SCYA3|LD78ALPHA|SCI|MIP-1-ALPHA C-C motif chemokine ligand 3 1 11 1 None 2025-03-10 13:27:45.203000+00:00 1 None 1
! records with similar symbols exist! did you mean to load one of them?
uid symbol stable_id ensembl_gene_id ncbi_gene_ids biotype synonyms description space_id source_id organism_id run_id created_at created_by_id _aux _branch_code
id
149 44WTiAvlIGf9 TAS2R8 None ENSG00000121314 50836 protein_coding TRB5|T2R8 taste 2 receptor member 8 1 11 1 None 2025-03-10 13:27:44.195000+00:00 1 None 1
150 NtKku75rRrqe TAS2R8 None ENSG00000272712 50836 protein_coding TRB5|T2R8 taste 2 receptor member 8 1 11 1 None 2025-03-10 13:27:44.195000+00:00 1 None 1
151 4ejBtFduCJ3G TAS2R8 None ENSG00000277316 50836 protein_coding TRB5|T2R8 taste 2 receptor member 8 1 11 1 None 2025-03-10 13:27:44.195000+00:00 1 None 1
! record with similar symbol exists! did you mean to load it?
uid symbol stable_id ensembl_gene_id ncbi_gene_ids biotype synonyms description space_id source_id organism_id run_id created_at created_by_id _aux _branch_code
id
9164 217N4Hgazwqu JAZF1 None ENSG00000153814 221895 protein_coding TIP27|ZNF802|DKFZP761K2222 JAZF zinc finger 1 1 11 1 None 2025-03-10 13:27:45.152000+00:00 1 None 1
! records with similar symbols exist! did you mean to load one of them?
uid symbol stable_id ensembl_gene_id ncbi_gene_ids biotype synonyms description space_id source_id organism_id run_id created_at created_by_id _aux _branch_code
id
11572 TfqHfw8FjZxq CCL4 None ENSG00000275302 6351 protein_coding MIP-1-BETA|SCYA4|AT744.1|LAG1|ACT-2 C-C motif chemokine ligand 4 1 11 1 None 2025-03-10 13:27:45.554000+00:00 1 None 1
11573 2MLaX8EEZ9sQ CCL4 None ENSG00000275824 6351 protein_coding MIP-1-BETA|SCYA4|AT744.1|LAG1|ACT-2 C-C motif chemokine ligand 4 1 11 1 None 2025-03-10 13:27:45.554000+00:00 1 None 1
11574 7P74Aze5sMxb CCL4 None ENSG00000277943 6351 protein_coding MIP-1-BETA|SCYA4|AT744.1|LAG1|ACT-2 C-C motif chemokine ligand 4 1 11 1 None 2025-03-10 13:27:45.554000+00:00 1 None 1
! record with similar symbol exists! did you mean to load it?
uid symbol stable_id ensembl_gene_id ncbi_gene_ids biotype synonyms description space_id source_id organism_id run_id created_at created_by_id _aux _branch_code
id
14378 19ZTVOfazMYF SEPTIN14 None ENSG00000154997 346288 protein_coding SEPT14|FLJ44060 septin 14 1 11 1 None 2025-03-10 13:27:45.802000+00:00 1 None 1
! record with similar symbol exists! did you mean to load it?
uid symbol stable_id ensembl_gene_id ncbi_gene_ids biotype synonyms description space_id source_id organism_id run_id created_at created_by_id _aux _branch_code
id
6656 2Z2LBAj16utl TRAFD1 None ENSG00000135148 10906 protein_coding FLN29 TRAF-type zinc finger domain containing 1 1 11 1 None 2025-03-10 13:27:44.935000+00:00 1 None 1
! records with similar symbols exist! did you mean to load one of them?
uid symbol stable_id ensembl_gene_id ncbi_gene_ids biotype synonyms description space_id source_id organism_id run_id created_at created_by_id _aux _branch_code
id
9907 7TBpxi4zzmMB IGLC3 None ENSG00000211679 IG_C_gene IGLC immunoglobulin lambda constant 3 (Kern-Oz+ mar... 1 11 1 None 2025-03-10 13:27:45.218000+00:00 1 None 1
15352 2qFUCnrrxPlc IGLC7 None ENSG00000211685 IG_C_gene immunoglobulin lambda constant 7 1 11 1 None 2025-03-10 13:27:45.890000+00:00 1 None 1
16480 10LbfkUokHgx IGLC1 None ENSG00000211675 IG_C_gene IGLC immunoglobulin lambda constant 1 1 11 1 None 2025-03-10 13:27:46.190000+00:00 1 None 1
! records with similar symbols exist! did you mean to load one of them?
uid symbol stable_id ensembl_gene_id ncbi_gene_ids biotype synonyms description space_id source_id organism_id run_id created_at created_by_id _aux _branch_code
id
149 44WTiAvlIGf9 TAS2R8 None ENSG00000121314 50836 protein_coding TRB5|T2R8 taste 2 receptor member 8 1 11 1 None 2025-03-10 13:27:44.195000+00:00 1 None 1
150 NtKku75rRrqe TAS2R8 None ENSG00000272712 50836 protein_coding TRB5|T2R8 taste 2 receptor member 8 1 11 1 None 2025-03-10 13:27:44.195000+00:00 1 None 1
151 4ejBtFduCJ3G TAS2R8 None ENSG00000277316 50836 protein_coding TRB5|T2R8 taste 2 receptor member 8 1 11 1 None 2025-03-10 13:27:44.195000+00:00 1 None 1
! records with similar symbols exist! did you mean to load one of them?
uid symbol stable_id ensembl_gene_id ncbi_gene_ids biotype synonyms description space_id source_id organism_id run_id created_at created_by_id _aux _branch_code
id
147 P5rb9Z56wWB2 TRABD2A None ENSG00000186854 129293|105374836 protein_coding TIKI1|C2ORF89 TraB domain containing 2A 1 11 1 None 2025-03-10 13:27:44.195000+00:00 1 None 1
450 6mwpdENkHzjV TRAPPC13 None ENSG00000113597 80006 protein_coding FLJ13611|MGC48585|C5ORF44 trafficking protein particle complex subunit 13 1 11 1 None 2025-03-10 13:27:44.219000+00:00 1 None 1
642 27ggmkQCDtQ9 TRABD2B None ENSG00000269113 388630 protein_coding TIKI2 TraB domain containing 2B 1 11 1 None 2025-03-10 13:27:44.237000+00:00 1 None 1