Look up records of species, genes, proteins, and cell markers#

Entities and ontologies can be complex: a single entity may carry many different identifiers, and the reference data differs between species.

Here we show Bionty’s Entity model for species, genes, proteins and cell markers. You’ll see how to

  • initialize an Entity model with different identifiers

  • access the reference table via .df()

  • look up an entity record via .lookup().{term}

import bionty as bt

Species#

To examine the Species ontology, we create the corresponding object and look at the associated pandas DataFrame.

species = bt.Species()

Reference table#

df = species.df()
df.head()
id name scientific_name division taxon_id assembly assembly_accession genebuild variation microarray pan_compara peptide_compara genome_alignments other_alignments core_db species_id
0 NCBI_80966 spiny chromis acanthochromis_polyacanthus EnsemblVertebrates 80966 ASM210954v1 GCA_002109545.1 2018-05-Ensembl/2020-03 N N N Y Y Y acanthochromis_polyacanthus_core_108_1 1
1 NCBI_211598 eurasian sparrowhawk accipiter_nisus EnsemblVertebrates 211598 Accipiter_nisus_ver1.0 GCA_004320145.1 2019-07-Ensembl/2019-09 N N N N N Y accipiter_nisus_core_108_1 1
2 NCBI_9646 giant panda ailuropoda_melanoleuca EnsemblVertebrates 9646 ASM200744v2 GCA_002007445.2 2020-05-Ensembl/2020-06 N N N Y Y Y ailuropoda_melanoleuca_core_108_2 1
3 NCBI_241587 yellow-billed parrot amazona_collaria EnsemblVertebrates 241587 ASM394721v1 GCA_003947215.1 2019-07-Ensembl/2019-09 N N N N N Y amazona_collaria_core_108_1 1
4 NCBI_61819 midas cichlid amphilophus_citrinellus EnsemblVertebrates 61819 Midas_v5 GCA_000751415.1 2018-05-Ensembl/2018-07 N N N Y Y Y amphilophus_citrinellus_core_108_5 1

Lookup records#

Terms can be searched with auto-complete using a lookup object:

Tip

By default, the name field is used to generate the lookup. You can change the field via:

species.lookup_field = <new field>

Duplicate terms are made unique by appending __0, __1, __2, …

lookup = species.lookup()
lookup.white_tufted_ear_marmoset
species(index=37, id='NCBI_9483', name='white-tufted-ear marmoset', scientific_name='callithrix_jacchus', division='EnsemblVertebrates', taxon_id=9483, assembly='mCalJac1.pat.X', assembly_accession='GCA_011100555.1', genebuild='2020-08-Ensembl/2020-11', variation='N', microarray='Y', pan_compara='N', peptide_compara='Y', genome_alignments='Y', other_alignments='Y', core_db='callithrix_jacchus_core_108_1', species_id=1)
lookup.white_tufted_ear_marmoset.scientific_name
'callithrix_jacchus'
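The lookup keys above are valid Python attribute names even though the original name contains hyphens and spaces, and the tip mentions suffixes for duplicates. The following is a minimal sketch of how such keys could be derived; it is not Bionty's actual implementation. The helper names to_key and make_unique are hypothetical, and both the normalization rule (replace runs of non-alphanumeric characters with an underscore) and the suffix rule (every member of a duplicate group gets a suffix, starting at __0) are assumptions.

```python
import re
from collections import Counter
from types import SimpleNamespace


def to_key(name: str) -> str:
    """Replace runs of non-alphanumeric characters with '_' (assumed rule)."""
    return re.sub(r"[^0-9a-zA-Z_]+", "_", name).strip("_")


def make_unique(keys):
    """Append __0, __1, ... to every key that occurs more than once."""
    counts = Counter(keys)  # how often each key occurs overall
    seen = Counter()        # how many copies of each key were already emitted
    unique = []
    for key in keys:
        if counts[key] > 1:
            unique.append(f"{key}__{seen[key]}")
            seen[key] += 1
        else:
            unique.append(key)
    return unique


# The second "giant panda" is an artificial duplicate for illustration.
names = ["white-tufted-ear marmoset", "giant panda", "giant panda"]
keys = make_unique([to_key(n) for n in names])

# A namespace gives attribute-style access, similar in spirit to the lookup object.
lookup_sketch = SimpleNamespace(**dict(zip(keys, names)))
print(keys)
print(lookup_sketch.white_tufted_ear_marmoset)
```

The sketch only handles punctuation and whitespace; names that start with a digit would need extra treatment before they could be used as attributes.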

To access the information for several species at once, for example human, mouse, and pig, we select the corresponding rows with pandas:

df = species.df()
df.set_index("name", inplace=True)
df.loc[["human", "mouse", "pig"]]
id scientific_name division taxon_id assembly assembly_accession genebuild variation microarray pan_compara peptide_compara genome_alignments other_alignments core_db species_id
name
human NCBI_9606 homo_sapiens EnsemblVertebrates 9606 GRCh38.p13 GCA_000001405.28 2014-01-Ensembl/2022-07 Y Y Y Y Y Y homo_sapiens_core_108_38 1
mouse NCBI_10090 mus_musculus EnsemblVertebrates 10090 GRCm39 GCA_000001635.9 2020-08-Ensembl/2022-07 Y Y Y Y Y Y mus_musculus_core_108_39 1
pig NCBI_9823 sus_scrofa EnsemblVertebrates 9823 Sscrofa11.1 GCA_000003025.6 2021-09-Ensembl/2022-02 Y Y N Y Y Y sus_scrofa_core_108_111 1
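Selecting with .loc and a list of labels raises a KeyError as soon as one label is absent from the index. The sketch below shows one way to select only the names that exist and to report the rest. It uses a small hand-built DataFrame (toy) as a stand-in for species.df(), with values copied from the output above; the name "unicorn" is a deliberately missing, made-up entry.

```python
import pandas as pd

# Stand-in for species.df(); only three columns are reproduced here.
toy = pd.DataFrame(
    {
        "name": ["human", "mouse", "pig"],
        "scientific_name": ["homo_sapiens", "mus_musculus", "sus_scrofa"],
        "taxon_id": [9606, 10090, 9823],
    }
).set_index("name")

wanted = ["human", "mouse", "unicorn"]  # "unicorn" is not in the table

# Filter on index membership instead of toy.loc[wanted], which would raise.
found = toy[toy.index.isin(wanted)]
missing = [name for name in wanted if name not in toy.index]

print(found)
print("not found:", missing)
```

Note that the filtered result keeps the row order of the table, not the order of the requested names.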

Gene#

Next, let’s take a look at genes, which follow the same design choices as Species.

The only difference is that the Gene class is initialized with a species parameter, so you retrieve only the gene entries of the specified species.

gene = bt.Gene(species="human")
df = gene.df()
df.head()
id ensembl_gene_id symbol gene_type description ncbi_gene_id hgnc_id omim_id synonyms version
0 Lzl9xt ENSG00000210049 MT-TF Mt_tRNA mitochondrially encoded tRNA-Phe (UUU/C) [Sour... None HGNC:7481 None MTTF|trnF Ens107
1 ILAWa7 ENSG00000211459 MT-RNR1 Mt_rRNA mitochondrially encoded 12S rRNA [Source:HGNC ... None HGNC:7470 None 12S|MOTS-c|MTRNR1 Ens107
2 XkyeQz ENSG00000210077 MT-TV Mt_tRNA mitochondrially encoded tRNA-Val (GUN) [Source... None HGNC:7500 None MTTV|trnV Ens107
3 jDD2jW ENSG00000210082 MT-RNR2 Mt_rRNA mitochondrially encoded 16S rRNA [Source:HGNC ... None HGNC:7471 None 16S|HN|MTRNR2 Ens107
4 J58H9b ENSG00000209082 MT-TL1 Mt_tRNA mitochondrially encoded tRNA-Leu (UUA/G) 1 [So... None HGNC:7490 None MTTL1|TRNL1 Ens107
lookup = gene.lookup()
lookup.TCF7
gene(index=20388, id='sXCrmQ', ensembl_gene_id='ENSG00000081059', symbol='TCF7', gene_type='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', ncbi_gene_id='6932', hgnc_id='HGNC:11639', omim_id='189908', synonyms='TCF-1', version='Ens107')

Converting between identifiers requires nothing more than pandas:

df.loc[df["symbol"].isin(["BRCA1", "BRCA2"])]
id ensembl_gene_id symbol gene_type description ncbi_gene_id hgnc_id omim_id synonyms version
17731 nLEreh ENSG00000139618 BRCA2 protein_coding BRCA2 DNA repair associated [Source:HGNC Symbo... 675 HGNC:1101 600185 BRCC2|FACD|FAD|FAD1|FANCD|FANCD1|XRCC11 Ens107
63779 9FY8yO ENSG00000012048 BRCA1 protein_coding BRCA1 DNA repair associated [Source:HGNC Symbo... 672 HGNC:1100 113705 BRCC1|FANCS|PPP1R53|RNF53 Ens107
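To convert a whole list of identifiers rather than filter rows, one option is to turn two columns of the reference table into a mapping and apply it with Series.map. The sketch below uses a small hand-built DataFrame (genes) in place of gene.df(), with values copied from the outputs above; "NOT_A_GENE" is a made-up symbol included to show what happens to unmatched entries. It assumes the symbols in the table are unique.

```python
import pandas as pd

# Stand-in for gene.df(); only the identifier columns are reproduced here.
genes = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG00000139618", "ENSG00000012048", "ENSG00000081059"],
        "symbol": ["BRCA2", "BRCA1", "TCF7"],
        "ncbi_gene_id": ["675", "672", "6932"],
    }
)

# A Series indexed by symbol acts as a symbol -> Ensembl ID mapping.
symbol_to_ensembl = genes.set_index("symbol")["ensembl_gene_id"]

query = pd.Series(["BRCA1", "TCF7", "NOT_A_GENE"])
converted = query.map(symbol_to_ensembl)  # unmatched symbols become NaN

print(converted)
```

Swapping the two column names gives the reverse mapping, and the same pattern applies to any other pair of identifier columns.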

The mouse reference is also available from Ensembl:

gene = bt.Gene(species="mouse")
df = gene.df()
df.head()
id ensembl_gene_id symbol gene_type description ncbi_gene_id mgi_id synonyms version
0 Epd98t ENSMUSG00000064336 mt-Tf Mt_tRNA mitochondrially encoded tRNA phenylalanine [So... None MGI:102487 tRNA|tRNA-Phe|TrnF tRNA Ens107
1 RiOxA6 ENSMUSG00000064337 mt-Rnr1 Mt_rRNA mitochondrially encoded 12S rRNA [Source:MGI S... None MGI:102493 12S ribosomal RNA|12S rRNA|12SrRNA|Rnr1 s-rRNA Ens107
2 cMIElg ENSMUSG00000064338 mt-Tv Mt_tRNA mitochondrially encoded tRNA valine [Source:MG... None MGI:102472 tRNA|tRNA-Val|TrnaV tRNA Ens107
3 DbiNNA ENSMUSG00000064339 mt-Rnr2 Mt_rRNA mitochondrially encoded 16S rRNA [Source:MGI S... None MGI:102492 16S ribosomal RNA|16S rRNA|16SrRNA|Rnr2 16S ri... Ens107
4 NO6NBF ENSMUSG00000064340 mt-Tl1 Mt_tRNA mitochondrially encoded tRNA leucine 1 [Source... None MGI:102482 tRNA|tRNA Leu|tRNA Leu_1|TrnrL1 tRNA Ens107

Protein#

The protein reference uses the UniProt ID as the standardized identifier.

protein = bt.Protein(species="human")
lookup = protein.lookup()
lookup.ABC_transporter_domain_containing_protein
protein(index=197375, id='7Hevwtc', uniprotkb_id='Q9BV39', uniprotkb_name='Q9BV39_HUMAN', synonyms='ABC transporter domain-containing protein', length=316, species_id=9606, gene_symbols=None, gene_synonyms=None, ensembl_transcript_ids=None, ncbi_gene_ids=None, name='ABC transporter domain-containing protein')
df = protein.df()
df.head()
id uniprotkb_id uniprotkb_name synonyms length species_id gene_symbols gene_synonyms ensembl_transcript_ids ncbi_gene_ids name
0 1zrr8Wy A0A024QZ08 A0A024QZ08_HUMAN Intraflagellar transport 20 homolog (Chlamydom... 132 9606 IFT20 None None 90410; isoform CRA_c
1 xNgxtFu A0A024QZ86 A0A024QZ86_HUMAN T-box 2|isoform CRA_a 712 9606 TBX2 None None 6909; T-box 2
2 X9K8OgK A0A024QZA8 A0A024QZA8_HUMAN Receptor protein-tyrosine kinase|EC 2.7.10.1 976 9606 EPHA2 None None 1969; EC 2.7.10.1
3 8jW9Ci4 A0A024QZB8 A0A024QZB8_HUMAN Battenin 438 9606 CLN3 None None 1201; Battenin
4 nZNsA6F A0A024QZQ1 A0A024QZQ1_HUMAN Sirtuin (Silent mating type information regula... 747 9606 SIRT1 None None 23411; isoform CRA_a

Cell marker#

The cell marker ontology works similarly.

cell_marker = bt.CellMarker(species="human")
df = cell_marker.df()
df.head()
id name ncbi_gene_id gene_symbol gene_name uniprotkb_id synonyms
0 CM_MERTK MERTK 10461 MERTK MER proto-oncogene, tyrosine kinase Q12866 None
1 CM_CD16 CD16 2215 FCGR3A Fc fragment of IgG receptor IIIb O75015 None
2 CM_CD206 CD206 4360 MRC1 mannose receptor C-type 1 P22897 None
3 CM_CRIg CRIg 11326 VSIG4 V-set and immunoglobulin domain containing 4 Q9Y279 None
4 CM_CD163 CD163 9332 CD163 CD163 molecule Q86VB7 None
lookup = cell_marker.lookup()
lookup.CD45
cell_marker(index=34, id='CM_CD45', name='CD45', ncbi_gene_id='5788', gene_symbol='PTPRC', gene_name='protein tyrosine phosphatase receptor type C', uniprotkb_id='M9MML4', synonyms=None)