Look up records of species, genes, proteins and cell markers#

Biological entities and their ontologies can be complex: a single entity may be referenced by many different identifiers, and the reference data differs between species.

Here we show Bionty’s Entity model for species, genes, proteins and cell markers. You’ll see how to:

initialize an Entity model with different identifiers
access the reference table via .df()
look up an entity record via .lookup().{term}

import bionty as bt

Species#

To examine the Species ontology, we create the corresponding object and look at the associated pandas DataFrame.

species = bt.Species()

Reference table#

df = species.df()
df.head()

	id	name	scientific_name	division	taxon_id	assembly	assembly_accession	genebuild	variation	microarray	pan_compara	peptide_compara	genome_alignments	other_alignments	core_db	species_id
0	NCBI_80966	spiny chromis	acanthochromis_polyacanthus	EnsemblVertebrates	80966	ASM210954v1	GCA_002109545.1	2018-05-Ensembl/2020-03	N	N	N	Y	Y	Y	acanthochromis_polyacanthus_core_108_1	1
1	NCBI_211598	eurasian sparrowhawk	accipiter_nisus	EnsemblVertebrates	211598	Accipiter_nisus_ver1.0	GCA_004320145.1	2019-07-Ensembl/2019-09	N	N	N	N	N	Y	accipiter_nisus_core_108_1	1
2	NCBI_9646	giant panda	ailuropoda_melanoleuca	EnsemblVertebrates	9646	ASM200744v2	GCA_002007445.2	2020-05-Ensembl/2020-06	N	N	N	Y	Y	Y	ailuropoda_melanoleuca_core_108_2	1
3	NCBI_241587	yellow-billed parrot	amazona_collaria	EnsemblVertebrates	241587	ASM394721v1	GCA_003947215.1	2019-07-Ensembl/2019-09	N	N	N	N	N	Y	amazona_collaria_core_108_1	1
4	NCBI_61819	midas cichlid	amphilophus_citrinellus	EnsemblVertebrates	61819	Midas_v5	GCA_000751415.1	2018-05-Ensembl/2018-07	N	N	N	Y	Y	Y	amphilophus_citrinellus_core_108_5	1

Lookup records#

Terms can be searched with auto-complete using a lookup object:

Tip

By default, the name field is used to generate the lookup. You can change the field via:

species.lookup_field = <new field>

Duplicate values are made unique by appending __0, __1, __2, …

lookup = species.lookup()

lookup.white_tufted_ear_marmoset

species(index=37, id='NCBI_9483', name='white-tufted-ear marmoset', scientific_name='callithrix_jacchus', division='EnsemblVertebrates', taxon_id=9483, assembly='mCalJac1.pat.X', assembly_accession='GCA_011100555.1', genebuild='2020-08-Ensembl/2020-11', variation='N', microarray='Y', pan_compara='N', peptide_compara='Y', genome_alignments='Y', other_alignments='Y', core_db='callithrix_jacchus_core_108_1', species_id=1)

lookup.white_tufted_ear_marmoset.scientific_name

'callithrix_jacchus'
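The mapping from a name such as 'white-tufted-ear marmoset' to the attribute white_tufted_ear_marmoset, together with the __0, __1 suffixes for duplicates, can be pictured with a small standalone sketch. This is not Bionty's implementation: the helpers to_key and build_lookup are hypothetical, the exact sanitizing rule is an assumption inferred from the examples on this page, and the second "giant panda" record is made up purely to show the suffixing.

```python
import re
from collections import Counter
from types import SimpleNamespace


def to_key(name: str) -> str:
    """Replace runs of non-word characters so the name is a valid attribute."""
    key = re.sub(r"\W+", "_", name).strip("_")
    # Identifiers cannot start with a digit; the "_" prefix is an assumption.
    return f"_{key}" if key[:1].isdigit() else key


def build_lookup(records: list[dict], field: str = "name") -> SimpleNamespace:
    """Expose each record as an attribute named after its `field` value."""
    keys = [to_key(rec[field]) for rec in records]
    totals = Counter(keys)  # how often each key occurs overall
    seen = Counter()        # how many duplicates were already numbered
    lookup = {}
    for key, rec in zip(keys, records):
        if totals[key] > 1:
            # Duplicate key: append __0, __1, ... in order of appearance.
            unique = f"{key}__{seen[key]}"
            seen[key] += 1
        else:
            unique = key
        lookup[unique] = rec
    return SimpleNamespace(**lookup)


records = [
    {"name": "white-tufted-ear marmoset", "scientific_name": "callithrix_jacchus"},
    {"name": "giant panda", "scientific_name": "ailuropoda_melanoleuca"},
    # Made-up duplicate, only here to demonstrate the __0/__1 suffixes.
    {"name": "giant panda", "scientific_name": "duplicate_for_illustration"},
]
lookup = build_lookup(records)
print(lookup.white_tufted_ear_marmoset["scientific_name"])
print(lookup.giant_panda__1["scientific_name"])
```

Passing a different field (for example field="scientific_name") mirrors the idea of changing the lookup field described in the tip.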

To access the records of specific species, for example human, mouse, and pig, we select the corresponding rows with pandas:

df = species.df()

df.set_index("name", inplace=True)
df.loc[["human", "mouse", "pig"]]

	id	scientific_name	division	taxon_id	assembly	assembly_accession	genebuild	variation	microarray	pan_compara	peptide_compara	genome_alignments	other_alignments	core_db	species_id
name
human	NCBI_9606	homo_sapiens	EnsemblVertebrates	9606	GRCh38.p13	GCA_000001405.28	2014-01-Ensembl/2022-07	Y	Y	Y	Y	Y	Y	homo_sapiens_core_108_38	1
mouse	NCBI_10090	mus_musculus	EnsemblVertebrates	10090	GRCm39	GCA_000001635.9	2020-08-Ensembl/2022-07	Y	Y	Y	Y	Y	Y	mus_musculus_core_108_39	1
pig	NCBI_9823	sus_scrofa	EnsemblVertebrates	9823	Sscrofa11.1	GCA_000003025.6	2021-09-Ensembl/2022-02	Y	Y	N	Y	Y	Y	sus_scrofa_core_108_111	1

Gene#

Next, let’s take a look at genes, which follow the same design choices as Species.

The only difference is that the Gene class is initialized with a species parameter, so you only retrieve gene entries for the specified species.

gene = bt.Gene(species="human")

df = gene.df()

df.head()

	id	ensembl_gene_id	symbol	gene_type	description	ncbi_gene_id	hgnc_id	omim_id	synonyms	version
0	Lzl9xt	ENSG00000210049	MT-TF	Mt_tRNA	mitochondrially encoded tRNA-Phe (UUU/C) [Sour...	None	HGNC:7481	None	MTTF|trnF	Ens107
1	ILAWa7	ENSG00000211459	MT-RNR1	Mt_rRNA	mitochondrially encoded 12S rRNA [Source:HGNC ...	None	HGNC:7470	None	12S|MOTS-c|MTRNR1	Ens107
2	XkyeQz	ENSG00000210077	MT-TV	Mt_tRNA	mitochondrially encoded tRNA-Val (GUN) [Source...	None	HGNC:7500	None	MTTV|trnV	Ens107
3	jDD2jW	ENSG00000210082	MT-RNR2	Mt_rRNA	mitochondrially encoded 16S rRNA [Source:HGNC ...	None	HGNC:7471	None	16S|HN|MTRNR2	Ens107
4	J58H9b	ENSG00000209082	MT-TL1	Mt_tRNA	mitochondrially encoded tRNA-Leu (UUA/G) 1 [So...	None	HGNC:7490	None	MTTL1|TRNL1	Ens107

lookup = gene.lookup()

lookup.TCF7

gene(index=20388, id='sXCrmQ', ensembl_gene_id='ENSG00000081059', symbol='TCF7', gene_type='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', ncbi_gene_id='6932', hgnc_id='HGNC:11639', omim_id='189908', synonyms='TCF-1', version='Ens107')

Converting between identifiers only requires pandas:

df.loc[df["symbol"].isin(["BRCA1", "BRCA2"])]

	id	ensembl_gene_id	symbol	gene_type	description	ncbi_gene_id	hgnc_id	omim_id	synonyms	version
17731	nLEreh	ENSG00000139618	BRCA2	protein_coding	BRCA2 DNA repair associated [Source:HGNC Symbo...	675	HGNC:1101	600185	BRCC2|FACD|FAD|FAD1|FANCD|FANCD1|XRCC11	Ens107
63779	9FY8yO	ENSG00000012048	BRCA1	protein_coding	BRCA1 DNA repair associated [Source:HGNC Symbo...	672	HGNC:1100	113705	BRCC1|FANCS|PPP1R53|RNF53	Ens107
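When many symbols need translating, it can help to turn two columns of the reference table into a plain dictionary. The sketch below uses a tiny hand-built DataFrame holding the two BRCA rows shown above, so it runs without Bionty. With the real table you would start from gene.df() instead. The variable names and the placeholder symbol "NOT_A_GENE" are illustrative only.

```python
import pandas as pd

# Two rows copied from the output above; a real table would come from gene.df().
df = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG00000139618", "ENSG00000012048"],
        "symbol": ["BRCA2", "BRCA1"],
        "ncbi_gene_id": ["675", "672"],
        "hgnc_id": ["HGNC:1101", "HGNC:1100"],
    }
)

# Build a symbol -> Ensembl gene ID mapping from two columns.
symbol_to_ensembl = df.set_index("symbol")["ensembl_gene_id"].to_dict()

# Translate a list of symbols; symbols missing from the table map to None.
symbols = ["BRCA1", "BRCA2", "NOT_A_GENE"]
ensembl_ids = [symbol_to_ensembl.get(s) for s in symbols]
print(ensembl_ids)
```

Note that to_dict() keeps only the last row for a repeated symbol, so if a symbol can occur more than once in the full table, a groupby or an explicit isin filter may be the safer choice.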

The mouse reference is also available from Ensembl:

gene = bt.Gene(species="mouse")

df = gene.df()
df.head()

	id	ensembl_gene_id	symbol	gene_type	description	ncbi_gene_id	mgi_id	synonyms	version
0	Epd98t	ENSMUSG00000064336	mt-Tf	Mt_tRNA	mitochondrially encoded tRNA phenylalanine [So...	None	MGI:102487	tRNA|tRNA-Phe|TrnF tRNA	Ens107
1	RiOxA6	ENSMUSG00000064337	mt-Rnr1	Mt_rRNA	mitochondrially encoded 12S rRNA [Source:MGI S...	None	MGI:102493	12S ribosomal RNA|12S rRNA|12SrRNA|Rnr1 s-rRNA	Ens107
2	cMIElg	ENSMUSG00000064338	mt-Tv	Mt_tRNA	mitochondrially encoded tRNA valine [Source:MG...	None	MGI:102472	tRNA|tRNA-Val|TrnaV tRNA	Ens107
3	DbiNNA	ENSMUSG00000064339	mt-Rnr2	Mt_rRNA	mitochondrially encoded 16S rRNA [Source:MGI S...	None	MGI:102492	16S ribosomal RNA|16S rRNA|16SrRNA|Rnr2 16S ri...	Ens107
4	NO6NBF	ENSMUSG00000064340	mt-Tl1	Mt_tRNA	mitochondrially encoded tRNA leucine 1 [Source...	None	MGI:102482	tRNA|tRNA Leu|tRNA Leu_1|TrnrL1 tRNA	Ens107

Protein#

The protein reference uses the UniProt ID as the standardized identifier.

protein = bt.Protein(species="human")

lookup = protein.lookup()

lookup.ABC_transporter_domain_containing_protein

protein(index=197375, id='7Hevwtc', uniprotkb_id='Q9BV39', uniprotkb_name='Q9BV39_HUMAN', synonyms='ABC transporter domain-containing protein', length=316, species_id=9606, gene_symbols=None, gene_synonyms=None, ensembl_transcript_ids=None, ncbi_gene_ids=None, name='ABC transporter domain-containing protein')

df = protein.df()
df.head()

	id	uniprotkb_id	uniprotkb_name	synonyms	length	species_id	gene_symbols	gene_synonyms	ensembl_transcript_ids	ncbi_gene_ids	name
0	1zrr8Wy	A0A024QZ08	A0A024QZ08_HUMAN	Intraflagellar transport 20 homolog (Chlamydom...	132	9606	IFT20	None	None	90410;	isoform CRA_c
1	xNgxtFu	A0A024QZ86	A0A024QZ86_HUMAN	T-box 2|isoform CRA_a	712	9606	TBX2	None	None	6909;	T-box 2
2	X9K8OgK	A0A024QZA8	A0A024QZA8_HUMAN	Receptor protein-tyrosine kinase|EC 2.7.10.1	976	9606	EPHA2	None	None	1969;	EC 2.7.10.1
3	8jW9Ci4	A0A024QZB8	A0A024QZB8_HUMAN	Battenin	438	9606	CLN3	None	None	1201;	Battenin
4	nZNsA6F	A0A024QZQ1	A0A024QZQ1_HUMAN	Sirtuin (Silent mating type information regula...	747	9606	SIRT1	None	None	23411;	isoform CRA_a
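Several columns in these tables pack multiple values into one string: synonyms are pipe-separated and, judging from the output above, ncbi_gene_ids are semicolon-terminated. A small helper can turn such fields into Python lists. The function split_field is hypothetical and not part of Bionty, and the delimiters are assumptions read off the displayed rows.

```python
def split_field(value, sep):
    """Split a delimited string into a list of non-empty, stripped parts."""
    # Missing values (None, or NaN in a DataFrame) yield an empty list.
    if not isinstance(value, str):
        return []
    return [part.strip() for part in value.split(sep) if part.strip()]


# Values copied from the second row of the table above.
row = {
    "uniprotkb_id": "A0A024QZ86",
    "synonyms": "T-box 2|isoform CRA_a",
    "ncbi_gene_ids": "6909;",
    "gene_synonyms": None,
}

print(split_field(row["synonyms"], "|"))
print(split_field(row["ncbi_gene_ids"], ";"))
print(split_field(row["gene_synonyms"], "|"))
```

Applied to a whole column, the same helper could be passed to pandas via df["synonyms"].apply(split_field, sep="|").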

Cell marker#

The cell marker ontology works similarly.

cell_marker = bt.CellMarker(species="human")

df = cell_marker.df()
df.head()

	id	name	ncbi_gene_id	gene_symbol	gene_name	uniprotkb_id	synonyms
0	CM_MERTK	MERTK	10461	MERTK	MER proto-oncogene, tyrosine kinase	Q12866	None
1	CM_CD16	CD16	2215	FCGR3A	Fc fragment of IgG receptor IIIb	O75015	None
2	CM_CD206	CD206	4360	MRC1	mannose receptor C-type 1	P22897	None
3	CM_CRIg	CRIg	11326	VSIG4	V-set and immunoglobulin domain containing 4	Q9Y279	None
4	CM_CD163	CD163	9332	CD163	CD163 molecule	Q86VB7	None

lookup = cell_marker.lookup()

lookup.CD45

cell_marker(index=34, id='CM_CD45', name='CD45', ncbi_gene_id='5788', gene_symbol='PTPRC', gene_name='protein tyrosine phosphatase receptor type C', uniprotkb_id='M9MML4', synonyms=None)