Chapter 6 Bio.Entrez: Accessing NCBI's Entrez databases
Entrez (http://www.ncbi.nlm.nih.gov/Entrez) is a data retrieval system that provides users access to NCBI's databases such as PubMed, GenBank, GEO, and many others. You can access Entrez from a web browser to manually enter queries, or you can use Biopython's Bio.Entrez
module for programmatic access to Entrez. The latter allows you for example to search PubMed or download GenBank records from within a Python script.
The Bio.Entrez
module makes use of the Entrez Programming Utilities, consisting of eight tools that are described in detail on NCBI's page at http://www.ncbi.nlm.nih.gov/entrez/utils/. Each of these tools corresponds to one Python function in the Bio.Entrez
module, as described in the sections below. This module makes sure that the correct URL is used for the queries, and that not more than one request is made every three seconds, as required by NCBI.
The output returned by the Entrez Programming Utilities is typically in XML format. Currently, Biopython does not contain parsers for the XML output generated by the Entrez Programming Utilities. However, if you know what you're looking for, it is fairly easy to pull out the information you need from the XML output. For sequence databases, the Entrez Programming Utilities can also generate output in other formats (such as the Fasta and GenBank file format). This can then be parsed into a SeqRecord using Bio.SeqIO
(see Chapter 4, and the example below).
6.1 EInfo: Obtaining information about the Entrez databases
EInfo provides field index term counts, last update, and available links for each of NCBI's databases. In addition, you can use EInfo to obtain a list of all database names accessible through the Entrez utilities:
>>> from Bio import Entrez
>>> handle = Entrez.einfo()
>>> result = handle.read()
The variable result
now contains a list of databases in XML format:
>>> print result
<?xml version="1.0"?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbList>
<DbName>pubmed</DbName>
<DbName>protein</DbName>
<DbName>nucleotide</DbName>
<DbName>nuccore</DbName>
<DbName>nucgss</DbName>
<DbName>nucest</DbName>
<DbName>structure</DbName>
<DbName>genome</DbName>
<DbName>books</DbName>
<DbName>cancerchromosomes</DbName>
<DbName>cdd</DbName>
<DbName>gap</DbName>
<DbName>domains</DbName>
<DbName>gene</DbName>
<DbName>genomeprj</DbName>
<DbName>gensat</DbName>
<DbName>geo</DbName>
<DbName>gds</DbName>
<DbName>homologene</DbName>
<DbName>journals</DbName>
<DbName>mesh</DbName>
<DbName>ncbisearch</DbName>
<DbName>nlmcatalog</DbName>
<DbName>omia</DbName>
<DbName>omim</DbName>
<DbName>pmc</DbName>
<DbName>popset</DbName>
<DbName>probe</DbName>
<DbName>proteinclusters</DbName>
<DbName>pcassay</DbName>
<DbName>pccompound</DbName>
<DbName>pcsubstance</DbName>
<DbName>snp</DbName>
<DbName>taxonomy</DbName>
<DbName>toolkit</DbName>
<DbName>unigene</DbName>
<DbName>unists</DbName>
</DbList>
</eInfoResult>
For each of these databases, we can use EInfo again to obtain more information:
>>> handle = Entrez.einfo(db="pubmed")
>>> print handle.read()
<?xml version="1.0"?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbInfo>
<DbName>pubmed</DbName>
<MenuName>PubMed</MenuName>
<Description>PubMed bibliographic record</Description>
<Count>17781992</Count>
<LastUpdate>2008/02/18 01:22</LastUpdate>
<FieldList>
<Field>
<Name>ALL</Name>
...
6.2 ESearch: Searching the Entrez databases
To search any of these databases, we use Bio.Entrez.esearch()
. For example, let's search in PubMed for publications related to Biopython:
>>> from Bio import Entrez
>>> handle = Entrez.esearch(db="pubmed", term="biopython")
>>> print handle.read()
<?xml version="1.0"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
<Count>5</Count>
<RetMax>5</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>16403221</Id>
<Id>16377612</Id>
<Id>14871861</Id>
<Id>14630660</Id>
<Id>12230038</Id>
</IdList>
<TranslationSet>
</TranslationSet>
<TranslationStack>
<TermSet>
<Term>biopython[All Fields]</Term>
<Field>All Fields</Field>
<Count>5</Count>
<Explode>Y</Explode>
</TermSet>
<OP>GROUP</OP>
</TranslationStack>
<QueryTranslation>biopython[All Fields]</QueryTranslation>
</eSearchResult>
In this output, you see five PubMed IDs (16403221, 16377612, 14871861, 14630660, 12230038), which can be retrieved by EFetch (see section 6.5).
You can also use ESearch to search GenBank. Here we'll do a quick search for the rpl16 gene in Opuntia:
>>> handle = Entrez.esearch(db="nucleotide",term="Opuntia and rpl16")
>>> print handle.read()
<?xml version="1.0"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
<Count>9</Count>
<RetMax>9</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>57240072</Id>
<Id>57240071</Id>
<Id>6273287</Id>
<Id>6273291</Id>
<Id>6273290</Id>
<Id>6273289</Id>
<Id>6273286</Id>
<Id>6273285</Id>
<Id>6273284</Id>
</IdList>
<TranslationSet>
</TranslationSet>
<QueryTranslation></QueryTranslation>
</eSearchResult>
Each of the IDs (<Id>57240072</Id>
, ...) is a GenBank identifier. See section 6.5 for information on how to actually download these GenBank records.
As a final example, let's get a list of computational journal titles:
>>> handle = Entrez.esearch(db="journals", term="computational")
>>> print handle.read()
<?xml version="1.0"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
<Count>15</Count>
<RetMax>15</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>30367</Id>
<Id>33843</Id>
<Id>33823</Id>
<Id>32989</Id>
<Id>33190</Id>
<Id>33009</Id>
<Id>31986</Id>
<Id>8799</Id>
<Id>22857</Id>
<Id>32675</Id>
<Id>20258</Id>
<Id>33859</Id>
<Id>32534</Id>
<Id>32357</Id>
<Id>32249</Id>
</IdList>
<TranslationSet>
</TranslationSet>
<TranslationStack>
<TermSet>
<Term>computational[All Fields]</Term>
<Field>All Fields</Field>
<Count>15</Count>
<Explode>Y</Explode>
</TermSet>
<OP>GROUP</OP>
</TranslationStack>
<QueryTranslation>computational[All Fields]</QueryTranslation>
</eSearchResult>
Again, we could use EFetch to obtain more information for each of these journal IDs.
ESearch has many useful options --- see the ESearch help page for more information.
EPost posts a list of UIs for use in subsequent search strategies; see the EPost help page for more information. It is available from Biopython through Bio.Entrez.epost()
.
6.4 ESummary: Retrieving summaries from primary IDs
ESummary retrieves document summaries from a list of primary IDs (see the ESummary help page for more information). In Biopython, ESummary is available as Bio.Entrez.esummary()
. Using the search result above, we can for example find out more about the journal with ID 30367:
>>> from Bio import Entrez
>>> handle = Entrez.esummary(db="journals", id="30367")
>>> print handle.read()
<?xml version="1.0"?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD eSummaryResult, 29 October 2004//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_041029.dtd">
<eSummaryResult>
<DocSum>
<Id>30367</Id>
<Item Name="Title" Type="String">Computational biology and chemistry</Item>
<Item Name="MedAbbr" Type="String">Comput Biol Chem</Item>
<Item Name="IsoAbbr" Type="String"></Item>
<Item Name="NlmId" Type="String">101157394</Item>
<Item Name="pISSN" Type="String">1476-9271</Item>
<Item Name="eISSN" Type="String"></Item>
<Item Name="PublicationStartYear" Type="String">2003</Item>
<Item Name="PublicationEndYear" Type="String"></Item>
<Item Name="Publisher" Type="String">Pergamon,</Item>
<Item Name="Language" Type="String">eng</Item>
<Item Name="Country" Type="String">England</Item>
<Item Name="BroadHeading" Type="List">
<Item Name="string" Type="String">Biology</Item>
<Item Name="string" Type="String">Chemistry</Item>
<Item Name="string" Type="String">Medical Informatics</Item>
</Item>
<Item Name="ContinuationNotes" Type="String">Continues: Computers & chemistry. </Item>
</DocSum>
</eSummaryResult>
6.5 EFetch: Downloading full records from Entrez
EFetch is what you use when you want to retrieve a full record from Entrez.
For the Opuntia example above, we can download GenBank record 57240072 using Bio.Entrez.efetch
:
>>> handle = Entrez.efetch(db="nucleotide", id="57240072",rettype="genbank")
>>> print handle.read()
LOCUS AY851612 892 bp DNA linear PLN 10-APR-2007
DEFINITION Opuntia subulata rpl16 gene, intron; chloroplast.
ACCESSION AY851612
VERSION AY851612.1 GI:57240072
KEYWORDS .
SOURCE chloroplast Austrocylindropuntia subulata
ORGANISM Austrocylindropuntia subulata
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
Caryophyllales; Cactaceae; Opuntioideae; Austrocylindropuntia.
REFERENCE 1 (bases 1 to 892)
AUTHORS Butterworth,C.A. and Wallace,R.S.
TITLE Molecular Phylogenetics of the Leafy Cactus Genus Pereskia
(Cactaceae)
JOURNAL Syst. Bot. 30 (4), 800-808 (2005)
REFERENCE 2 (bases 1 to 892)
AUTHORS Butterworth,C.A. and Wallace,R.S.
TITLE Direct Submission
JOURNAL Submitted (10-DEC-2004) Desert Botanical Garden, 1201 North Galvin
Parkway, Phoenix, AZ 85008, USA
FEATURES Location/Qualifiers
source 1..892
/organism="Austrocylindropuntia subulata"
/organelle="plastid:chloroplast"
/mol_type="genomic DNA"
/db_xref="taxon:106982"
gene <1..>892
/gene="rpl16"
intron <1..>892
/gene="rpl16"
ORIGIN
1 cattaaagaa gggggatgcg gataaatgga aaggcgaaag aaagaaaaaa atgaatctaa
61 atgatatacg attccactat gtaaggtctt tgaatcatat cataaaagac aatgtaataa
121 agcatgaata cagattcaca cataattatc tgatatgaat ctattcatag aaaaaagaaa
181 aaagtaagag cctccggcca ataaagacta agagggttgg ctcaagaaca aagttcatta
241 agagctccat tgtagaattc agacctaatc attaatcaag aagcgatggg aacgatgtaa
301 tccatgaata cagaagattc aattgaaaaa gatcctaatg atcattggga aggatggcgg
361 aacgaaccag agaccaattc atctattctg aaaagtgata aactaatcct ataaaactaa
421 aatagatatt gaaagagtaa atattcgccc gcgaaaattc cttttttatt aaattgctca
481 tattttattt tagcaatgca atctaataaa atatatctat acaaaaaaat atagacaaac
541 tatatatata taatatattt caaatttcct tatataccca aatataaaaa tatctaataa
601 attagatgaa tatcaaagaa tctattgatt tagtgtatta ttaaatgtat atcttaattc
661 aatattatta ttctattcat ttttattcat tttcaaattt ataatatatt aatctatata
721 ttaatttata attctattct aattcgaatt caatttttaa atattcatat tcaattaaaa
781 ttgaaatttt ttcattcgcg aggagccgga tgagaagaaa ctctcatgtc cggttctgta
841 gtagagatgg aattaagaaa aaaccatcaa ctataacccc aagagaacca ga
//
The argument rettype="genbank"
lets us download this record in the GenBank format. Alternatively, you could for example use rettype="fasta"
to get the Fasta-format; see the EFetch Help page for other options. The available formats depend on which database you are downloading from.
If you fetch the record in one of the formats accepted by Bio.SeqIO
(see Chapter 4), you can directly parse it into a SeqRecord
:
>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide", id="57240072",rettype="genbank")
>>> record = SeqIO.read(handle, "genbank")
>>> print record
ID: AY851612.1
Name: AY851612
Desription: Opuntia subulata rpl16 gene, intron; chloroplast.
/sequence_version=1
/source=chloroplast Austrocylindropuntia subulata
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'eudicotyledons', 'core eudicotyledons', 'Caryophyllales', 'Cactaceae', 'Opuntioideae', 'Austrocylindropuntia']
/keywords=['']
/references=[<Bio.SeqFeature.Reference instance at 0x141d3a0>, <Bio.SeqFeature.Reference instance at 0x14173a0>]
/accessions=['AY851612']
/data_file_division=PLN
/date=10-APR-2007
/organism=Austrocylindropuntia subulata
/gi=57240072
Seq('CATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGA...AGA', IUPACAmbiguousDNA())
For help on ELink, see the ELink help page. ELink is available from Biopython through Bio.Entrez.elink()
.
6.7 EGQuery: Obtaining counts for search terms
EGQuery provides counts for a search term in each of the Entrez databases. In this example, we use Bio.Entrez.egquery()
to obtain the counts for ``Biopython'':
>>> handle = Entrez.egquery(term="biopython")
>>> print handle.read()
<?xml version="1.0"?>
<!DOCTYPE Result PUBLIC "-//NLM//DTD eSearchResult, January 2004//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/egquery.dtd">
<Result>
<Term>biopython</Term>
<eGQueryResult>
<ResultItem>
<DbName>pubmed</DbName>
<MenuName>PubMed</MenuName>
<Count>706</Count>
<Status>Ok</Status>
</ResultItem>
<ResultItem>
<DbName>pmc</DbName>
<MenuName>PMC</MenuName>
<Count>359</Count>
<Status>Ok</Status>
</ResultItem>
...
See the EGQuery help page for more information.
6.8 ESpell: Obtaining spelling suggestions
ESpell retrieves spelling suggestions. In this example, we use Bio.Entrez.espell()
to obtain the correct spelling of Biopython:
>>> from Bio import Entrez
>>> handle = Entrez.espell(term="biopythooon")
>>> print handle.read()
<eSpellResult>
<Database>pubmed</Database>
<Query>biopythooon</Query>
<CorrectedQuery>biopython</CorrectedQuery>
<SpelledQuery><Replaced>biopython</Replaced></SpelledQuery>
<ERROR/>
</eSpellResult>
See the ESpell help page for more information.
6.9 Creating web links to the Entrez databases
In addition to the eight Entrez Programming Utilities, you can also create URLs to information of the Entrez databases in HTML format. This is primarily intended to create links or bookmarks to the Entrez databases. To do so, you can use the function Bio.Entrez.query
. Detailed information of this service is available from http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helplinks.chapter.linkshelpNCBI. For heavy usage of the NCBI databases, please use the Entrez Programming Utilities instead of Bio.Entrez.query
.