Michelle Chen 技術雜談:Query PubMed with Regular Expression


The Entrez Programming Utilities (E-utilities) are an interface to the Entrez query and database system at the National Center for Biotechnology Information (NCBI). With E-utilities, we can write programs to query NCBI databases. Since the interface is a fixed URL syntax, we can query NCBI databases from any programming language that can send HTTP requests. Here we demo a Perl script that queries PubMed and filters the results with regular expressions.
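To make the "fixed URL syntax" concrete, here is a minimal sketch that assembles an E-utilities request URL by hand using only core Perl. The helper names `escape_param` and `build_eutils_url` are hypothetical, made up for this illustration, and the base URL is the HTTPS endpoint from NCBI's documentation.

```perl
use strict;
use warnings;

# Base URL of the E-utilities service (assumed: the documented HTTPS endpoint).
my $BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# Percent-encode everything except the RFC 3986 unreserved characters.
# Assumes ASCII input; other text should be encoded to UTF-8 bytes first.
sub escape_param {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $value;
}

# Hypothetical helper: join a utility name and an ordered key/value list
# into a full request URL.
sub build_eutils_url {
    my ($utility, @params) = @_;
    my @pairs;
    while (my ($key, $value) = splice(@params, 0, 2)) {
        push @pairs, escape_param($key) . '=' . escape_param($value);
    }
    return $BASE . $utility . '.fcgi?' . join('&', @pairs);
}

my $url = build_eutils_url('esearch',
    db => 'pubmed', term => 'schizophrenia AND clinical trial', usehistory => 'y');
print "$url\n";
```

Any HTTP client can then fetch the resulting URL. In practice an HTTP library usually does this encoding for you.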

The program is hosted on GitHub; if you are interested in the script, you may follow it there. The program is inspired by a post by cfrenz, where you can also check the original script.

Basically, there are two steps:

  1. Query PubMed with esearch to get the essential parameters: the query key, the WebEnv string and the total result count.
  2. Query PubMed in batch mode with efetch.
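Batch mode means asking efetch for one window of records at a time. As a small sketch, the hypothetical helper `batch_offsets` below lists the starting offset of each request needed to cover a given result count. Note that the loop stops before the offset reaches the count, so no empty trailing batch is requested.

```perl
use strict;
use warnings;

# Hypothetical helper: list the start offset of each efetch request needed
# to cover $count records when fetching $retmax records per request.
sub batch_offsets {
    my ($count, $retmax) = @_;
    die "retmax must be positive\n" unless $retmax > 0;
    my @offsets;
    for (my $start = 0; $start < $count; $start += $retmax) {
        push @offsets, $start;
    }
    return @offsets;
}

print join(', ', batch_offsets(1234, 500)), "\n";   # prints "0, 500, 1000"
```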

Here we use Mojo::UserAgent to do the HTTP GET action. The first step is like this:

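{{< highlight perl >}}
use strict;
use warnings;
use Mojo::UserAgent;

my $query   = "schizophrenia AND clinical trial";
my $baseurl = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/";
my $ua      = Mojo::UserAgent->new();

# esearch with usehistory=y keeps the hits on the NCBI history server.
my $tx = $ua->get($baseurl . "esearch.fcgi",
                  form => { db => 'pubmed', retmax => 1,
                            usehistory => 'y', term => $query });
# Newer Mojolicious releases replace $tx->success with $tx->error and $tx->result.
if (my $err = $tx->error) { die "Failed connection: $err->{message}\n" }
my $results = $tx->result->body;

# Pull out the values that efetch needs, failing loudly if one is missing.
my ($num_abstracts) = $results =~ /<Count>(\d+)<\/Count>/
    or die "No <Count> in response\n";
my ($query_key) = $results =~ /<QueryKey>(\d+)<\/QueryKey>/
    or die "No <QueryKey> in response\n";
my ($web_env) = $results =~ /<WebEnv>(.*?)<\/WebEnv>/
    or die "No <WebEnv> in response\n";
{{< /highlight >}}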
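The three captures in this step can also be collected in one place. The sketch below uses a hypothetical helper, `parse_esearch`, that runs the same kind of regular expressions against an esearch response and dies if any value is missing. The sample XML is made up and trimmed to the relevant elements.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the values the efetch step needs from an
# esearch XML response, dying if any of them cannot be found.
sub parse_esearch {
    my ($xml) = @_;
    my %patterns = (
        count     => qr{<Count>(\d+)</Count>},
        query_key => qr{<QueryKey>(\d+)</QueryKey>},
        web_env   => qr{<WebEnv>(.*?)</WebEnv>},
    );
    my %found;
    for my $name (sort keys %patterns) {
        # A failed match returns an empty list, which triggers the die.
        my ($value) = $xml =~ $patterns{$name}
            or die "esearch response has no $name\n";
        $found{$name} = $value;
    }
    return \%found;
}

# Made-up sample response, not real NCBI output.
my $sample = <<'XML';
<eSearchResult><Count>1234</Count><RetMax>1</RetMax><RetStart>0</RetStart>
<QueryKey>1</QueryKey><WebEnv>EXAMPLE_WEBENV_TOKEN</WebEnv></eSearchResult>
XML

my $info = parse_esearch($sample);
print "$info->{count} records, query_key=$info->{query_key}\n";
```

Checking each match matters because `$1` keeps its value from the last successful match, so an unchecked failure can silently reuse an earlier capture.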

After getting the parameters, do the next step like this:

{{< highlight perl >}}
my $retmax = 500; # batch size; efetch accepts at most 10,000 per request
for (my $retstart = 0; $retstart < $num_abstracts; $retstart += $retmax) {
    $tx = $ua->get($baseurl . "efetch.fcgi",
          form => { db => "pubmed", WebEnv => $web_env,
                    query_key => $query_key, rettype => 'abstract',
                    retmode => 'xml',   # the regex below expects XML
                    retstart => $retstart, retmax => $retmax, });
    if (my $err = $tx->error) {
        warn "Failed connection: $err->{message}";
        next;   # skip this batch and carry on with the next one
    }
    my $text = $tx->result->text;
    # Split the batch into individual <PubmedArticle> records.
    my @contents = $text =~ m{(<PubmedArticle>.*?</PubmedArticle>)}gs;
    for my $content (@contents) {
        # do more things here
    }
}
{{< /highlight >}}
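As one possible way to fill in the "do more things here" part, the sketch below filters records with a regular expression. The helper `filter_articles` is hypothetical, and so is the sample XML, which only mimics the `<ArticleTitle>` and `<AbstractText>` elements of a PubMed record.

```perl
use strict;
use warnings;

# Hypothetical helper: return the titles of the <PubmedArticle> records
# whose abstract text matches $pattern.
sub filter_articles {
    my ($xml, $pattern) = @_;
    my @titles;
    for my $article ($xml =~ m{(<PubmedArticle>.*?</PubmedArticle>)}gs) {
        my ($title) = $article =~ m{<ArticleTitle>(.*?)</ArticleTitle>}s;
        # An abstract may be split into several <AbstractText> sections.
        my $abstract = join ' ',
            $article =~ m{<AbstractText[^>]*>(.*?)</AbstractText>}gs;
        push @titles, $title if defined $title && $abstract =~ $pattern;
    }
    return @titles;
}

# Made-up sample data, not real PubMed output.
my $sample = <<'XML';
<PubmedArticleSet>
<PubmedArticle><ArticleTitle>Trial A</ArticleTitle>
<AbstractText Label="METHODS">A randomized, double-blind study.</AbstractText></PubmedArticle>
<PubmedArticle><ArticleTitle>Review B</ArticleTitle>
<AbstractText>A narrative review.</AbstractText></PubmedArticle>
</PubmedArticleSet>
XML

print "$_\n" for filter_articles($sample, qr/double-blind/i);   # prints "Trial A"
```

This approach relies on the tags appearing exactly as written. Nested markup or attributes on other elements would need a looser pattern or a real XML parser.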

I tried **Mojo::DOM**, but it was too slow for this volume of data, so I switched to regular expressions. Although regular expressions are not a real parsing tool, their speed is satisfactory. I also tried BioPerl, but it failed to work, so I rewrote the query myself. You may try E-utilities and work out your own scripts.