Technical Notes: Querying PubMed with Regular Expressions

The Entrez Programming Utilities (E-Utilities) are an interface to the Entrez query and database system at the National Center for Biotechnology Information (NCBI). With E-Utilities, we can write programs to query NCBI databases. Because the interface is a fixed URL syntax, it can be used from any programming language that can send HTTP requests. Here we demonstrate a Perl script that queries PubMed and filters the results with regular expressions.

The program is hosted on GitHub; if you are interested, you can follow the script there. It was inspired by a post by cfrenz, where you can also find the original script.

Basically, there are two steps:

1. Query PubMed with esearch to get the essential parameters: the query key, the WebEnv string, and the total result count.
2. Query PubMed in batch mode with efetch, reusing the query key and WebEnv from the first step.
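
Since both steps are plain HTTP GET requests, it can help to see what the two URLs look like before any HTTP library is involved. The sketch below builds them with core Perl only. `uri_escape_simple` and `build_eutils_url` are hypothetical helpers written for this illustration, and `PLACEHOLDER_WEBENV` stands in for the value that esearch would actually return.

```perl
use strict;
use warnings;

# Base URL of the E-utilities endpoints (served over HTTPS).
my $BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# Percent-encode everything except RFC 3986 unreserved characters.
# Assumes plain ASCII input; encode to UTF-8 bytes first otherwise.
sub uri_escape_simple {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $value;
}

# Hypothetical helper: join a utility name and its parameters into a URL.
# Keys are sorted only to make the output deterministic.
sub build_eutils_url {
    my ($utility, %params) = @_;
    my $query = join '&',
        map { uri_escape_simple($_) . '=' . uri_escape_simple($params{$_}) }
        sort keys %params;
    return "$BASE$utility.fcgi?$query";
}

# Step 1: search, and keep the result set on the history server.
my $esearch_url = build_eutils_url('esearch',
    db => 'pubmed', term => 'schizophrenia AND clinical trial',
    usehistory => 'y', retmax => 1);

# Step 2: fetch one batch from the stored result set.
my $efetch_url = build_eutils_url('efetch',
    db => 'pubmed', query_key => 1, WebEnv => 'PLACEHOLDER_WEBENV',
    retmode => 'xml', retstart => 0, retmax => 500);

print "$esearch_url\n$efetch_url\n";
```

In the script itself this work is delegated to Mojo::UserAgent, which encodes the `form` parameters into the query string for GET requests.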

We use Mojo::UserAgent to send the HTTP GET requests. The first step looks like this:


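```perl
use strict;
use warnings;
use Mojo::UserAgent;

my $query   = "schizophrenia AND clinical trial";
# NCBI serves E-utilities over HTTPS (Mojo::UserAgent needs IO::Socket::SSL for TLS).
my $baseurl = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/";
my $ua = Mojo::UserAgent->new();

# retmax => 1 because only the counters are needed here; usehistory => 'y'
# keeps the result set on the NCBI history server for the efetch step.
my $tx = $ua->get($baseurl . "esearch.fcgi",
                  form => { db => 'pubmed', retmax => 1,
                            usehistory => 'y', term => $query });
if (my $err = $tx->error) {
    die "Failed connection: $err->{message}\n";
}
my $results = $tx->res->body;

# A list-context match leaves the variable undefined when the tag is missing.
my ($num_abstracts) = $results =~ /<Count>(\d+)<\/Count>/;
my ($query_key)     = $results =~ /<QueryKey>(\d+)<\/QueryKey>/;
my ($web_env)       = $results =~ /<WebEnv>(.*?)<\/WebEnv>/;
die "Unexpected esearch response\n"
    unless defined $num_abstracts && defined $query_key && defined $web_env;
```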
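
The extraction of the three values can also be tried offline, which makes the regular expressions easier to check. The following is a minimal sketch under stated assumptions: `parse_esearch` is a hypothetical helper, and the sample XML is an abbreviated, made-up response that only mimics the shape of a real esearch result.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated esearch response used only for illustration.
my $sample = <<'XML';
<eSearchResult>
  <Count>1234</Count>
  <RetMax>1</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV_TOKEN</WebEnv>
  <IdList><Id>12345678</Id></IdList>
</eSearchResult>
XML

# Hypothetical helper: return a hash reference with Count, QueryKey and
# WebEnv, or die if any of them is absent.
sub parse_esearch {
    my ($xml) = @_;
    my %found;
    for my $tag (qw(Count QueryKey WebEnv)) {
        # A list-context match returns an empty list on failure, so a
        # missing tag yields undef instead of a stale $1 from an earlier match.
        my ($value) = $xml =~ m{<$tag>(.*?)</$tag>}s;
        die "esearch response has no <$tag> element\n" unless defined $value;
        $found{$tag} = $value;
    }
    return \%found;
}

my $info = parse_esearch($sample);
printf "count=%d query_key=%s web_env=%s\n",
    @{$info}{qw(Count QueryKey WebEnv)};
```

Failing loudly on a missing element is a design choice for this sketch; without it, an error page from the server could silently produce an empty result set.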

After getting the parameters, run the second step like this:

```perl
my $retmax = 500; # batch size; efetch accepts at most 10,000 records per request
for (my $retstart = 0; $retstart < $num_abstracts; $retstart += $retmax) {
    # retstart is the zero-based offset of the first record in this batch;
    # retmode => 'xml' makes explicit that the XML output is wanted.
    $tx = $ua->get($baseurl . "efetch.fcgi",
          form => { db => "pubmed", WebEnv => $web_env,
                    query_key => $query_key, rettype => 'abstract',
                    retmode => 'xml',
                    retstart => $retstart, retmax => $retmax });
    if (my $err = $tx->error) {
        warn "Failed connection at offset $retstart: $err->{message}\n";
        next;
    }
    my $text = $tx->res->text;
    # Split the batch into one string per <PubmedArticle> record.
    my @contents = $text =~ m{(<PubmedArticle>.*?</PubmedArticle>)}gs;
    for my $content (@contents) {
        # do more things here
    }
}
```
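
The inner loop above is left as a placeholder. One way to fill it is sketched here: pull the PMID, title, and abstract out of each record with regular expressions, then keep only the records whose abstract matches a filter pattern. The sample XML is a made-up, heavily abbreviated stand-in for efetch output, `plain_text` and `extract_record` are hypothetical helpers, and the filter pattern is only an example. The `//` operator requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical, heavily abbreviated efetch XML with two records.
my $text = <<'XML';
<PubmedArticleSet>
<PubmedArticle><MedlineCitation><PMID Version="1">11111111</PMID>
<Article><ArticleTitle>Olanzapine in <i>first-episode</i> schizophrenia</ArticleTitle>
<Abstract><AbstractText>A randomized, double-blind trial of olanzapine &amp; placebo.</AbstractText></Abstract>
</Article></MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation><PMID Version="1">22222222</PMID>
<Article><ArticleTitle>Cognitive therapy for schizophrenia</ArticleTitle>
<Abstract><AbstractText Label="METHODS">An open-label pilot study.</AbstractText></Abstract>
</Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>
XML

# Strip inline markup and decode the five predefined XML entities.
sub plain_text {
    my ($s) = @_;
    $s =~ s/<[^>]+>//g;
    $s =~ s/&lt;/</g;
    $s =~ s/&gt;/>/g;
    $s =~ s/&quot;/"/g;
    $s =~ s/&apos;/'/g;
    $s =~ s/&amp;/&/g;   # last, so "&amp;lt;" becomes "&lt;" rather than "<"
    return $s;
}

# Pull the fields of interest out of one <PubmedArticle> chunk.
sub extract_record {
    my ($content) = @_;
    my ($pmid)  = $content =~ m{<PMID[^>]*>(\d+)</PMID>};
    my ($title) = $content =~ m{<ArticleTitle>(.*?)</ArticleTitle>}s;
    # Structured abstracts have several <AbstractText> sections; join them.
    my @parts   = $content =~ m{<AbstractText[^>]*>(.*?)</AbstractText>}gs;
    return {
        pmid     => $pmid,
        title    => plain_text($title // ''),
        abstract => plain_text(join ' ', @parts),
    };
}

# Example filter only: abstracts that mention a randomized or double-blind design.
my $filter = qr/\b(?:randomi[sz]ed|double-blind)\b/i;

my @contents = $text =~ m{(<PubmedArticle>.*?</PubmedArticle>)}gs;
my @hits = grep { $_->{abstract} =~ $filter }
           map  { extract_record($_) } @contents;

print "$_->{pmid}\t$_->{title}\n" for @hits;
```

Inside the batch loop, the same `extract_record` and `grep` steps would run on each batch's `@contents`, so memory use stays bounded by the batch size.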

I tried **Mojo::DOM**, but it was too slow for this volume of data, so I switched to regular expressions. Although regular expressions are not a real parsing tool, their speed is satisfactory. I also tried BioPerl, but it failed to work, so I rewrote the query myself. You may try E-Utilities and work out your own scripts.

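About the Author

ByteBard (位元詩人) holds a master's degree in an information-related field and likes to solve all kinds of problems with open-source technologies. These technologies are cross-platform, highly reusable, and long-lived.

Besides open-source technologies, ByteBard likes Japanese food and black coffee, knows some Japanese, and sometimes travels independently.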