Tech Notes: Querying PubMed with Regular Expressions

The Entrez Programming Utilities (E-Utilities) are an interface to the Entrez query and database system at the National Center for Biotechnology Information (NCBI). With E-Utilities, we can write programs that query NCBI databases. Because the interface is a fixed URL syntax, we can query those databases from any programming language. Here we demonstrate a Perl script that queries PubMed and filters the results with regular expressions.

The program is hosted on GitHub; if you are interested, you can follow the script there. It was inspired by a post by cfrenz, where you can also find the original script.

Basically, there are two steps:

1. Query PubMed with esearch to get the essential parameters: the query key, the WebEnv string, and the total result count.
2. Fetch the records from PubMed in batches with efetch.

Here we use Mojo::UserAgent to do the HTTP GET action. The first step is like this:


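```perl
use strict;
use warnings;
use Mojo::UserAgent;

my $query   = "schizophrenia AND clinical trial";
# NCBI serves E-utilities over HTTPS (Mojo::UserAgent needs IO::Socket::SSL for it).
my $baseurl = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/";
my $ua      = Mojo::UserAgent->new;

# usehistory => 'y' keeps the hits on the history server; retmax => 1 keeps the reply small.
my $tx = $ua->get($baseurl . "esearch.fcgi",
                  form => { db => 'pubmed', retmax => 1,
                            usehistory => 'y', term => $query });
if (my $err = $tx->error) {
    die "Failed connection: $err->{message}\n";
}
my $results = $tx->res->body;

# Capture in list context so a failed match yields undef instead of a stale $1.
my ($num_abstracts) = $results =~ /<Count>(\d+)<\/Count>/;
my ($query_key)     = $results =~ /<QueryKey>(\d+)<\/QueryKey>/;
my ($web_env)       = $results =~ /<WebEnv>(.*?)<\/WebEnv>/;
die "Unexpected esearch response\n"
    unless defined $num_abstracts && defined $query_key && defined $web_env;
```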

After getting these parameters, run the second step like this:

```perl
my $retmax = 500; # batch mode. $retmax should be less than 10,000.
# Use "<" so no empty request is sent when the count is a multiple of $retmax.
for (my $retstart = 0; $retstart < $num_abstracts; $retstart += $retmax) {
    # E-utilities names the offset parameter "retstart"; retmode => 'xml'
    # asks for the XML output that the regex below expects.
    $tx = $ua->get($baseurl . "efetch.fcgi",
          form => { db => "pubmed", WebEnv => $web_env,
                    query_key => $query_key, rettype => 'abstract',
                    retmode => 'xml',
                    retstart => $retstart, retmax => $retmax });
    if (my $err = $tx->error) {
        warn "Failed connection: $err->{message}\n";
        next;
    }
    my $text = $tx->res->text;
    # Each record is wrapped in <PubmedArticle>...</PubmedArticle>.
    my @contents = $text =~ m{(<PubmedArticle>.*?</PubmedArticle>)}gs;
    for my $content (@contents) {
        # do more things here
    }
}
```
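The inner loop above leaves the actual filtering open. As a minimal sketch of what it might contain, the following pulls a few fields out of each record with regular expressions and keeps only the records whose abstract matches a pattern. The helper names `extract_fields` and `filter_articles` are hypothetical and are not taken from the original script, and the two sample records are made up for illustration, not real PubMed data. The tag names `PMID`, `ArticleTitle` and `AbstractText` are the ones used in PubMed XML.

```perl
use strict;
use warnings;

# Pull a few fields out of one <PubmedArticle> record with regular expressions.
# Returns a hash reference; pmid and title stay undefined if they do not match.
sub extract_fields {
    my ($content) = @_;
    # The first <PMID> in a record is taken as the article's own identifier.
    my ($pmid)  = $content =~ m{<PMID[^>]*>(\d+)</PMID>};
    my ($title) = $content =~ m{<ArticleTitle[^>]*>(.*?)</ArticleTitle>}s;
    # An abstract may be split into several <AbstractText> sections.
    my @parts    = $content =~ m{<AbstractText[^>]*>(.*?)</AbstractText>}gs;
    my $abstract = join ' ', @parts;
    # Drop inline markup such as <i> or <sup> left inside the captured text.
    for ($title, $abstract) {
        next unless defined;
        s/<[^>]+>//g;
    }
    return { pmid => $pmid, title => $title, abstract => $abstract };
}

# Keep only the records whose abstract matches the given pattern.
sub filter_articles {
    my ($pattern, @contents) = @_;
    return grep { $_->{abstract} =~ $pattern }
           map  { extract_fields($_) } @contents;
}

# Made-up sample records standing in for the @contents array of the fetch loop.
my @contents = (
    '<PubmedArticle><MedlineCitation><PMID Version="1">1</PMID>'
  . '<Article><ArticleTitle>A <i>toy</i> trial</ArticleTitle>'
  . '<Abstract><AbstractText Label="METHODS">A randomized, '
  . 'double-blind design.</AbstractText></Abstract>'
  . '</Article></MedlineCitation></PubmedArticle>',
    '<PubmedArticle><MedlineCitation><PMID Version="1">2</PMID>'
  . '<Article><ArticleTitle>A toy review</ArticleTitle>'
  . '<Abstract><AbstractText>A narrative overview.</AbstractText></Abstract>'
  . '</Article></MedlineCitation></PubmedArticle>',
);

my @hits = filter_articles(qr/double-blind/i, @contents);
printf "%s\t%s\n", $_->{pmid}, $_->{title} for @hits;
```

In the fetch loop, the same call could be made once per batch, with the hits accumulated in an array declared outside the loop. Regular expressions like these assume the tags are not nested inside themselves, which is the usual trade-off for their speed.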

I tried **Mojo::DOM**, but it was too slow for this volume of data, so I switched to regular expressions. Although regular expressions are not a real parsing tool, their speed is satisfactory. I also tried BioPerl, but it failed to work, so I wrote the query myself. You may want to try E-Utilities and work out your own scripts.

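About the Author

Holding a master's degree in an information-related field, ByteBard (位元詩人) believes that the purpose of developing applications is to bring value to society. If the software can also become a sustainable project along the way, both developers and users win.

ByteBard likes to solve all kinds of problems with open-source technology, but does not rule out proprietary technology when necessary. In spare time, ByteBard writes up newly learned material as articles and shares them on this website.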