Using Python to check Google Scholar search results

After transferring gene identifiers from one assembly to another, I wanted to know whether the previously published genes had changed in the new genome. First I need to know which genes have been studied and published. Google Scholar is a good place to find out. Although there is no official API, some scripts are available for various purposes.

I have a list of S. mansoni genes and would like to know whether each has been mentioned in any publication, so a simple “yes or no” answer is sufficient. When you run a search on Google Scholar, the results page contains a sentence like “4 results (0.02 sec)”; if there are no results, it says “Your search - XXX - did not match any articles.” So we can use Python to download the whole HTML page and extract those sentences. A very brief script looks like this:

#! /usr/bin/env python

# usage: python gscholar-results.py Smp_070360 | grep -o ' not match \| result'
# note the leading space before "not match" and "result": it keeps grep from matching those words inside page elements

from urllib import FancyURLopener
import sys
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'
openurl = MyOpener().open
query = sys.argv[1]
url = 'https://scholar.google.com/scholar?q=' + query + '&btnG=&hl=en&as_sdt=0%2C5'
content = openurl(url).read()

print(content[72010:151255]) # character range containing those sentences

You get a brief summary like this:

Smp_019790: result
Smp_070360: result
Smp_123450: not match
Smp_170040: not match
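
The script above builds the URL by plain string concatenation, which is fine for identifiers such as Smp_070360 but would break for a query containing spaces or quotes. Here is a minimal sketch of escaping the query first, assuming Python 3 (the scripts in this post are Python 2). The helper name build_scholar_url is made up for illustration, and the extra parameters are simply copied from the URL used above:

```python
from urllib.parse import urlencode

def build_scholar_url(query):
    """Return a Google Scholar search URL for `query` (hypothetical helper)."""
    # Parameter names and values are copied from the URL in the post;
    # urlencode percent-escapes spaces, quotes and commas in the values.
    params = {"q": query, "btnG": "", "hl": "en", "as_sdt": "0,5"}
    return "https://scholar.google.com/scholar?" + urlencode(params)
```

Calling build_scholar_url("Smp_070360") should give the same URL as the concatenation in the script.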

A modified version that sets the User-Agent through a request header and runs the regexp inside the script:

#! /usr/bin/env python

# usage: python gscholar-results.py Smp_156980

import sys
import re
import urllib2

user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }
query = sys.argv[1]
url = 'https://scholar.google.com/scholar?q=' + query + '&btnG=&hl=en&as_sdt=0%2C5'
req = urllib2.Request(url, None, headers)
response = urllib2.urlopen(req)
p_content = response.read()
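response.close()  # it's always safe to close an open connection

# character range expected to contain the result-count sentence
snippet = p_content[72010:151255]  # avoid naming a variable "str": it shadows the builtin
pat = re.compile(r"\d*\sresult\w?\s")

# search once and reuse the match object
match = pat.search(snippet)
if match is not None:
    print(match.group())
else:
    print("No match")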
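
The fixed character range [72010:151255] only works as long as the page keeps roughly the same layout. Here is a rough sketch of searching the whole page instead, assuming the two sentences quoted earlier (“N results” and “did not match any articles”) still appear in the HTML and that the page has already been decoded to text. The function name and the "Unknown" fallback are my own additions, not part of the scripts above:

```python
import re

# Matches "4 results" or "1 result". Commas are allowed in case large counts
# are written like 1,230 (an assumption). The \b anchors keep the pattern
# from firing inside longer tokens such as "result_x".
RESULT_PAT = re.compile(r"\b\d[\d,]*\s+results?\b")
NO_MATCH_PAT = re.compile(r"did not match any articles")

def classify_page(html):
    """Return the 'N results' text, 'No match', or 'Unknown' (hypothetical helper)."""
    if NO_MATCH_PAT.search(html):
        return "No match"
    match = RESULT_PAT.search(html)
    if match:
        return match.group(0)
    # Neither sentence found, e.g. an error page or a blocked request.
    return "Unknown"
```

Returning "Unknown" separately from "No match" makes it easier to notice when the page was not a normal results page at all.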

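With each gene's output saved to its own .txt file (named after the gene, as the output below suggests), grep the content: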

grep -r . ./*.txt | sed 's/\.\///'| sed 's/\.txt:/: /'

This shows the number of results:

Smp_019790: 4 results
Smp_070360: 1 result
Smp_123450: No match
Smp_170040: No match
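
The same summary can be produced without grep and sed. Here is a small sketch, assuming (as the grep command implies) that each gene's output was saved to a file named after the gene with a .txt extension, all in one directory. The function name summarize is made up:

```python
import glob
import os

def summarize(directory="."):
    """Return 'gene: text' lines for each non-blank line of every .txt file."""
    lines = []
    for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        # File name without directory and extension, e.g. Smp_019790
        gene = os.path.splitext(os.path.basename(path))[0]
        with open(path) as handle:
            for line in handle:
                text = line.strip()
                if text:  # roughly like `grep .`: skip blank lines
                    lines.append("%s: %s" % (gene, text))
    return lines
```

Printing "\n".join(summarize()) in the directory holding the files gives one line per gene.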

Note: Sending frequent queries (e.g., in a loop) to their server will get your IP blocked.
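
One way to lower that risk is to pause between queries. Here is a minimal sketch, assuming a caller-supplied fetch function (hypothetical, e.g. a wrapper around one of the scripts above) and arbitrary delay values. Nothing here guarantees that the requests will not be blocked anyway:

```python
import random
import time

def query_slowly(genes, fetch, min_delay=30.0, max_delay=90.0, sleep=time.sleep):
    """Call fetch(gene) for each gene, pausing a random time between calls."""
    results = {}
    for index, gene in enumerate(genes):
        if index > 0:
            # Random pause so requests are not sent at a fixed rhythm;
            # the 30-90 s default range is a guess, not a documented limit.
            sleep(random.uniform(min_delay, max_delay))
        results[gene] = fetch(gene)
    return results
```

The sleep parameter is only there so the pause can be replaced by a stub when trying the function out.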

Furthermore, if you want to parse the results, there is a nice script available: scholar.py, which requires BeautifulSoup to be installed first. Weirdly, for the same task performed above I always got “Results 0”, although the publications were listed after it.

python scholar.py --txt-globals -s "Smp_070360"

Until that is fixed, you can get the total number by piping the output through “grep Title | wc -l”.