Batch-Downloading Papers with a Python Crawler

zi_hu

Edited on 2022-03-22 16:42


This tutorial follows the video of the same title (06:02).

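Get a bib-format file from an academic database (for example Scopus, Web of Science, or Google Scholar)
Inspect the sci-hub site to find the request format and the site URL (open the browser developer tools with F12)
Write a function that imports the bib file and outputs its fields in a uniform format
Write a function that gets the paper's download link (search either by title or by DOI)
Write the PDF download function and the main function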

Detailed steps:

1. Get the bib file, using Scopus as the example:

https://www.scopus.com/

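(1) Search by keyword to get the list of papers

(2) Set the search scope, select all results, and click export to BibTeX

(3) Choose the fields to export, select the bib format, and click Export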

2. Inspect the sci-hub site to find the request format and the site URL (open the browser developer tools with F12)

https://sci-hub.se/

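(1) Search by title

(2) In the Network panel, watch the requests. Click Response to confirm that the correct page came back, then click Headers to read the request method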
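
What the Network panel shows as form data can be reproduced offline. The following is a minimal sketch using only the standard library. The field names are the ones that the search function in this tutorial sends, and the DOI value is a made-up placeholder.

```python
from urllib.parse import urlencode

# Form fields as they appear in the request payload of the Network panel.
# The DOI below is a placeholder, not a real paper.
form = {"sci-hub-plugin-check": "", "request": "10.1000/example"}

# Body in application/x-www-form-urlencoded format, as an HTTP client sends it.
body = urlencode(form)
print(body)

# The body length changes with the query, so Content-Length is best left
# for the HTTP client to compute instead of being hard-coded.
print(len(body.encode("ascii")))
```

Comparing this output with the payload shown in the browser is a quick way to confirm that the field names were copied correctly.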

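3. Write a function that imports the bib file and outputs its fields in a uniform format

(1) Preview of the bib file format

Note: the field keywords in a bib file can differ depending on which site exported it. The file used here was downloaded from Scopus; for other sites, adjust the regular expressions in "into_bib" accordingly.

(2) Reference code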

import re


def into_bib(file_tix_in):
    """
    Import the bib file and output the paper information
    ----------------------
    Input: bib file address
    ----------------------
    Output: [authors, years, DOIs, titles], one list per field
    """
    # Join all lines into one string, marking each line break with '-'
    # so that every field can be located by the prefix '-field={'.
    with open(file_tix_in, mode='r', encoding='utf-8') as file:
        lines = "".join(line.replace('\n', '-') for line in file)
    lines = re.sub(r'(\s \s)', ' ', lines)
    # First author's surname (text up to the first comma or whitespace)
    pattern_author = re.compile(r'-author={([A-Z -][^\s,]+)', re.I)
    pattern_year = re.compile(r'-year={([0-9]+)')
    # '\-' keeps the hyphen literal so DOIs containing '-' are not cut short
    pattern_doi = re.compile(r'-doi={(?!})(?!{)([a-zA-Z0-9 /.\-()]+)')
    pattern_title = re.compile(r'-title={(?!})(?!{)([a-zA-Z0-9 \-:\s \']+)')
    match_author = pattern_author.findall(lines)
    match_year = pattern_year.findall(lines)
    match_doi = pattern_doi.findall(lines)
    match_title = pattern_title.findall(lines)
    match = [match_author, match_year, match_doi, match_title]
    return match
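
Because "into_bib" returns four independent lists, an entry without a DOI shifts every later DOI by one position. The following is a minimal sketch, assuming Scopus-style one-line "field={value}" entries, that parses each entry on its own so a missing field stays with its record. The helper "parse_bib_entries" and the sample entries are hypothetical and not part of the original script.

```python
import re

# Hypothetical helper: one dict per BibTeX entry, so a missing field
# stays attached to its own record instead of shifting a parallel list.
FIELD_RE = re.compile(r'^\s*(\w+)\s*=\s*\{+(.*?)\}+,?\s*$', re.MULTILINE)


def parse_bib_entries(text):
    """Split BibTeX text on '@TYPE{key,' headers and collect one-line fields."""
    entries = []
    # Element 0 of the split is whatever precedes the first entry header.
    for chunk in re.split(r'^@\w+\s*\{[^,]*,', text, flags=re.MULTILINE)[1:]:
        fields = {name.lower(): value.strip()
                  for name, value in FIELD_RE.findall(chunk)}
        entries.append(fields)
    return entries


# Made-up sample data for illustration only.
sample = """@ARTICLE{Smith2020,
author={Smith, J. and Doe, A.},
title={An example title},
year={2020},
doi={10.1000/example-1},
}

@ARTICLE{Lee2021,
author={Lee, K.},
title={Entry without a DOI},
year={2021},
}
"""

records = parse_bib_entries(sample)
missing = [r["title"] for r in records if "doi" not in r]
print(len(records), missing)
```

With records in this shape, entries that lack a DOI can be listed or skipped before any download starts, instead of stopping the whole run.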

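4. Write the function that gets the paper's download link

Reference code

Note: the code that extracts the download link depends on the page structure of the site and must be updated whenever the site changes.

If v_1 does not work, try v_2.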

############################ v_1 ############################
import re

import requests
from bs4 import BeautifulSoup


def search_paper(artName):
    """
    Search papers
    ---------------
    Input: the title or DOI of the paper
    ---------------
    Output: the PDF link, or None if no result was found
    """
    url = 'https://www.sci-hub.ren/'
    # url = 'https://click.endnote.com/'
    # Content-Length is omitted: requests computes it from the form data
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
               'Accept-Encoding': 'gzip, deflate, br',
               'Content-Type': 'application/x-www-form-urlencoded',
               'Origin': 'https://www.sci-hub.ren',
               'Connection': 'keep-alive',
               'Upgrade-Insecure-Requests': '1'}
    data = {'sci-hub-plugin-check': '',
            'request': artName}
    res = requests.post(url, headers=headers, data=data, timeout=30)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    try:
        # The download button lives inside the element with id="buttons"
        iframe = soup.find(id='buttons')
        tem_out = iframe.contents
        downUrl_out = re.findall(r'href=\'([^"]*)', str(tem_out))
        downUrl_out = url + downUrl_out[0]
    except (AttributeError, IndexError):
        # No "buttons" element or no link in it: treat as not found
        return None
    return downUrl_out
############################ v_2 ############################
import requests
from lxml import etree


def search_paper(artName):
    """
    Search papers
    ---------------
    Input: the title or DOI of the paper
    ---------------
    Output: the PDF link, or None if no result was found
    """
    url = 'https://www.sci-hub.ren/'
    # url = 'https://click.endnote.com/'
    # Content-Length is omitted: requests computes it from the form data
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
               'Accept-Encoding': 'gzip, deflate, br',
               'Content-Type': 'application/x-www-form-urlencoded',
               'Origin': 'https://www.sci-hub.ren',
               'Connection': 'keep-alive',
               'Upgrade-Insecure-Requests': '1'}
    data = {'sci-hub-plugin-check': '',
            'request': artName}
    res = requests.post(url, headers=headers, data=data, timeout=30)
    html = res.text
    tree = etree.HTML(html)
    try:
        # The button's onclick attribute holds the link between single quotes
        onclick = tree.xpath("//*[@id='buttons']/button/@onclick")
        url_d = 'https://sci-hub.se/'
        downUrl_out = url_d + onclick[0].split("'")[1]
    except (AttributeError, IndexError):
        # No button or no quoted link: treat as not found
        return None
    return downUrl_out
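
Joining strings, as in "url + path", only gives a valid address when the extracted link is a site-relative path. The following is a sketch, assuming the link may also be protocol-relative or absolute, that resolves it with the standard library's "urljoin" instead. The helper "resolve_link" and the example.org addresses are hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(base_url, onclick_value):
    """Take the first single-quoted target in an onclick string and resolve it."""
    parts = onclick_value.split("'")
    if len(parts) < 3:
        return None  # no quoted target found
    return urljoin(base_url, parts[1])


base = "https://example.org/"
# Site-relative path: joined onto the base host.
print(resolve_link(base, "location.href='/downloads/a.pdf?download=true'"))
# Protocol-relative link: only the scheme is taken from the base.
print(resolve_link(base, "location.href='//cdn.example.org/a.pdf'"))
# No quoted target at all.
print(resolve_link(base, "no quotes here"))
```

The same call also leaves fully qualified links unchanged, so one code path covers all three link styles.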

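5. Write the PDF download function and the main function

Note: when downloading by DOI, the bib file must not contain any reference without a DOI. file_tix is the path of the bib file, and save_tix is the folder where the papers are saved. Each file is named with the author plus the year.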
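
Since each file is named author plus year, two papers by the same first author in the same year overwrite each other, and a character such as "/" in the name breaks the path. The following is a sketch of one way to build the file name. The helper "safe_pdf_path" is hypothetical and not part of the original script.

```python
import os
import re


def safe_pdf_path(folder, author, year, taken):
    """Return a unique path like folder/AuthorYear.pdf, adding _2, _3... on clashes."""
    # Keep word characters and '-'; replace every other run with '_'.
    stem = re.sub(r"[^\w-]+", "_", author + year).strip("_") or "paper"
    name, n = stem, 1
    while name in taken:
        n += 1
        name = "{}_{}".format(stem, n)
    taken.add(name)
    return os.path.join(folder, name + ".pdf")


used = set()
print(safe_pdf_path("paper", "Smith", "2020", used))
print(safe_pdf_path("paper", "Smith", "2020", used))
print(safe_pdf_path("paper", "O'Neil/Jr", "2021", used))
```

The set passed as "taken" only tracks names from the current run; checking for files already on disk would need an extra "os.path.exists" test.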

Reference code

(1) Download the PDF

import requests


def download_paper(downUrl_in):
    """
    Download the paper according to the paper link
    ----------------------
    Input: paper link
    ----------------------
    Output: PDF file content as bytes
    """
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
               'Accept-Encoding': 'gzip, deflate, br',
               'Connection': 'keep-alive',
               'Upgrade-Insecure-Requests': '1'}
    # timeout keeps one stalled download from blocking the whole batch
    res = requests.get(downUrl_in, headers=headers, timeout=60)
    return res.content
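
"download_paper" returns whatever bytes the server sends, so an HTML error page would be saved with a .pdf extension. The following is a small sketch of a check on the file signature before writing to disk. The helper "looks_like_pdf" is hypothetical.

```python
def looks_like_pdf(data):
    """Return True if the bytes begin with the PDF signature '%PDF-'."""
    # Only the first bytes are inspected; leading whitespace is tolerated.
    return data[:1024].lstrip().startswith(b"%PDF-")


print(looks_like_pdf(b"%PDF-1.7\n..."))
print(looks_like_pdf(b"<html><body>Not found</body></html>"))
```

In the main loop, a download that fails this check could be added to the list of papers that were not found instead of being written to disk.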

(2) Main function

import os
import sys
import time

if __name__ == '__main__':
    # bib file address
    file_tix = os.path.join("bib", "scopus.bib")
    # File storage folder
    save_tix = "paper"
    os.makedirs(save_tix, exist_ok=True)
    # find_way = 2: search by DOI. find_way = 3: search by title
    find_way = 3
    paper_find = into_bib(file_tix)
    print("Bib contains {num} papers".format(num=len(paper_find[1])))
    # '==' compares values; 'is' tests object identity and is wrong for numbers
    if find_way == 2 and len(paper_find[1]) != len(paper_find[2]):
        print("Records contain missing DOI records. Please select the search method as paper title search, "
              "or complete (delete) the missing DOI records")
        sys.exit()
    # 1-based numbers of the papers that could not be found
    download_code = []
    for tix_num in range(len(paper_find[1])):
        print('NO.{num} Searching...'.format(num=tix_num + 1))
        downUrl = search_paper(paper_find[find_way][tix_num])
        if downUrl is None:
            print('NO.{num} Not found!'.format(num=tix_num + 1))
            download_code.append(tix_num + 1)
        else:
            print('NO.{num} Paper link:{paper_link}'.format(
                num=tix_num + 1, paper_link=downUrl))
            print('Downloading...')
            pdf = download_paper(downUrl)
            # File name: first author's surname plus year
            paper_name = paper_find[0][tix_num] + paper_find[1][tix_num]
            with open(os.path.join(save_tix, paper_name + '.pdf'), 'wb') as f:
                f.write(pdf)
            print('---Download complete---')
        # Pause between requests
        time.sleep(0.8)
    print("The papers records not found are NO.{num}".format(
        num=download_code))
    print("The titles of the papers that were not found are:")
    for ii in range(len(download_code)):
        print("NO.{num}: {title}".format(
            num=download_code[ii], title=paper_find[3][download_code[ii] - 1]))

6. Complete reference code

Reference code

# -*- coding: utf-8 -*-
"""
The program is used to download papers in batch.
The input data is BibTex file.
The program has two search methods:
1. search according to the title of the paper
2. search according to the DOI number of the paper
September 2022/02/15 python 3.6
"""
import time
import re
import requests
from lxml import etree
import sys
import os


def search_paper(artName):
    """
    Search papers
    ---------------
    Input: the title or DOI of the paper
    ---------------
    Output: the PDF link, or None if no result was found
    """
    url = 'https://www.sci-hub.ren/'
    # url = 'https://click.endnote.com/'
    # Content-Length is omitted: requests computes it from the form data
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
               'Accept-Encoding': 'gzip, deflate, br',
               'Content-Type': 'application/x-www-form-urlencoded',
               'Origin': 'https://www.sci-hub.ren',
               'Connection': 'keep-alive',
               'Upgrade-Insecure-Requests': '1'}
    data = {'sci-hub-plugin-check': '',
            'request': artName}
    res = requests.post(url, headers=headers, data=data, timeout=30)
    html = res.text
    tree = etree.HTML(html)
    try:
        # The button's onclick attribute holds the link between single quotes
        onclick = tree.xpath("//*[@id='buttons']/button/@onclick")
        url_d = 'https://sci-hub.se/'
        downUrl_out = url_d + onclick[0].split("'")[1]
    except (AttributeError, IndexError):
        # No button or no quoted link: treat as not found
        return None
    return downUrl_out


def download_paper(downUrl_in):
    """
    Download the paper according to the paper link
    ----------------------
    Input: paper link
    ----------------------
    Output: PDF file content as bytes
    """
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
               'Accept-Encoding': 'gzip, deflate, br',
               'Connection': 'keep-alive',
               'Upgrade-Insecure-Requests': '1'}
    # timeout keeps one stalled download from blocking the whole batch
    res = requests.get(downUrl_in, headers=headers, timeout=60)
    return res.content


def into_bib(file_tix_in):
    """
    Import the bib file and output the paper information
    ----------------------
    Input: bib file address
    ----------------------
    Output: [authors, years, DOIs, titles], one list per field
    """
    # Join all lines into one string, marking each line break with '-'
    # so that every field can be located by the prefix '-field={'.
    with open(file_tix_in, mode='r', encoding='utf-8') as file:
        lines = "".join(line.replace('\n', '-') for line in file)
    lines = re.sub(r'(\s \s)', ' ', lines)

    # '\-' keeps the hyphen literal so DOIs containing '-' are not cut short
    pattern_author = re.compile(r'-author={([A-Z -][^\s,]+)', re.I)
    pattern_year = re.compile(r'-year={([0-9]+)')
    pattern_doi = re.compile(r'-doi={(?!})(?!{)([a-zA-Z0-9 /.\-()]+)')
    pattern_title = re.compile(r'-title={(?!})(?!{)([a-zA-Z0-9 \-:\s \']+)')

    # Alternative patterns for bib files whose fields look like 'Year = {{2020}}'
    # pattern_author = re.compile(r'-Author = {([A-Z -][^\s,]+)', re.I)
    # pattern_year = re.compile(r'-Year = {{([0-9]+)')
    # pattern_doi = re.compile(r'-DOI = {{(?!})(?!{)([a-zA-Z0-9 /.\-()]+)')
    # pattern_title = re.compile(r'-Title = {{(?!})(?!{)([a-zA-Z0-9 \-:\s \']+)')

    match_author = pattern_author.findall(lines)
    match_year = pattern_year.findall(lines)
    match_doi = pattern_doi.findall(lines)
    match_title = pattern_title.findall(lines)
    match = [match_author, match_year, match_doi, match_title]
    return match


if __name__ == '__main__':
    # bib file address
    file_tix = os.path.join("bib", "scopus.bib")
    # File storage folder
    save_tix = "paper"
    os.makedirs(save_tix, exist_ok=True)
    # find_way = 2: search by DOI. find_way = 3: search by title
    find_way = 2
    paper_find = into_bib(file_tix)
    print("Bib contains {num} papers".format(num=len(paper_find[1])))
    # '==' compares values; 'is' tests object identity and is wrong for numbers
    if find_way == 2 and len(paper_find[1]) != len(paper_find[2]):
        print("Records contain missing DOI records. Please select the search method as paper title search, "
              "or complete (delete) the missing DOI records")
        # Stop here: with a DOI missing, the DOI list no longer lines up with the other lists
        sys.exit()
    # 1-based numbers of the papers that could not be found
    download_code = []
    for tix_num in range(len(paper_find[1])):
        print('NO.{num} Searching...'.format(num=tix_num + 1))
        downUrl = search_paper(paper_find[find_way][tix_num])
        if downUrl is None:
            print('NO.{num} Not found!'.format(num=tix_num + 1))
            download_code.append(tix_num + 1)
        else:
            print('NO.{num} Paper link:{paper_link}'.format(
                num=tix_num + 1, paper_link=downUrl))
            print('Downloading...')
            pdf = download_paper(downUrl)
            # File name: first author's surname plus year
            paper_name = paper_find[0][tix_num] + paper_find[1][tix_num]
            with open(os.path.join(save_tix, paper_name + '.pdf'), 'wb') as f:
                f.write(pdf)
            print('---Download complete---')
        # Pause between requests
        time.sleep(0.8)
    print("The papers records not found are NO.{num}".format(
        num=download_code))
    print("The titles of the papers that were not found are:")
    for ii in range(len(download_code)):
        print("NO.{num}: {title}".format(
            num=download_code[ii], title=paper_find[3][download_code[ii] - 1]))

Tags: crawler, python, sci-hub, batch paper download

This article is my original work. Reproduction without permission is prohibited.
