Three Ways to Extract Data from a Web Page
*Using the page-download function built earlier, fetch the HTML of the target page. We use https://guojiadiqu.bmcx.com/AFG__guojiayudiqu/ as the example.
    from get_html import download

    url = 'https://guojiadiqu.bmcx.com/AFG__guojiayudiqu/'
    page_content = download(url)
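The `get_html` module is not shown in this section. As a rough sketch only, a `download` helper of this kind might look like the following, assuming the standard-library `urllib`; the `user_agent` and `num_retries` parameters and the retry policy are illustrative assumptions, not the original implementation.

```python
import urllib.request
from urllib.error import HTTPError, URLError


def download(url, user_agent="Mozilla/5.0", num_retries=2):
    """Return the page at `url` as text, or None if the download fails.

    Hypothetical stand-in for get_html.download; parameters are assumptions.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except HTTPError as err:
        # Retry only on 5xx server errors, which are often transient.
        if num_retries > 0 and 500 <= err.code < 600:
            return download(url, user_agent, num_retries - 1)
        print("Download error:", err.reason)
    except URLError as err:
        print("Download error:", err.reason)
    return None
```

Returning `None` on failure means callers should check the result before passing it to a parser.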
*Suppose we want to scrape the country name and the country overview from this page. We implement the extraction with each of the three methods in turn.
1. Regular expressions
    from get_html import download
    import re

    url = 'https://guojiadiqu.bmcx.com/AFG__guojiayudiqu/'
    page_content = download(url)
    # re.findall returns a list, so index into it to get the single match
    country = re.findall(r'class="h2dabiaoti">(.*?)</h2>', page_content)
    survey_data = re.findall(r'<tr><td bgcolor="#FFFFFF" id="wzneirong">(.*?)</td></tr>', page_content)
    # Skip any leading whitespace inside each <p> before capturing its text
    survey_info_list = re.findall(r'<p>\s*(.*?)</p>', survey_data[0])
    survey_info = ''.join(survey_info_list)
    print(country[0], survey_info)
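The regex approach indexes into the `findall` results directly, which raises `IndexError` if the page layout changes. The sketch below shows the same extraction with a guard, run against an inline snippet so it works offline. `SAMPLE_HTML` is a hypothetical fragment shaped like the patterns above, and `extract_with_regex` is an illustrative name; the real page may differ.

```python
import re

# Hypothetical fragment modeled on the patterns used above; not the real page.
SAMPLE_HTML = (
    '<div id="main_content"><h2 class="h2dabiaoti">Afghanistan</h2>'
    '<table><tr><td bgcolor="#FFFFFF" id="wzneirong">'
    '<p>First paragraph.</p><p>Second paragraph.</p>'
    '</td></tr></table></div>'
)


def extract_with_regex(html):
    """Return (country, overview), or None when either pattern fails to match."""
    country = re.findall(r'class="h2dabiaoti">(.*?)</h2>', html)
    # re.S lets '.' match newlines, in case the cell spans several lines.
    block = re.findall(r'id="wzneirong">(.*?)</td>', html, re.S)
    if not country or not block:
        return None
    paragraphs = re.findall(r'<p>\s*(.*?)</p>', block[0], re.S)
    return country[0], ''.join(paragraphs)


print(extract_with_regex(SAMPLE_HTML))
# ('Afghanistan', 'First paragraph.Second paragraph.')
```

Patterns like these are tied to the exact markup, so even a small change to attribute order or spacing on the site can break them.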
2. BeautifulSoup (bs4)
    from get_html import download
    from bs4 import BeautifulSoup

    url = 'https://guojiadiqu.bmcx.com/AFG__guojiayudiqu/'
    html = download(url)
    # Create the BeautifulSoup object
    soup = BeautifulSoup(html, "html.parser")
    # Search by attribute
    country = soup.find(attrs={'class': 'h2dabiaoti'}).text
    survey_info = soup.find(attrs={'id': 'wzneirong'}).text
    print(country, survey_info)
3. lxml
    from get_html import download
    from lxml import etree  # builds the parse tree

    url = 'https://guojiadiqu.bmcx.com/AFG__guojiayudiqu/'
    page_content = download(url)
    selector = etree.HTML(page_content)  # the parsed tree supports XPath queries
    country_select = selector.xpath('//*[@id="main_content"]/h2')  # returns a list
    for country in country_select:
        print(country.text)
    survey_select = selector.xpath('//*[@id="wzneirong"]/p')
    for survey_content in survey_select:
        print(survey_content.text, end='')
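One detail worth keeping in mind with the lxml version: an element's `.text` holds only the text before its first child element, so a paragraph containing inline tags would be cut short. The sketch below illustrates this with the standard-library `xml.etree.ElementTree`, used here as a stand-in so the example runs without lxml installed; lxml elements expose the same `.text` attribute and `itertext()` method. The sample paragraph is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up paragraph with an inline child element.
paragraph = ET.fromstring('<p>Kabul is the <b>capital</b> city.</p>')

# .text stops at the first child element.
text_only = paragraph.text
# itertext() walks the element and all of its descendants in document order.
full_text = ''.join(paragraph.itertext())

print(repr(text_only))  # 'Kabul is the '
print(repr(full_text))  # 'Kabul is the capital city.'
```

If the overview paragraphs on the target page contain links or bold text, joining `itertext()` is one way to keep the full sentence.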
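Output: [screenshot of the run result not reproduced here]
Finally, the book "Web Scraping with Python" compares the performance of the three methods. [Figure: performance comparison of the three methods, not reproduced here]
The comparison is for reference only.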
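To get comparable numbers on your own machine, a timing harness along the following lines can be used. This is a minimal sketch, not the benchmark from the book: `SAMPLE_HTML` is a hypothetical fragment, the scraper function names and `NUM_ITERATIONS` are illustrative, and bs4 and lxml are timed only if they are installed.

```python
import re
import timeit

# Hypothetical fragment standing in for the downloaded page.
SAMPLE_HTML = '<div id="main_content"><h2 class="h2dabiaoti">Afghanistan</h2></div>'


def regex_scraper(html):
    return re.findall(r'class="h2dabiaoti">(.*?)</h2>', html)[0]


scrapers = {"regex": regex_scraper}

try:
    from bs4 import BeautifulSoup

    def bs4_scraper(html):
        soup = BeautifulSoup(html, "html.parser")
        return soup.find(attrs={"class": "h2dabiaoti"}).text

    scrapers["bs4"] = bs4_scraper
except ImportError:
    pass  # bs4 is not installed, so it is left out of the comparison

try:
    from lxml import etree

    def lxml_scraper(html):
        return etree.HTML(html).xpath('//h2[@class="h2dabiaoti"]')[0].text

    scrapers["lxml"] = lxml_scraper
except ImportError:
    pass  # lxml is not installed, so it is left out of the comparison

NUM_ITERATIONS = 1000
timings = {}
for name, scraper in scrapers.items():
    # Each run includes parsing, since every scraper starts from the raw HTML.
    timings[name] = timeit.timeit(lambda: scraper(SAMPLE_HTML), number=NUM_ITERATIONS)

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:6s} {seconds:.4f} s for {NUM_ITERATIONS} runs")
```

Results on a tiny fragment will not match those on a full page, and the `re` module caches compiled patterns between calls, so treat the numbers as a rough indication only.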