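As a long-time novel reader, I've noticed that many Chinese novel sites are built the same way (the various 🖊*閣 sites and so on): most of them answer a GET request with plain HTML, and the chapter list sits in the telltale <dl> and <dd> tags.
So the basic idea is simple: send a GET request to the site, clean up the returned content, and write it to a txt file that can be copied to a phone for reading.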
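I fell into a trap here once. After reading a few pages I saw that a chapter URL looks like https://www.xxx.com/<novel id>/<chapter id>.html, and for the first few chapters the chapter ids were consecutive. So my first plan was to remember the starting chapter id and simply increment it in a loop. That turned out to be careless: after 100 chapters or so, gaps appear in the chapter ids. I honestly don't know why. With a fixed novel id you would think each novel is more or less one data table, so shouldn't the chapter id just be an auto-increment id? If anyone knows, please enlighten me!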
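To make the problem concrete, here is a minimal sketch with made-up chapter ids (the values and both helper functions are hypothetical, not taken from any real site). Once there is a gap, "start id + n" no longer points at chapter n, while a lookup in the scraped table of contents still does.

```python
# Hypothetical chapter ids as they might appear in a table of contents.
# Note the jump after the third chapter: the ids are not consecutive.
chapter_ids = [3798160, 3798161, 3798162, 3798170, 3798171]


def guess_by_increment(start_id, chapter_no):
    """Naive approach: assume the id grows by exactly 1 per chapter."""
    return start_id + (chapter_no - 1)


def lookup_in_catalogue(ids, chapter_no):
    """Safer approach: take the id from the scraped list (1-based index)."""
    return ids[chapter_no - 1]


for no in range(1, len(chapter_ids) + 1):
    guessed = guess_by_increment(chapter_ids[0], no)
    actual = lookup_in_catalogue(chapter_ids, no)
    status = "ok" if guessed == actual else "WRONG"
    print(no, guessed, actual, status)
```

The first three chapters match, and everything after the gap is off, which is why the rest of this walkthrough builds the chapter list first.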
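So the first step is to fetch the novel's table of contents and clean it into a list that is easy to look things up in later. This goes in the file getList.py:
Define a function that requests the table-of-contents page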
# Request the table-of-contents page
def req():
    url = "https://www.24kwx.com/book/4/4020/"
    strHtml = requests.get(url)
    return strHtml.text
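From the returned content, extract three fields per chapter: id (the chapter id used in the URL), name (the chapter title), and key (the chapter number, i.e. the X in "Chapter X")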
# Define a chapter object
class Xs(object):
    def __init__(self, id, key, name):
        self._id = id
        self._key = key
        self._name = name

    # Each property has to return its value, otherwise it evaluates to None
    @property
    def id(self):
        return self._id

    @property
    def key(self):
        return self._key

    @property
    def name(self):
        return self._name

    def getString(self):
        return 'id:%s,name:%s,key:%s' % (self._id, self._name, self._key)

# Convert the page into a chapter list
def tranceList():
    key = 0
    xsList = []
    idrule = r'/4020/(.+?)\.html'
    keyrule = r'第(.+?)章'
    html = req()
    # Keep only the part after the second </dt> and before </dl>
    html = re.split("</dt>", html)[2]
    html = re.split("</dl>", html)[0]
    htmlList = re.split("</dd>", html)
    for i in htmlList:
        i = i.strip()
        if i:
            # Get the chapter id from the link
            id = re.findall(idrule, i)[0]
            lsKeyList = re.findall(keyrule, i)
            # If the title contains a chapter number, use it;
            # otherwise continue counting from the previous chapter
            if len(lsKeyList) > 0:
                key = int(lsKeyList[0])
            else:
                key = key + 1
            # Get the chapter name
            name = re.findall(r'\.html">(.+?)</a>', i)[0]
            xsobj = Xs(id, key, name)
            xsList.append(xsobj.getString())
    writeList(xsList)
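As an alternative to splitting the page on closing tags, the same extraction can be sketched with the standard library's `html.parser`. This is not the method used in this article, just a swapped-in technique for comparison. The class name `CatalogueParser` and the sample markup are assumptions made for illustration. The sketch assumes, like the code above, that the full chapter list follows the second `<dt>`.

```python
from html.parser import HTMLParser


class CatalogueParser(HTMLParser):
    """Collect (href, title) pairs from <dd><a> links after the second <dt>."""

    def __init__(self):
        super().__init__()
        self.dt_count = 0    # number of <dt> headings seen so far
        self.in_dd = False   # currently inside a <dd> element
        self.href = None     # href of the link being read, if any
        self._text = []      # text fragments of the current link
        self.chapters = []   # collected (href, title) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self.dt_count += 1
        elif tag == "dd":
            self.in_dd = True
        elif tag == "a" and self.in_dd and self.dt_count >= 2:
            self.href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self.href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.chapters.append((self.href, "".join(self._text).strip()))
            self.href = None
        elif tag == "dd":
            self.in_dd = False


# Hypothetical markup shaped like the page the code above expects.
sample = """
<dl>
  <dt>Latest chapters</dt>
  <dd><a href="/book/4/4020/3798164.html">Chapter 5</a></dd>
  <dt>Full text</dt>
  <dd><a href="/book/4/4020/3798160.html">Chapter 1</a></dd>
  <dd><a href="/book/4/4020/3798161.html">Chapter 2</a></dd>
</dl>
"""

parser = CatalogueParser()
parser.feed(sample)
print(parser.chapters)
```

A parser like this is less sensitive to whitespace and attribute order than string splitting, at the cost of a few more lines.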
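A note from me: if you are coming to Python from another language, writing your first object can be a bit confusing, because an object here is an instance of a class. The object I create is just {id, key, name}, but to write it to the txt file you still have to call getString. Thinking about it afterwards, I could simply have built a string like {id:xxx,name:xxx,key:xxx} and skipped the class entirely. In the end I kept it so that you all have something to look at.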
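If you do want an object but with less boilerplate, a `dataclass` from the standard library is one option. The sketch below is only an illustration: the name `Chapter` and the method `to_line` are made up here and are not part of the article's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chapter:
    id: str    # chapter id taken from the URL
    key: int   # chapter number
    name: str  # chapter title

    def to_line(self) -> str:
        """Serialize to the same 'id:...,name:...,key:...' format as getString."""
        return f"id:{self.id},name:{self.name},key:{self.key}"


chapter = Chapter(id="3798160", key=1, name="Chapter One")
print(chapter.to_line())
```

The decorator generates `__init__`, `__repr__`, and `__eq__`, so the three hand-written properties are no longer needed.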
Finally, write it to a txt file
# Write the list to a text file
def writeList(list):
    # write() only accepts a string, so join the list first; otherwise:
    # TypeError: write() argument must be str, not list
    with open("xsList.txt", 'w', encoding='utf-8') as f:
        f.write('\n'.join(list))
    print('Written successfully')

# The resulting txt looks roughly like this:
# id:3798160,name:第1章 孫子,我是你爺爺,key:1
# id:3798161,name:第2章 孫子,等等我!,key:2
# id:3798162,name:第3章 天上掉下個親爺爺,key:3
# id:3798163,name:第4章 超級大客戶,key:4
# id:3798164,name:第5章 一張退婚證明,key:5
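One thing to watch with this line format: if a chapter title ever contains an ASCII comma, splitting the line on "," will cut the title in half. A minimal sketch of a more forgiving reader follows. The helper `parse_line` and the sample line are hypothetical. It anchors on the fixed `id:`, `,name:` and `,key:` markers instead of splitting.

```python
import re

# id is everything up to the first comma; key is the trailing number;
# name is whatever sits in between, commas included.
LINE_RE = re.compile(r"^id:(?P<id>[^,]+),name:(?P<name>.*),key:(?P<key>\d+)$")


def parse_line(line):
    """Parse one 'id:...,name:...,key:...' line into a dict."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unexpected line format: %r" % line)
    return {
        "id": match.group("id"),
        "name": match.group("name"),
        "key": int(match.group("key")),
    }


# Made-up line whose title contains an ASCII comma.
print(parse_line("id:3798161,name:Hello, world,key:2\n"))
```

Writing the list as JSON or CSV would avoid the issue altogether, but the regular expression keeps the existing file format unchanged.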
ok ! Last one
The table of contents is now saved, so the next step is to read the novel's content. Same approach:
Start with the request
# Request a chapter page
def req(id):
    url = "https://www.24kwx.com/book/4/4020/" + id + ".html"
    strHtml = requests.get(url)
    return strHtml.text
Read the table of contents we just saved
def getList():
    # Read line by line; the result is a list with one entry per line
    with open("xsList.txt", 'r', encoding='utf-8') as f:
        line = f.readlines()
    return line
Define the rules for cleaning the data
# contextRule is defined but not used below
contextRule = r'<div class="content">(.+?)<script>downByJs();</script>'
titleRule = r'<h1>(.+?)</h1>'

def getcontext(objstr):
    # objstr is one line of xsList.txt: id:...,name:...,key:...
    xsobj = re.split(",", objstr)
    id = re.split("id:", xsobj[0])[1]
    name = re.split("name:", xsobj[1])[1]
    html = req(id)
    # Prefer the <h1> title on the page, fall back to the name from the list
    lstitle = re.findall(titleRule, html)
    title = lstitle[0] if len(lstitle) > 0 else name
    # Cut out the content div, drop &nbsp; and line breaks, then split on <br />
    context = re.split('<div id="content" class="showtxt">', html)[1]
    context = re.split('</div>', context)[0]
    context = re.sub('&nbsp;|\r|\n', '', context)
    textList = re.split('<br />', context)
    textList.insert(0, title)
    for item in textList:
        writeTxt(item)
    print('%s -- written successfully' % (title))
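The cleaning step can be tried on its own without touching the network. The sketch below is an assumption-laden variant, not the article's exact code: the function `clean_content` and the sample string are made up, and it uses `html.unescape` so that entities other than `&nbsp;` are decoded too.

```python
import html
import re


def clean_content(raw):
    """Turn the inner HTML of the content div into a list of text lines."""
    lines = []
    # Accept <br>, <br/> and <br /> as paragraph separators.
    for part in re.split(r"<br\s*/?>", raw):
        text = html.unescape(part)        # &nbsp; -> \xa0, &amp; -> &, ...
        text = text.replace("\xa0", " ")  # non-breaking space -> plain space
        text = text.strip()               # drops \r, \n and padding
        if text:
            lines.append(text)
    return lines


# Hypothetical fragment shaped like a chapter body.
sample = "&nbsp;&nbsp;First paragraph.<br /><br />\r\n&nbsp;&nbsp;Second &amp; last.<br/>"
print(clean_content(sample))
```

Returning a list keeps the function free of file I/O, so it is easy to check against a saved page before running the full crawl.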
Then write it to the file
def writeTxt(txt):
    # Append one non-empty line to nr.txt
    if txt:
        with open("nr.txt", 'a', encoding="utf-8") as f:
            f.write(txt + '\n')
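Opening the file once per line works, but it reopens nr.txt for every paragraph. A possible variant, sketched here with a hypothetical helper `write_lines` and a temporary demo file, writes a whole chapter with a single open.

```python
import os
import tempfile


def write_lines(path, lines, mode="a"):
    """Write the non-empty lines to path, one per line; return how many were written."""
    written = 0
    with open(path, mode, encoding="utf-8") as f:
        for line in lines:
            if line:
                f.write(line + "\n")
                written += 1
    return written


# Demo against a throwaway file instead of the real nr.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "nr_demo.txt")
count = write_lines(demo_path, ["Title", "", "First line", "Second line"], mode="w")
with open(demo_path, encoding="utf-8") as f:
    saved = f.read().splitlines()
print(count, saved)
```

The `with` block also guarantees the file is closed even if a write fails partway through.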
And finally, of course, tie it all together
def getTxt():
    # Default settings
    startNum = 1261  # first chapter to fetch
    endNum = 1300    # last chapter to fetch
    # Main program: start by emptying nr.txt
    with open("nr.txt", 'w', encoding='utf-8') as f:
        f.write("")
    if endNum < startNum:
        print('The end number must not be smaller than the start number')
        return
    allList = getList()
    # Chapter numbers are 1-based, list indexes are 0-based
    needList = allList[startNum-1:endNum]
    for item in needList:
        getcontext(item)
        time.sleep(0.2)
    print("All chapters fetched")
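The slice `allList[startNum-1:endNum]` is the part that is easiest to get wrong by one. Here is a small sketch of that selection step in isolation. The function `select_chapters` and the sample list are hypothetical, and raising `ValueError` instead of printing is a design choice of this sketch, not of the original code.

```python
def select_chapters(all_lines, start_num, end_num):
    """Return chapters start_num..end_num (1-based, both inclusive)."""
    if start_num < 1:
        raise ValueError("start_num must be at least 1")
    if end_num < start_num:
        raise ValueError("end_num must not be smaller than start_num")
    # Slicing silently stops at the end of the list if end_num is too large.
    return all_lines[start_num - 1:end_num]


catalogue = ["ch1", "ch2", "ch3", "ch4", "ch5"]  # stand-in for the lines of xsList.txt
print(select_chapters(catalogue, 2, 4))
print(select_chapters(catalogue, 4, 10))
```

Because slicing never raises on an out-of-range end, asking for more chapters than exist simply returns whatever is available.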
getList.py
import requests
import re

# Request the table-of-contents page
def req():
    url = "https://www.24kwx.com/book/4/4020/"
    strHtml = requests.get(url)
    return strHtml.text

# Define a chapter object
class Xs(object):
    def __init__(self, id, key, name):
        self._id = id
        self._key = key
        self._name = name

    # Each property has to return its value, otherwise it evaluates to None
    @property
    def id(self):
        return self._id

    @property
    def key(self):
        return self._key

    @property
    def name(self):
        return self._name

    def getString(self):
        return 'id:%s,name:%s,key:%s' % (self._id, self._name, self._key)

# Convert the page into a chapter list
def tranceList():
    key = 0
    xsList = []
    idrule = r'/4020/(.+?)\.html'
    keyrule = r'第(.+?)章'
    html = req()
    # Keep only the part after the second </dt> and before </dl>
    html = re.split("</dt>", html)[2]
    html = re.split("</dl>", html)[0]
    htmlList = re.split("</dd>", html)
    for i in htmlList:
        i = i.strip()
        if i:
            # Get the chapter id from the link
            id = re.findall(idrule, i)[0]
            lsKeyList = re.findall(keyrule, i)
            # If the title contains a chapter number, use it;
            # otherwise continue counting from the previous chapter
            if len(lsKeyList) > 0:
                key = int(lsKeyList[0])
            else:
                key = key + 1
            # Get the chapter name
            name = re.findall(r'\.html">(.+?)</a>', i)[0]
            xsobj = Xs(id, key, name)
            xsList.append(xsobj.getString())
    writeList(xsList)

# Write the list to a text file
def writeList(list):
    # write() only accepts a string, so join the list first; otherwise:
    # TypeError: write() argument must be str, not list
    with open("xsList.txt", 'w', encoding='utf-8') as f:
        f.write('\n'.join(list))
    print('Written successfully')

def main():
    tranceList()

if __name__ == '__main__':
    main()
writeTxt.py
import requests
import re
import time

# Request a chapter page
def req(id):
    url = "https://www.24kwx.com/book/4/4020/" + id + ".html"
    strHtml = requests.get(url)
    return strHtml.text

def getList():
    # Read line by line
    with open("xsList.txt", 'r', encoding='utf-8') as f:
        line = f.readlines()
    return line

# contextRule is defined but not used below
contextRule = r'<div class="content">(.+?)<script>downByJs();</script>'
titleRule = r'<h1>(.+?)</h1>'

def getcontext(objstr):
    # objstr is one line of xsList.txt: id:...,name:...,key:...
    xsobj = re.split(",", objstr)
    id = re.split("id:", xsobj[0])[1]
    name = re.split("name:", xsobj[1])[1]
    html = req(id)
    # Prefer the <h1> title on the page, fall back to the name from the list
    lstitle = re.findall(titleRule, html)
    title = lstitle[0] if len(lstitle) > 0 else name
    # Cut out the content div, drop &nbsp; and line breaks, then split on <br />
    context = re.split('<div id="content" class="showtxt">', html)[1]
    context = re.split('</div>', context)[0]
    context = re.sub('&nbsp;|\r|\n', '', context)
    textList = re.split('<br />', context)
    textList.insert(0, title)
    for item in textList:
        writeTxt(item)
    print('%s -- written successfully' % (title))

def writeTxt(txt):
    # Append one non-empty line to nr.txt
    if txt:
        with open("nr.txt", 'a', encoding="utf-8") as f:
            f.write(txt + '\n')

def getTxt():
    # Default settings
    startNum = 1261  # first chapter to fetch
    endNum = 1300    # last chapter to fetch
    # Main program: start by emptying nr.txt
    with open("nr.txt", 'w', encoding='utf-8') as f:
        f.write("")
    if endNum < startNum:
        print('The end number must not be smaller than the start number')
        return
    allList = getList()
    # Chapter numbers are 1-based, list indexes are 0-based
    needList = allList[startNum-1:endNum]
    for item in needList:
        getcontext(item)
        time.sleep(0.2)
    print("All chapters fetched")

def main():
    getTxt()

if __name__ == "__main__":
    main()