Python Web Scraping: Crawling Multiple Pages of a Website

qq

Published: 2019-04-25 08:45:45 Views: 296
*Please credit write-bug.com when reposting.

Scraping URLs from the 7k7k mini-game site

# -*- coding: utf-8 -*-
"""
Created on Sun Mar 24 10:04:58 2019
@author: pry
"""
from urllib.parse import urljoin

import requests
from lxml import etree


def parse_page():
    count = 0  # number of URLs written so far
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3642.0 Safari/537.36'
    }
    # Open the output file once, in append mode, instead of once per link.
    with open('7k7k_urls.txt', 'a+', encoding='utf-8') as file:
        for page in range(1, 5):  # listing pages 1 to 4
            url_i = 'http://www.7k7k.com/flash_fl/461_' + str(page) + '.htm'
            response_i = requests.get(url_i, headers=headers, timeout=10)
            selector = etree.HTML(response_i.text, parser=etree.HTMLParser(encoding='utf-8'))
            print(url_i)
            # Collect the href attribute of every <a> element on the page.
            content = selector.xpath('//a/@href')
            for href in content:
                # Skip empty hrefs and hrefs starting with "j" (javascript: pseudo-links).
                if not href or href.startswith('j'):
                    continue
                # Resolve site-relative links against the page URL.
                if href.startswith('/'):
                    href = urljoin(url_i, href)
                file.write(href + '\n')
                print(href)
                count += 1
    print(count)
    print('ok')


if __name__ == '__main__':
    parse_page()
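
Many hrefs on a listing page are relative (they start with `/` or are bare file names), so they have to be resolved against the page they came from before they are usable on their own. The following is a minimal sketch of that step using the standard library's `urllib.parse.urljoin`. The helper names `page_urls` and `resolve` are hypothetical, and the sample hrefs are made up for illustration rather than taken from the site.

```python
from urllib.parse import urljoin

# URL pattern taken from the listing above; {} is the page number.
BASE = "http://www.7k7k.com/flash_fl/461_{}.htm"


def page_urls(first, last):
    """Return the listing-page URLs for pages first..last (inclusive)."""
    return [BASE.format(n) for n in range(first, last + 1)]


def resolve(page_url, href):
    """Turn an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    page = page_urls(1, 4)[0]
    # Made-up sample hrefs: site-relative, page-relative, and already absolute.
    for href in ["/flash/1.htm", "list.htm", "http://example.com/a"]:
        print(resolve(page, href))
```

`urljoin` replaces the path of the page URL for hrefs that start with `/`, resolves bare file names against the page's directory, and leaves absolute URLs unchanged, which plain string concatenation does not do.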
Attachment: 7k7k_urls.txt (489.85 KB, 3 downloads)




YoungTime
2019-04-25 09:32:56
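Thanks for sharing, this is really easy to follow! The idea is: 1. fetch the content of http://www.7k7k.com/flash_fl/461_<page number>.htm; 2. extract the URL links from the returned page content and write them to a txt file; 3. repeat the steps above.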
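
Step 2 of the summary above (pulling the links out of a fetched page) can be tried without touching the network. The sketch below is an assumption-laden variant, not the author's code: it swaps lxml for the standard library's `html.parser`, and it additionally drops duplicate links, which the original listing does not do. `LinkCollector` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_links(html, page_url):
    """Return absolute, de-duplicated links from html, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    seen = set()
    links = []
    for href in collector.hrefs:
        # Skip javascript: pseudo-links, which are not real URLs.
        if href.lower().startswith("javascript:"):
            continue
        absolute = urljoin(page_url, href)
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


if __name__ == "__main__":
    # Made-up HTML fragment for illustration only.
    sample = (
        '<a href="/a.htm">A</a>'
        '<a href="javascript:void(0)">x</a>'
        '<a href="/a.htm">A again</a>'
    )
    print(extract_links(sample, "http://www.7k7k.com/flash_fl/461_1.htm"))
```

Keeping the extraction in a pure function like this makes it possible to check the filtering rules against a saved HTML fragment before running the crawler against the live site.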
