Project Summary (3): Scraping the Eastmoney finance site with Scrapy

  • Author: sdau20171754
  • Category: Web crawlers
  • Published: 2020-01-01 11:10:22
  • Reads: 190
  • Comments: 4

Project goal:

Scrape the titles and links of the economic commentary (经济时评) column on the Eastmoney finance site, then follow each link and scrape the full text of every article.

Project implementation:

1. First, create a Scrapy project:

scrapy startproject financeSpider
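
This creates the usual Scrapy project layout (items.py, pipelines.py, settings.py and a spiders/ package), which the following steps modify:

financeSpider/
    scrapy.cfg
    financeSpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py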

2. Modify items.py and add the required fields. There are three: title, link, and content.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class FinancespiderItem(scrapy.Item):
    title=scrapy.Field()
    link=scrapy.Field()
    content=scrapy.Field()

3. Run scrapy genspider finace finance.eastmoney.com to define a spider and restrict the domains it is allowed to crawl.
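
For reference, the generated skeleton looks roughly like this (the exact contents depend on the Scrapy version); the rest of this step fills it in:

# -*- coding: utf-8 -*-
import scrapy


class FinaceSpider(scrapy.Spider):
    name = 'finace'
    allowed_domains = ['finance.eastmoney.com']
    start_urls = ['http://finance.eastmoney.com/']

    def parse(self, response):
        pass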

Define the start URL and build the URL of every list page, yielding a Request for each one to the parse callback.

parse uses BeautifulSoup to extract the title and link of each article, stores them in an item, and passes the item along with the link to parse2, which fetches the content of each page.

parse2 extracts the body of the article, stores it in the item, and yields the item.

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup

from financeSpider.items import FinancespiderItem


class FinaceSpider(scrapy.Spider):
    name = 'finace'
    allowed_domains = ['finance.eastmoney.com']
    start_urls = ['http://finance.eastmoney.com/news/cjjsp_1.html']
    url_head='http://finance.eastmoney.com/news/cjjsp_'
    url_end='.html'
    def start_requests(self):
        # build the URLs of list pages 1-3 and schedule them for parsing
        for i in range(1,4):
            url=self.url_head+str(i)+self.url_end
            print('Current page:',url)
            yield scrapy.Request(url=url,callback=self.parse)

    def parse(self, response):
        soup=BeautifulSoup(response.text,'lxml')
        # every article on the list page is wrapped in a <p class="title"> element
        title_list=soup.find_all("p",class_="title")
        for i in range(len(title_list)):
            item=FinancespiderItem()
            title=title_list[i].a.text.strip()
            link=title_list[i].a["href"]
            item["title"]=title
            item["link"]=link
            # hand the partially filled item to parse2 via the request meta
            yield scrapy.Request(url=link,meta={'item':item},callback=self.parse2)

    def parse2(self,response):
        item=response.meta['item']
        soup=BeautifulSoup(response.text,'lxml')
        # the article body is the second child of the div.newsContent container
        content=soup.find("div",class_="newsContent").contents[1].text
        content=content.replace("\n"," ")
        item['content']=content
        yield item
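
As a side note, the same list-page extraction could also be written with Scrapy's built-in CSS selectors instead of BeautifulSoup. The sketch below is not part of the original project and assumes the same p.title markup used above:

    def parse(self, response):
        # equivalent extraction with Scrapy selectors (assumes <p class="title"><a href=...> markup)
        for p in response.css('p.title'):
            item = FinancespiderItem()
            item['title'] = p.css('a::text').get(default='').strip()
            item['link'] = p.css('a::attr(href)').get()
            yield scrapy.Request(url=item['link'], meta={'item': item}, callback=self.parse2)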

4. Modify the pipeline file to store the scraped data. MySQL is used here; the commented-out block stores the items to a text file instead.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


class FinancespiderPipeline(object):
    def __init__(self):
        # connect to the database
        self.connect = pymysql.connect(
            host='localhost',
            port=3306,
            db='finance',
            user='root',
            passwd='031116',
            charset='utf8',
            use_unicode=True)

        # execute inserts, queries and updates through a cursor
        self.cursor = self.connect.cursor()
    '''
    file_path="e:/finance.txt"
    def __init__(self):
        self.article=open(self.file_path,"a+",encoding="utf-8")
    def process_item(self, item, spider):
        title=item["title"]
        link=item["link"]
        content=item["content"]
        output=title+"\t"+link+"\n"+content+"\n\n"
        self.article.write(output)
        return item
    '''
    def process_item(self, item, spider):
        title=item["title"]
        link=item["link"]
        content=item["content"]
        try:
            # insert the record
            self.cursor.execute("insert into fina (title,link,content) values(%s, %s, %s)",(title,link,content))

            self.connect.commit()

        except Exception as error:
            # log the error and roll back the failed insert
            spider.logger.error(error)
            self.connect.rollback()

        return item
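
Two pieces of wiring are not shown above: the pipeline has to be registered in settings.py, and the fina table has to exist in the finance database. Minimal sketches of both follow; the column types in the table are assumptions, not taken from the original project.

# settings.py: register the pipeline so Scrapy calls process_item for every item
ITEM_PIPELINES = {
    'financeSpider.pipelines.FinancespiderPipeline': 300,
}

# one-time setup sketch: create the table the pipeline writes to
# (reuses the connection parameters above; the column types are assumptions)
import pymysql

conn = pymysql.connect(host='localhost', port=3306, db='finance',
                       user='root', passwd='031116', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS fina (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            link VARCHAR(512),
            content TEXT
        ) DEFAULT CHARSET=utf8
    """)
conn.commit()
conn.close()

With the pipeline enabled, the spider is started from the project root with scrapy crawl finace.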


 

 

