Web Crawlers | 乐趣区

On web crawlers: Why Ziroom rental prices are images (a Python crawler)

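A few days ago a friend wanted to scrape Ziroom (自如) rental prices around Wangjing, ran into a problem, and asked me to help analyze it.

1 Analysis

I figured I had done this kind of thing before, so how hard could it be? I opened a random listing page. Uh (⊙o⊙)… the price had been replaced with images. Previously there seemed to be a separate Ajax request that fetched the price information.

From the page content we can see:

① Although the price is made up of 4 <i> tags, they all share the same background image.
② The CSS shows that every price window has a fixed size (width: 20px; height: 30px;); only the offset into the image is set per digit.

That is not going to stop me. The plan:

- Request the web page
- Get the image information and the price offset information
- Slice the image and run recognition on the pieces
- Obtain the price data

I happen to be looking into CNN image recognition at the moment, and with digits this regular, a little training should certainly bring the recognition rate to 100%.

2 In practice

Let's get to it: find an entry point first, then fetch a batch of pages.

2.1 Fetching the raw pages

Go by subway station: pick Wangjing East on Line 15, fetch the room list, and handle pagination along the way. Sample code:

    # -*- coding: UTF-8 -*-
    import os
    import time
    import random

    import requests
    from lxml.etree import HTML

    __author__ = 'lpe234'

    index_url = 'https://www.ziroom.com/z/s100006-t201081/?isOpen=0'
    visited_index_urls = set()

    # Shared request headers (the same User-Agent is used for list and detail pages)
    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }


    def get_pages(start_url: str):
        """
        Listings near Wangjing East on subway Line 15: collect every detail-page URL on the list page
        :param start_url:
        :return:
        """
        # De-duplicate list pages
        if start_url in visited_index_urls:
            return
        visited_index_urls.add(start_url)
        resp = requests.get(start_url, headers=HEADERS)
        resp_content = resp.content.decode('utf-8')
        root = HTML(resp_content)
        # Parse the listings on the current page
        hrefs = root.xpath('//div[@class="Z_list-box"]/div/div[@class="pic-box"]/a/@href')
        for href in hrefs:
            if not href.startswith('http'):
                href = 'http:' + href.strip()
            print(href)
            parse_detail(href)
        # Follow the pagination links
        pages = root.xpath('//div[@class="Z_pages"]/a/@href')
        for page in pages:
            if not page.startswith('http'):
                page = 'http:' + page
            get_pages(page)


    def parse_detail(detail_url: str):
        """
        Visit a detail page and save it to disk
        :param detail_url:
        :return:
        """
        # Make sure the output directory exists before writing into it
        os.makedirs('pages', exist_ok=True)
        filename = 'pages/' + detail_url.split('/')[-1]
        # Skip pages that were already downloaded
        if os.path.exists(filename):
            return
        # Pause for a random 1-5 seconds
        time.sleep(random.randint(1, 5))
        resp = requests.get(detail_url, headers=HEADERS)
        resp_content = resp.content.decode('utf-8')
        with open(filename, 'wb') as page:
            page.write(resp_content.encode())


    if __name__ == '__main__':
        get_pages(start_url=index_url)

This quick run fetched the nearby listings, about 600 in total. ...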
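The analysis above notes that each digit is a fixed 20px by 30px window onto one shared background image, distinguished only by its offset. A minimal sketch of how those offsets could be turned into slot indexes and crop boxes follows. It assumes the offset appears as a `background-position` value in each tag's inline style and that the digits sit side by side in a single row; the helper names and the `sprite_digits` argument (the digits in sprite order, as produced by whatever recognition step is used) are hypothetical.

```python
import re

DIGIT_WIDTH = 20   # px, from the CSS quoted in the article
DIGIT_HEIGHT = 30  # px, from the CSS quoted in the article

# Captures the horizontal offset in e.g. "background-position: -60px 0"
_OFFSET_RE = re.compile(r"background-position\s*:\s*(-?\d+(?:\.\d+)?)px")


def slot_index(style):
    """Return which 20px-wide slot of the shared image a style string selects."""
    match = _OFFSET_RE.search(style)
    if match is None:
        raise ValueError("no background-position in %r" % style)
    offset = abs(float(match.group(1)))
    return round(offset / DIGIT_WIDTH)


def crop_box(index):
    """Return the (left, top, right, bottom) box for one slot (assumed single row)."""
    left = index * DIGIT_WIDTH
    return (left, 0, left + DIGIT_WIDTH, DIGIT_HEIGHT)


def decode_price(styles, sprite_digits):
    """Combine per-digit styles with the recognised sprite order into a price.

    sprite_digits is a hypothetical input: the digits in the order they
    appear in the background image, e.g. the output of the recognition step.
    """
    return int("".join(sprite_digits[slot_index(s)] for s in styles))
```

With the crop boxes in hand, each slot only has to be recognised once per background image; every price on the page is then a lookup by offset.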

On web crawlers: INFS5730 - Network Analysis

INFS5730 – Social Media and Enterprise 2.0 – 2021 T2
Individual Assignment - Social Network Analysis

Due on Friday 5pm of week 5 (2nd July 2021). This assignment is worth 10% of your overall course mark.

Data Collection and cleaning (worth 25% of the available marks)

In this assignment, you are required to collect comments and replies to comments from Sony’s PlayStation Blog. As depicted in the figure below, PlayStation users can comment and reply to each other's comments on Sony’s PlayStation Blog at https://blog.playstation.com/ ...

On web crawlers: The five steps of data analysis: data crawling

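Funds have been very popular lately, and many people who never invested or managed their money have started discussing and buying funds. By investment target, funds are divided into stock funds, bond funds, hybrid funds and money market funds. A stock fund, put plainly, means we hand our money to a fund company and let it buy stocks for us, since they are the professionals. So today we will look at which companies' stocks these fund companies like to buy. We will do it ourselves, pulling fund data from Tiantian Fund (天天基金网) for analysis; the address of the full code is given at the end of the article.

Technologies used

- Cloud database
- IP proxy pool
- Crawler
- Multithreading

Writing the crawler

First, look at some of the data on Tiantian Fund. Packet capture shows that:

- ./fundcode_search.js contains the codes of all funds.
- Given a fund code, fundgz.1234567.com.cn/js/ + fund code + .js returns the fund's real-time net value and estimate.
- Given a fund code, fundf10.eastmoney.com/FundArchivesDatas.aspx?type=jjcc&code= + fund code + &topline=10&year=2021&month=3 returns the stocks the fund held in the first quarter.

Because these endpoints have anti-scraping measures, repeated requests start to fail, so an IP proxy pool is needed to get around them. Setting one up is simple: just get the proxy_pool project running.

    # This function fetches a proxy from the pool
    def get_proxy():
        return requests.get("http://127.0.0.1:5010/get/").json()

With the IP proxy pool in place, we can start on the multithreaded crawling. With multiple threads, the order of reads and writes has to be considered. Here the fund codes are stored in a Python queue (queue.Queue); each thread takes fund codes from this queue and requests the data of the corresponding fund. Because reads from and writes to the queue are blocking, synchronized operations, no fund code is read twice or lost in the process.

    # Get all fund codes
    fund_code_list = get_fund_code()
    fund_len = len(fund_code_list)
    # Create a queue
    fund_code_queue = queue.Queue(fund_len)
    # Write the fund codes into the queue
    for i in range(fund_len):
        # fund_code_list[i] is itself a list; its element 0 holds the fund code
        fund_code_queue.put(fund_code_list[i][0])

Now write the code that fetches all fund codes.

    # Get all fund codes
    def get_fund_code():
        ...
        # Call the web endpoint
        req = requests.get("http://fund.eastmoney.com/js/fundcode_search.js", timeout=5, headers=header)
        # Parse the fund codes into a list
        ...
        return fund_code_list

Next, take fund codes from the queue and fetch both the fund details and the stocks the fund holds.

    # While the queue is not empty
    while not fund_code_queue.empty():
        # Read one fund code from the queue
        # The read is a blocking operation
        fund_code = fund_code_queue.get()
        ...
        try:
            # Use this fund code to request fund details and stock holdings
            ...

Fetching fund details

    # Request through a proxy
    req = requests.get(
        "http://fundgz.1234567.com.cn/js/" + str(fund_code) + ".js",
        proxies={"http": "http://{}".format(proxy)},
        timeout=3,
        headers=header,
    )
    # Parse the returned data
    ...

Fetching stock holdings ...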
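One detail of the queue loop above is worth a sketch. When several threads share a queue, checking `empty()` and then calling a blocking `get()` leaves a small window in which another thread can take the last item, so the caller waits forever. A common alternative is `get_nowait()` with `queue.Empty` as the stop signal. The version below is an illustration under assumptions, not the article's code: `drain_queue`, `handle` and `num_workers` are hypothetical names, and `handle` stands in for the per-fund requests.

```python
import queue
import threading


def drain_queue(codes, handle, num_workers=4):
    """Call handle(code) once for every code, spread over worker threads."""
    work = queue.Queue()
    for code in codes:
        work.put(code)

    def worker():
        while True:
            try:
                # Non-blocking read: raises queue.Empty once the work is gone
                code = work.get_nowait()
            except queue.Empty:
                return
            try:
                handle(code)
            finally:
                # Mark the item as processed even if handle() raised
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Each code is handed to exactly one worker because `queue.Queue` does its own locking; the workers simply exit when nothing is left.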

On web crawlers: Explaining EEEN 60184 Communications

EEEN 60184
Internet and Communications Networks
Laboratory Assignment: Overview, Tasks 1, 2, 3 and 4

1.0 Introduction

The laboratory work is concerned with simulating various aspects of wireless sensor networks. The basic simulations use the Zuniga and Krishnamachari link model, but higher layers of the protocol stack are also considered.

There are 4 parts to the laboratory work:

Task 1: Modelling a network link in a pre-specified environment.
Task 2: Modelling the connectivity of a network and investigating factors that affect topology.
Task 3: Calculating routes between nodes and the base station.
Task 4: Discussing the implications of the results of Tasks 1-3 for Transport Layer communication.

1.1 Outcome

You will be expected to write a short report that presents and discusses your results. The format of the report, along with the marking scheme, will be specified in a subsequent document. However, there will be a limit of 5 pages on the body of the report. This excludes the front page, contents list, references and appendices.

The deadline for uploading the report to Blackboard will be 9.00 am on Monday 1st April 2019. Further details will be given shortly.

1.2 Software

You will be provided with a number of MATLAB programs to use to obtain results. These will require minor modifications to complete some of the tasks, but the amount of programming needed is small. However, if you are a confident programmer and have reasons for making more extensive changes, then you are free to do so.

1.3 Laboratory Sessions

These will take place on Wednesday afternoons (20th and 27th March) from 14.00-15.00 in the Barnes Wallis cluster and attendance is optional. The purpose of the sessions is to enable you to ask questions about the assignment and the software that is provided.
You are expected to complete the assignment work during periods when there are no lectures or tutorials.

2.0 Task 1

This part of the assignment is concerned with utilising the analytical model of Zuniga and Krishnamachari [1] (introduced in lecture 17) to predict the extent of the connected, transitional and disconnected zones associated with transmissions by wireless sensor nodes.

2.1 Background

As discussed in lecture 17, the well-known model of a fixed (circular) radius radio range, with a sharp cut-off between positions where a packet can and cannot be received, has been found to be unrealistic. Experimental data indicates the existence of 3 distinct reception regions surrounding a transmitter. The first, where packet reception rates (PRR) are high (> 90%), is known as the connected region. In the next zone the link is unreliable and packet reception rates vary widely (between 90% and 10%); this is known as the transitional region. Finally, there exists a region where packet reception is essentially negligible (less than 10%), and this is known as the disconnected region.

The model of [1] is derived using two models and is a combination of empirical and analytical approaches. The models are a channel model, based on the log normal law, which defines the received signal power as a function of distance from the transmitter, and a radio model that predicts PRR as a function of signal-to-noise ratio (SNR) for a specific combination of line coding and modulation schemes. This latter model is manipulated until it can be written in terms of parameters that are known or can be measured/estimated for a particular radio transceiver.

The log normal law (channel model) is:

$$P_r(d)_{\mathrm{dB}} = P_{t,\mathrm{dB}} - PL(d)_{\mathrm{dB}}$$
$$PL(d)_{\mathrm{dB}} = PL(d_0)_{\mathrm{dB}} + 10\,\eta\,\log_{10}\!\left(\frac{d}{d_0}\right) + X_{\mathrm{dB}} \qquad (1)$$

where Pr is the received signal power at distance d from the transmitter, Pt is the transmitted power, PL is path loss, and d0 is a reference distance at which path loss PL(d0) is measured. In this assignment d0 is 1 m.
X is a zero-mean Gaussian random variable whose variance is σ², which represents fading. η is the path loss exponent. See equation (1) in [1].

The key elements of the radio model are as follows: γ is the SNR at distance d from the transmitter, Pn is an estimate of noise, PRR is the packet reception ratio (as a fraction), BER is the bit error rate at an SNR of γ (in dB) and f is the size of a packet in bytes. See equation (2) in reference [1].

Now PRR is usually written as a function of Eb/N0 (energy received per bit/noise power density). These are difficult parameters to estimate, and so the following familiar relationship is used:

$$\frac{E_b}{N_0} = \mathrm{SNR}\,\frac{B}{R} \qquad (3)$$

to write BER and PRR as functions of bandwidth B and data rate R, which are known or can be estimated for a particular radio. The functions f and g depend on line coding and modulation schemes. See Table V in [1].

2.2 What to do

A key element of the method is the plot shown in figure 1. This shows the curves from the two models superimposed upon one another, which together enable the extent of the three zones discussed above to be defined. The three curves derived from the log normal model show the mean, and the upper and lower bounds, of received signal strength at a receiver placed at a distance d from the transmitter. The lines marked γL+Pn and γU+Pn represent the received signal strength corresponding to PRRs of 0.1 and 0.9 respectively. The region defined by the intersections of the curves from the two models defines the size of the transition region, as explained in [1].

The plot in figure 1 was obtained using a specific set of parameters in both the channel and radio models, namely:

%channel model constants
PATH_LOSS_EXPONENT = 4.0;
D0 = 1.0;
PATH_LOSS_D0 = 55.0; %in dBm
SHADOWING_VARIANCE = 4.0;
TX_POWER = 0; %in dBm
...
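As a rough illustration, the channel model constants listed above can be plugged into the log normal law to trace the mean received-power curve. The sketch below is in Python rather than the MATLAB supplied for the assignment, and rests on two assumptions that the text does not state: the upper and lower bounds are taken as the mean plus and minus k standard deviations with k = 2, and the function names are hypothetical. The radio model thresholds are not reproduced here, so any threshold passed to `distance_for_power` is a placeholder value.

```python
import math

# Channel model constants as listed in the assignment text
PATH_LOSS_EXPONENT = 4.0
D0 = 1.0                  # reference distance in metres
PATH_LOSS_D0 = 55.0       # path loss at D0
SHADOWING_VARIANCE = 4.0  # sigma squared
TX_POWER = 0.0            # dBm


def mean_rx_power(d):
    """Mean received power (dBm) at distance d, i.e. the law with X = 0."""
    path_loss = PATH_LOSS_D0 + 10.0 * PATH_LOSS_EXPONENT * math.log10(d / D0)
    return TX_POWER - path_loss


def rx_power_bounds(d, k=2.0):
    """Lower and upper bounds as mean -/+ k*sigma (k = 2 is an assumption)."""
    sigma = math.sqrt(SHADOWING_VARIANCE)
    mean = mean_rx_power(d)
    return mean - k * sigma, mean + k * sigma


def distance_for_power(power_dbm):
    """Distance at which the mean curve crosses a given power level (dBm)."""
    exponent = (TX_POWER - PATH_LOSS_D0 - power_dbm) / (10.0 * PATH_LOSS_EXPONENT)
    return D0 * 10.0 ** exponent
```

With a path loss exponent of 4, every tenfold increase in distance costs 40 dB, so the mean curve falls from -55 dBm at 1 m to -95 dBm at 10 m. Intersecting the bound curves with the two radio-model thresholds would then give the edges of the transitional region.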