o ½w[eÐ$ã@s\dZddlZddlZddlZdgZe dd¡ZGdd„dƒZGdd„dƒZ Gd d „d ƒZ dS) a% robotparser.py Copyright (C) 2000 Bastian Kleineidam You can choose between two licenses when using this package: 1) GNU GPLv2 2) PSF license for Python 2.2 The robots.txt Exclusion Protocol is implemented as specified in http://www.robotstxt.org/norobots-rfc.txt éNÚRobotFileParserÚ RequestRatezrequests secondsc@sreZdZdZddd„Zdd„Zdd„Zd d „Zd d „Zd d„Z dd„Z dd„Z dd„Z dd„Z dd„Zdd„ZdS)rzs This class provides a set of methods to read, parse and answer questions about a single robots.txt file. ÚcCs2g|_g|_d|_d|_d|_| |¡d|_dS)NFr)ÚentriesÚsitemapsÚ default_entryÚ disallow_allÚ allow_allÚset_urlÚ last_checked©ÚselfÚurl©rú)/usr/lib/python3.10/urllib/robotparser.pyÚ__init__s  zRobotFileParser.__init__cCs|jS)z·Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically. )r ©r rrrÚmtime%szRobotFileParser.mtimecCsddl}| ¡|_dS)zYSets the time the robots.txt file was last fetched to the current time. rN)Útimer )r rrrrÚmodified.szRobotFileParser.modifiedcCs&||_tj |¡dd…\|_|_dS)z,Sets the URL referring to a robots.txt file.ééN)rÚurllibÚparseÚurlparseÚhostÚpathr rrrr 6s zRobotFileParser.set_urlc Cs´z tj |j¡}Wn@tjjyI}z2|jdvrd|_n|jdkr0|jdkr>d|_WYd}~dSWYd}~dSWYd}~dSWYd}~dSd}~ww|  ¡}|  |  d¡  ¡¡dS)z4Reads the robots.txt URL and feeds it to the parser.)i‘i“TiiôNzutf-8) rÚrequestÚurlopenrÚerrorÚ HTTPErrorÚcoderr ÚreadrÚdecodeÚ splitlines)r ÚfÚerrÚrawrrrr";s ÿÿ€ýzRobotFileParser.readcCs2d|jvr|jdur||_dSdS|j |¡dS©NÚ*)Ú useragentsrrÚappend)r ÚentryrrrÚ _add_entryHs   þzRobotFileParser._add_entrycCsJd}tƒ}| ¡|D] }|s(|dkrtƒ}d}n|dkr(| |¡tƒ}d}| d¡}|dkr7|d|…}| ¡}|s>q | dd¡}t|ƒdkr|d ¡ ¡|d<tj   |d ¡¡|d<|ddkr~|dkrs| |¡tƒ}|j   |d¡d}q |ddkr–|dkr•|j   t|dd ƒ¡d}q |dd kr®|dkr­|j   t|dd ƒ¡d}q |dd krÊ|dkrÉ|d ¡ ¡rÇt|dƒ|_d}q |dd kr|dkr|d d¡}t|ƒdkr|d ¡ ¡r|d ¡ ¡rtt|dƒt|dƒƒ|_d}q |ddkr|j  |d¡q |dkr#| |¡dSdS)z”Parse the input lines from a robots.txt file. We allow that a user-agent: line is not preceded by one or more blank lines. rréú#Nú:z user-agentÚdisallowFÚallowTz crawl-delayz request-rateú/Úsitemap)ÚEntryrr-ÚfindÚstripÚsplitÚlenÚlowerrrÚunquoter*r+Ú rulelinesÚRuleLineÚisdigitÚintÚdelayrÚreq_rater)r ÚlinesÚstater,ÚlineÚiÚnumbersrrrrQsv         € € €  ÿ€€ ÿzRobotFileParser.parsecCs |jrdS|jr dS|jsdStj tj |¡¡}tj dd|j|j |j |j f¡}tj  |¡}|s3d}|j D]}| |¡rD| |¡Sq6|jrN|j |¡SdS)z=using the parsed robots.txt decide if useragent can fetch urlFTrr3)rr r rrrr;Ú urlunparserÚparamsÚqueryÚfragmentÚquoterÚ applies_toÚ allowancer)r Ú useragentrÚ parsed_urlr,rrrÚ can_fetchšs( ÿ   ÿ zRobotFileParser.can_fetchcCó>| ¡sdS|jD] }| |¡r|jSq |jr|jjSdS©N)rrrLr@r©r rNr,rrrÚ crawl_delay·ó   ÿzRobotFileParser.crawl_delaycCrQrR)rrrLrArrSrrrÚ request_rateÁrUzRobotFileParser.request_ratecCs|jsdS|jSrR)rrrrrÚ site_mapsËszRobotFileParser.site_mapscCs,|j}|jdur||jg}d tt|ƒ¡S)Nz )rrÚjoinÚmapÚstr)r rrrrÚ__str__Ðs  zRobotFileParser.__str__N)r)Ú__name__Ú __module__Ú __qualname__Ú__doc__rrrr r"r-rrPrTrVrWr[rrrrrs     I  c@s(eZdZdZdd„Zdd„Zdd„ZdS) r=zoA rule line is a single "Allow:" (allowance==True) or "Disallow:" (allowance==False) followed by a path.cCs<|dkr|sd}tj tj |¡¡}tj |¡|_||_dS)NrT)rrrGrrKrrM)r rrMrrrrÚs  zRuleLine.__init__cCs|jdkp | |j¡Sr()rÚ startswith)r ÚfilenamerrrrLâszRuleLine.applies_tocCs|jrdndd|jS)NÚAllowÚDisallowz: )rMrrrrrr[åszRuleLine.__str__N)r\r]r^r_rrLr[rrrrr=×s  r=c@s0eZdZdZdd„Zdd„Zdd„Zdd „Zd S) r5z?An entry has one or more user-agents and zero or more rulelinescCsg|_g|_d|_d|_dSrR)r*r<r@rArrrrrës zEntry.__init__cCs‚g}|jD] }| d|›¡q|jdur| d|j›¡|jdur3|j}| d|j›d|j›¡| tt|j ƒ¡d  |¡S)Nz User-agent: z Crawl-delay: zRequest-rate: r3Ú ) r*r+r@rAÚrequestsÚsecondsÚextendrYrZr<rX)r ÚretÚagentÚraterrrr[ñs    z Entry.__str__cCsF| d¡d ¡}|jD]}|dkrdS| ¡}||vr dSq dS)z2check if this entry applies to the specified agentr3rr)TF)r8r:r*)r rNrirrrrLýs ÿzEntry.applies_tocCs$|jD] }| |¡r|jSqdS)zZPreconditions: - our agent applies to this entry - filename is URL decodedT)r<rLrM)r rarDrrrrM s   ÿzEntry.allowanceN)r\r]r^r_rr[rLrMrrrrr5és  r5) r_Ú collectionsÚ urllib.parserÚurllib.requestÚ__all__Ú namedtuplerrr=r5rrrrÚs  B