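# Source: /usr/lib/python3/dist-packages/chardet/charsetprober.py

import logging
import re
from typing import Optional, Union

from .enums import LanguageFilter, ProbingState

# Matches a "word" that contains at least one high (international) byte,
# optionally followed by a single trailing marker byte.
INTERNATIONAL_WORDS_PATTERN = re.compile(
    b"[a-zA-Z]*[\x80-\xFF]+[a-zA-Z]*[^a-zA-Z\x80-\xFF]?"
)


class CharSetProber:

    SHORTCUT_THRESHOLD = 0.95

    def __init__(self, lang_filter: LanguageFilter = LanguageFilter.NONE) -> None:
        self._state = ProbingState.DETECTING
        self.active = True
        self.lang_filter = lang_filter
        self.logger = logging.getLogger(__name__)

    def reset(self) -> None:
        self._state = ProbingState.DETECTING

    @property
    def charset_name(self) -> Optional[str]:
        return None

    @property
    def language(self) -> Optional[str]:
        raise NotImplementedError

    def feed(self, byte_str: Union[bytes, bytearray]) -> ProbingState:
        raise NotImplementedError

    @property
    def state(self) -> ProbingState:
        return self._state

    def get_confidence(self) -> float:
        return 0.0

    @staticmethod
    def filter_high_byte_only(buf: Union[bytes, bytearray]) -> bytes:
        # Collapse every run of ASCII bytes into a single space.
        buf = re.sub(b"([\x00-\x7F])+", b" ", buf)
        return buf

    @staticmethod
    def filter_international_words(buf: Union[bytes, bytearray]) -> bytearray:
        """
        We define three types of bytes:
        alphabet: english alphabets [a-zA-Z]
        international: international characters [\x80-\xFF]
        marker: everything else [^a-zA-Z\x80-\xFF]
        The input buffer can be thought to contain a series of words delimited
        by markers. This function works to filter all words that contain at
        least one international character. All contiguous sequences of markers
        are replaced by a single space ascii character.
        This filter applies to all scripts which do not use English characters.
        """
        filtered = bytearray()

        # Find only the words that have at least one international character.
        # A word may include one marker character at the end.
        words = INTERNATIONAL_WORDS_PATTERN.findall(buf)

        for word in words:
            filtered.extend(word[:-1])

            # If the last character of the word is a marker, replace it with a
            # space: markers are used similarly across languages and should
            # not affect the analysis.
            last_char = word[-1:]
            if not last_char.isalpha() and last_char < b"\x80":
                last_char = b" "
            filtered.extend(last_char)

        return filtered

    @staticmethod
    def remove_xml_tags(buf: Union[bytes, bytearray]) -> bytes:
        """
        Returns a copy of ``buf`` that retains only the sequences of English
        alphabet and high byte characters that are not between <> characters.
        This filter can be applied to all scripts which contain both English
        characters and extended ASCII characters, but is currently only used by
        ``Latin1Prober``.
        """
        filtered = bytearray()
        in_tag = False
        prev = 0
        # Iterate over single-byte items rather than integers.
        buf = memoryview(buf).cast("c")

        for curr, buf_char in enumerate(buf):
            # Check whether we are leaving or entering an XML tag.
            if buf_char == b">":
                prev = curr + 1
                in_tag = False
            elif buf_char == b"<":
                if curr > prev and not in_tag:
                    # Keep the stretch of text seen since the last tag closed,
                    # followed by a space to delimit it.
                    filtered.extend(buf[prev:curr])
                    filtered.extend(b" ")
                in_tag = True

        # If the buffer does not end inside a tag, keep the trailing text.
        if not in_tag:
            filtered.extend(buf[prev:])

        return filtered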