
Utilities

The utilities module provides helper functions and classes used throughout Pydoll.

pydoll.utils

TextExtractor

TextExtractor()

Bases: HTMLParser

HTML parser for text extraction.

Extracts visible text content from an HTML string, excluding the contents of tags specified in _skip_tags.

handle_starttag

handle_starttag(tag, attrs)

Marks the parser to skip content inside tags specified in _skip_tags.

PARAMETER DESCRIPTION
tag

The tag name.

TYPE: str

attrs

A list of (attribute, value) pairs.

TYPE: list

handle_endtag

handle_endtag(tag)

Signals the parser that a skip tag has ended.

PARAMETER DESCRIPTION
tag

The tag name.

TYPE: str

handle_data

handle_data(data)

Handles text nodes. Adds them to the result unless they are within a skip tag.

PARAMETER DESCRIPTION
data

The text data.

TYPE: str

get_strings

get_strings(strip)

Yields all collected visible text fragments.

PARAMETER DESCRIPTION
strip

Whether to strip leading/trailing whitespace from each fragment.

TYPE: bool

YIELDS DESCRIPTION
str

Visible text fragments.

get_text

get_text(separator, strip)

Returns all visible text.

PARAMETER DESCRIPTION
separator

String inserted between extracted text fragments.

TYPE: str

strip

Whether to strip whitespace from each fragment.

TYPE: bool

RETURNS DESCRIPTION
str

The visible text.

TYPE: str
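The skip-tag behaviour described above can be illustrated with a minimal sketch built directly on the standard library's `HTMLParser`. This is not Pydoll's implementation: the class name `SketchTextExtractor`, the helper `sketch_extract_text`, and the contents of `_skip_tags` (`script` and `style`) are assumptions made for illustration.

```python
from html.parser import HTMLParser


class SketchTextExtractor(HTMLParser):
    """Illustrative stand-in, not the Pydoll class."""

    # Assumption: the real contents of _skip_tags are not documented here.
    _skip_tags = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._parts = []      # collected visible text fragments
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self._skip_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._skip_tags and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def get_strings(self, strip=False):
        for part in self._parts:
            text = part.strip() if strip else part
            if text:
                yield text

    def get_text(self, separator="", strip=False):
        return separator.join(self.get_strings(strip=strip))


def sketch_extract_text(html, separator="", strip=False):
    """Hypothetical helper: feed the HTML, flush the parser, join the text."""
    parser = SketchTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.get_text(separator=separator, strip=strip)


html = "<p>Hello</p><script>var x = 1;</script><p> World </p>"
print(sketch_extract_text(html, separator=" ", strip=True))
```

A depth counter rather than a boolean flag keeps the sketch correct if skipped tags are ever nested.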

SOCKS5Forwarder

SOCKS5Forwarder(remote_host, remote_port, username, password, local_host='127.0.0.1', local_port=0)

Local SOCKS5 proxy (no auth) that forwards to a remote authenticated SOCKS5 proxy.

Can be used as an async context manager:

async with SOCKS5Forwarder(...) as fwd:
    # fwd.local_port is now listening
    ...
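For context on what "authenticated" means on the remote side, the sketch below builds the two client messages defined by the SOCKS5 specifications: the method-selection greeting (RFC 1928) and the username/password sub-negotiation request (RFC 1929). The function names are hypothetical, and this only illustrates the wire format; it makes no claim about how `SOCKS5Forwarder` is implemented internally.

```python
def sketch_build_greeting() -> bytes:
    """Client greeting offering only username/password auth (method 0x02)."""
    # VER = 0x05, NMETHODS = 0x01, METHODS = [0x02]
    return bytes([0x05, 0x01, 0x02])


def sketch_build_auth_request(username: str, password: str) -> bytes:
    """RFC 1929 request: VER=0x01, ULEN, UNAME, PLEN, PASSWD."""
    user = username.encode("utf-8")
    pwd = password.encode("utf-8")
    # Each length is carried in a single byte and must be at least 1.
    if not (1 <= len(user) <= 255 and 1 <= len(pwd) <= 255):
        raise ValueError("username and password must be 1-255 bytes each")
    return bytes([0x01, len(user)]) + user + bytes([len(pwd)]) + pwd


print(sketch_build_greeting().hex())
print(sketch_build_auth_request("user", "pass").hex())
```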

remote_host instance-attribute

remote_host = remote_host

remote_port instance-attribute

remote_port = remote_port

username instance-attribute

username = username

password instance-attribute

password = password

local_host instance-attribute

local_host = local_host

local_port instance-attribute

local_port = local_port

start async

start()

Start accepting connections on local_host:local_port.

stop async

stop()

Gracefully shut down the server.

serve_forever async

serve_forever()

Block until the server is closed (useful for CLI mode).

UserAgentParser

Stateless parser that extracts consistent metadata from a User-Agent string.

Given a UA string such as

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.109 Safari/537.36

it produces all the metadata needed for CDP Emulation.setUserAgentOverride and JavaScript navigator property overrides, ensuring full consistency between HTTP headers and JS properties.

parse staticmethod

parse(user_agent)

Parse a User-Agent string into consistent browser metadata.

PARAMETER DESCRIPTION
user_agent

Full User-Agent string.

TYPE: str

RETURNS DESCRIPTION
ParsedUserAgent

ParsedUserAgent with platform, vendor, appVersion, userAgentMetadata, and JS override script.
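To make the idea of deriving consistent values from a single User-Agent string concrete, here is a rough sketch that pulls a few fields out with regular expressions. The function `sketch_parse_user_agent`, its dictionary keys, and the platform mapping are assumptions for illustration; the real `ParsedUserAgent` structure and parsing rules may differ.

```python
import re


def sketch_parse_user_agent(user_agent: str) -> dict:
    """Hypothetical helper: derive a few navigator-style values from a UA."""
    chrome = re.search(r"Chrome/(\d+)(?:\.\d+)*", user_agent)
    full_version = chrome.group(0).split("/", 1)[1] if chrome else ""
    major_version = chrome.group(1) if chrome else ""

    # Assumed mapping from UA platform tokens to navigator.platform values.
    if "Windows NT" in user_agent:
        platform = "Win32"
    elif "Macintosh" in user_agent:
        platform = "MacIntel"
    elif "Linux" in user_agent:
        platform = "Linux x86_64"
    else:
        platform = ""

    # In Chrome, navigator.appVersion is the UA without the "Mozilla/" prefix.
    prefix = "Mozilla/"
    app_version = user_agent[len(prefix):] if user_agent.startswith(prefix) else user_agent

    return {
        "platform": platform,
        "vendor": "Google Inc." if chrome else "",
        "appVersion": app_version,
        "majorVersion": major_version,
        "fullVersion": full_version,
    }


ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.6099.109 Safari/537.36"
)
print(sketch_parse_user_agent(ua))
```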

clean_script_for_analysis

clean_script_for_analysis(script)

Clean JavaScript code by removing comments and string literals.

This helps avoid false positives when analyzing script structure.

PARAMETER DESCRIPTION
script

JavaScript code to clean.

TYPE: str

RETURNS DESCRIPTION
str

Cleaned script with comments and strings removed.

TYPE: str
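One possible way to strip comments and string literals is a single regular expression that matches either kind of token and replaces it with nothing. The sketch below is an approximation under that assumption, with a hypothetical function name; it does not handle JavaScript regex literals or nested template expressions, and Pydoll's actual approach may be different.

```python
import re

# Strings and comments are matched in one pass, so a "//" inside a string
# is consumed as part of the string and never mistaken for a comment.
_COMMENT_OR_STRING = re.compile(
    r'''
    //[^\n]*                # line comment
    | /\*.*?\*/             # block comment
    | "(?:\\.|[^"\\])*"     # double-quoted string
    | '(?:\\.|[^'\\])*'     # single-quoted string
    | `(?:\\.|[^`\\])*`     # template literal (no nested-expression handling)
    ''',
    re.VERBOSE | re.DOTALL,
)


def sketch_clean_script(script: str) -> str:
    """Hypothetical helper: drop comments and string literals from JS code."""
    return _COMMENT_OR_STRING.sub("", script)


print(repr(sketch_clean_script('var a = "return"; // return here\nfoo();')))
```

Removing these tokens first means a later structural check will not be fooled by the word `return` or a brace appearing inside a string or comment.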

decode_base64_to_bytes

decode_base64_to_bytes(image)

Decodes a base64 image string to bytes.

PARAMETER DESCRIPTION
image

The base64 image string to decode.

TYPE: str

RETURNS DESCRIPTION
bytes

The decoded image as bytes.

TYPE: bytes
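As a point of reference, the standard library already provides the core operation. The sketch below is an assumed equivalent with a hypothetical function name; whether the real function also accepts inputs such as `data:` URI prefixes is not stated here.

```python
import base64


def sketch_decode_base64(image: str) -> bytes:
    """Hypothetical stand-in: decode a plain base64 string to raw bytes."""
    return base64.b64decode(image)


encoded = base64.b64encode(b"\x89PNG").decode("ascii")
print(sketch_decode_base64(encoded))
```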

extract_text_from_html

extract_text_from_html(html, separator='', strip=False)

Extracts visible text content from an HTML string.

PARAMETER DESCRIPTION
html

The HTML string to extract text from.

TYPE: str

separator

String inserted between extracted text fragments. Defaults to ''.

TYPE: str DEFAULT: ''

strip

Whether to strip whitespace from text fragments. Defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
str

The extracted visible text.

TYPE: str

get_browser_ws_address async

get_browser_ws_address(port)

Fetches the WebSocket address for the browser instance.

RETURNS DESCRIPTION
str

The WebSocket address for the browser.

TYPE: str

RAISES DESCRIPTION
NetworkError

If the address cannot be fetched due to network errors or missing data.

InvalidResponse

If the response is not valid JSON.
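Chromium-based browsers started with remote debugging enabled expose a `/json/version` HTTP endpoint whose JSON body contains a `webSocketDebuggerUrl` field. Assuming the function reads that field, the parsing step might look like the sketch below. The function name is hypothetical, and built-in exceptions stand in for the library's `InvalidResponse` and `NetworkError`.

```python
import json


def sketch_extract_ws_address(body: str) -> str:
    """Hypothetical helper: pull the WebSocket URL out of a JSON response."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        # Stand-in for the library's InvalidResponse.
        raise ValueError("response is not valid JSON") from exc
    try:
        return data["webSocketDebuggerUrl"]
    except (KeyError, TypeError) as exc:
        # Stand-in for the "missing data" case reported as NetworkError.
        raise LookupError("response has no webSocketDebuggerUrl field") from exc


sample = '{"Browser": "Chrome/120", "webSocketDebuggerUrl": "ws://localhost:9222/devtools/browser/abc"}'
print(sketch_extract_ws_address(sample))
```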

has_return_outside_function

has_return_outside_function(script)

Check if a JavaScript script has return statements outside of functions.

PARAMETER DESCRIPTION
script

JavaScript code to analyze.

TYPE: str

RETURNS DESCRIPTION
bool

True if script has return outside function, False otherwise.

TYPE: bool
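A check like this can be approximated by tracking which open braces belong to function bodies. The sketch below is one such heuristic, not the library's algorithm, and the function name is hypothetical. It assumes the script has already had comments and strings removed, and it does not recognise method shorthand or default parameters that contain braces.

```python
import re

# Tokens of interest: function-body openers, plain braces, and "return".
_TOKENS = re.compile(r"\bfunction\b[^{]*\{|=>\s*\{|\{|\}|\breturn\b")


def sketch_has_return_outside_function(cleaned_script: str) -> bool:
    """Heuristic: True if a 'return' appears with no enclosing function body."""
    stack = []  # True for a function-body brace, False for any other brace
    for match in _TOKENS.finditer(cleaned_script):
        token = match.group()
        if token == "return":
            if not any(stack):
                return True
        elif token == "}":
            if stack:
                stack.pop()
        elif token == "{":
            stack.append(False)
        else:
            # "function ... {" or "=> {" opens a function body.
            stack.append(True)
    return False


print(sketch_has_return_outside_function("return document.title;"))
print(sketch_has_return_outside_function("function f() { return 1; } f();"))
```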

is_script_already_function

is_script_already_function(script)

Check if a JavaScript script is already wrapped in a function.

PARAMETER DESCRIPTION
script

JavaScript code to analyze.

TYPE: str

RETURNS DESCRIPTION
bool

True if script is already a function, False otherwise.

TYPE: bool
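A simple way to approximate this check is to test whether the script starts with a function expression or an arrow function. The pattern and function name below are assumptions for illustration only; the library may use different or stricter rules.

```python
import re

# Matches scripts that begin with "function", "async function",
# "(args) =>", or "arg =>", optionally preceded by whitespace.
_FUNCTION_START = re.compile(
    r"^\s*(?:async\s+)?(?:function\b|\([^)]*\)\s*=>|[A-Za-z_$][\w$]*\s*=>)"
)


def sketch_is_function(script: str) -> bool:
    """Heuristic: True if the script text starts like a function definition."""
    return bool(_FUNCTION_START.match(script))


print(sketch_is_function("function() { return 1; }"))
print(sketch_is_function("return document.title"))
```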

normalize_synthetic_xpath

normalize_synthetic_xpath(selector)

Normalize synthetic XPath selector produced by the builder.

Converts selectors of the form //*[@xpath="..."] back into the original XPath string between the quotes. Returns the input unchanged if the pattern is not present or cannot be parsed safely.

PARAMETER DESCRIPTION
selector

The selector string that may contain the synthetic XPath format.

TYPE: str

RETURNS DESCRIPTION
str

The normalized original XPath or the input selector if no normalization applies.

TYPE: str
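The unwrapping described above can be sketched with a single anchored regular expression. The function name is hypothetical, and the pattern is a simplified assumption; the library's check for whether a selector "can be parsed safely" may be more involved.

```python
import re

# Matches the whole selector only if it has the form //*[@xpath="..."].
_SYNTHETIC_XPATH = re.compile(r'//\*\[@xpath="(.*)"\]', re.DOTALL)


def sketch_normalize_xpath(selector: str) -> str:
    """Return the inner XPath if the synthetic wrapper is present."""
    match = _SYNTHETIC_XPATH.fullmatch(selector)
    return match.group(1) if match else selector


print(sketch_normalize_xpath('//*[@xpath="//div[@id=\'main\']"]'))
print(sketch_normalize_xpath("div.content"))
```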

validate_browser_paths

validate_browser_paths(paths)

Validates potential browser executable paths and returns the first valid one.

Checks a list of possible browser binary locations to find an existing, executable browser. This is used by browser-specific subclasses to locate the browser executable when no explicit binary path is provided.

PARAMETER DESCRIPTION
paths

List of potential file paths to check for the browser executable. These should be absolute paths appropriate for the current OS.

TYPE: list[str]

RETURNS DESCRIPTION
str

The first valid browser executable path found.

TYPE: str

RAISES DESCRIPTION
InvalidBrowserPath

If no valid browser executable is found at any of the given paths.
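The "first valid path wins" logic can be sketched with `os.path.isfile` and `os.access`. This is an illustrative approximation with a hypothetical function name; `FileNotFoundError` stands in for the library's `InvalidBrowserPath`, and the exact validity checks Pydoll performs are an assumption.

```python
import os
import sys


def sketch_validate_browser_paths(paths: list[str]) -> str:
    """Return the first path that is an existing, executable file."""
    for path in paths:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    # Stand-in for the library's InvalidBrowserPath exception.
    raise FileNotFoundError(f"no valid browser executable in: {paths}")


# The running Python interpreter is used here only as a known executable.
print(sketch_validate_browser_paths(["/nonexistent/browser", sys.executable]))
```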