Other references¶

Representation of an RGB(A) color.

class Color(r: int, g: int, b: int, a: int = 255)[source]¶

Representation of an 8-bit RGBA color.

static from_str(color: str) → webtraversallibrary.color.Color[source]¶: Creates a color from a “#RRGGBBAA” representation. Alpha is optional.

to_str(with_alpha: bool = False) → str[source]¶: Creates an “#RRGGBBAA” string from this color. Alpha is optional.

to_tuple(with_alpha: bool = False) → Tuple[source]¶: Returns either a 3- or a 4-tuple with the values, depending on the value of with_alpha.
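The "#RRGGBBAA" round trip described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the RGBA dataclass, its method bodies, and the choice of uppercase hex output are assumptions made only to mirror the documented signatures.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RGBA:
    """Hypothetical stand-in for Color; not the library's class."""

    r: int
    g: int
    b: int
    a: int = 255

    @staticmethod
    def from_str(color: str) -> "RGBA":
        # Accept "#RRGGBB" or "#RRGGBBAA"; alpha defaults to 255 (opaque).
        digits = color.lstrip("#")
        if len(digits) not in (6, 8):
            raise ValueError(f"expected #RRGGBB or #RRGGBBAA, got {color!r}")
        channels = [int(digits[i:i + 2], 16) for i in range(0, len(digits), 2)]
        return RGBA(*channels)

    def to_tuple(self, with_alpha: bool = False) -> Tuple[int, ...]:
        # 4-tuple when alpha is requested, 3-tuple otherwise.
        if with_alpha:
            return (self.r, self.g, self.b, self.a)
        return (self.r, self.g, self.b)

    def to_str(self, with_alpha: bool = False) -> str:
        # Two uppercase hex digits per channel (assumed formatting).
        return "#" + "".join(f"{v:02X}" for v in self.to_tuple(with_alpha))


orange = RGBA.from_str("#FF8000")
```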

Configuration object for discovery workflow

class Config(cfg: Optional[Iterable[Union[str, pathlib.Path, dict]]] = None)[source]¶

Represents a config object.

static default(cfg: Optional[List[Union[str, pathlib.Path, dict]]] = None) → webtraversallibrary.config.Config[source]¶: Creates a Config object based on all default values

validate()[source]¶: Performs basic sanity checks on the configuration values. Raises AssertionError if something is incorrect.

Contains common library-level errors

exception ElementNotFoundError[source]¶: Any error related to finding elements on the page.

exception Error[source]¶

Base class for all WTL-specific exceptions.

classmethod wrapped(func)[source]¶: Wraps a function in a try-except block where AssertionErrors are logged and reraised as Error.
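A decorator of the kind described for wrapped could look roughly like this. The names SketchError, positive, and the logger name are hypothetical, and the logging format is an assumption; only the behavior (log the AssertionError, re-raise it as the library-level error) comes from the description above.

```python
import functools
import logging

logger = logging.getLogger("wrapped_sketch")  # hypothetical logger name


class SketchError(Exception):
    """Hypothetical stand-in for the library's Error base class."""

    @classmethod
    def wrapped(cls, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except AssertionError as exc:
                # Log the failed assertion, then surface it as cls.
                logger.error("Assertion failed in %s: %s", func.__name__, exc)
                raise cls(str(exc)) from exc

        return wrapper


@SketchError.wrapped
def positive(value: int) -> int:
    assert value > 0, "value must be positive"
    return value
```

Callers can then catch a single exception type instead of bare AssertionErrors.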

exception ScrapingError[source]¶: Any error related to scraping/parsing webpages.

exception WebDriverSendError[source]¶: Any error related to custom sending commands to a WebDriver instance

exception WindowClosedError[source]¶: Trying to access a browser instance that was closed by the user

Basic 2-dimensional geometric constructs: points, rectangles, etc.

class Point(x: float, y: float)[source]¶: Represents a point in 2-dimensional plane (e.g. image).

class Rectangle(minima: webtraversallibrary.geometry.Point, maxima: webtraversallibrary.geometry.Point)[source]¶

Represents a rectangle in a 2-dimensional plane.

static bounding_box(rectangles: Sequence[webtraversallibrary.geometry.Rectangle]) → webtraversallibrary.geometry.Rectangle[source]¶: Computes the bounding box of rectangles

property bounds: Tuple[float, float, float, float]¶: Returns min x, min y, max x, max y

property center: webtraversallibrary.geometry.Point¶: Return the midpoint of the rectangle

static centered_at(center: webtraversallibrary.geometry.Point, radius: float) → webtraversallibrary.geometry.Rectangle[source]¶: Returns a new square centered at center whose sides lie at distance radius from the center, giving a side length of 2 * radius. (The technically correct term for radius here is apothem.)

clip(other: webtraversallibrary.geometry.Rectangle) → webtraversallibrary.geometry.Rectangle[source]¶

Return a new rectangle generated by clipping this one by the bounds of other.

Similar to intersection, but clipping non-intersecting rectangles will result in a degenerate rectangle located on one of the edges of other

Returns: New, clipped Rectangle (possibly degenerate)
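One way to obtain the clipping behavior described above is to clamp every coordinate into the bounds of other. The sketch below assumes that approach; Box, clamp, and clip are hypothetical helpers, not the library's Rectangle API.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical (min_x, min_y, max_x, max_y) stand-in for Rectangle."""

    min_x: float
    min_y: float
    max_x: float
    max_y: float

    @property
    def area(self) -> float:
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(value, high))


def clip(box: Box, other: Box) -> Box:
    # Clamp each coordinate of `box` into the bounds of `other`.
    return Box(
        clamp(box.min_x, other.min_x, other.max_x),
        clamp(box.min_y, other.min_y, other.max_y),
        clamp(box.max_x, other.min_x, other.max_x),
        clamp(box.max_y, other.min_y, other.max_y),
    )


frame = Box(0, 0, 10, 10)
overlapping = clip(Box(5, 5, 15, 15), frame)  # partly inside the frame
disjoint = clip(Box(20, 2, 30, 4), frame)     # entirely to the right of it
```

In this sketch the disjoint case collapses onto the right edge of the frame: a zero-width rectangle rather than an empty one at the origin.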

contains(other: Union[webtraversallibrary.geometry.Point, webtraversallibrary.geometry.Rectangle]) → bool[source]¶

Tests whether the rectangle contains other.

Returns: True if other contained in the rectangle, False otherwise.

static empty() → webtraversallibrary.geometry.Rectangle[source]¶: Returns a rectangle of zero area at the origin.

static from_list(*args) → webtraversallibrary.geometry.Rectangle[source]¶: Converts tuple/list (x1, y1, x2, y2) to a Rectangle

static intersection(rectangles: Sequence[webtraversallibrary.geometry.Rectangle]) → webtraversallibrary.geometry.Rectangle[source]¶: Computes the rectangle which is the intersection of a sequence of rectangles. In case the intersection is empty, it returns an empty rectangle.
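The bounding box and intersection of a sequence of rectangles can be sketched on plain (min_x, min_y, max_x, max_y) tuples. This is an illustration under assumptions: the tuple representation, the handling of an empty input sequence, and treating rectangles that merely touch as non-intersecting are choices made here, not documented library behavior.

```python
from typing import Sequence, Tuple

Bounds = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)
EMPTY: Bounds = (0.0, 0.0, 0.0, 0.0)  # zero-area rectangle at the origin


def bounding_box(rects: Sequence[Bounds]) -> Bounds:
    # Smallest rectangle that encloses every input rectangle.
    if not rects:
        return EMPTY
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def intersection(rects: Sequence[Bounds]) -> Bounds:
    # Largest rectangle contained in every input rectangle.
    if not rects:
        return EMPTY
    min_x = max(r[0] for r in rects)
    min_y = max(r[1] for r in rects)
    max_x = min(r[2] for r in rects)
    max_y = min(r[3] for r in rects)
    if min_x >= max_x or min_y >= max_y:
        return EMPTY  # no common area
    return (min_x, min_y, max_x, max_y)


common = intersection([(0, 0, 4, 4), (2, 1, 6, 3)])
nothing = intersection([(0, 0, 1, 1), (2, 2, 3, 3)])
hull = bounding_box([(0, 0, 1, 1), (2, 2, 3, 3)])
```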

resized(delta: float) → webtraversallibrary.geometry.Rectangle[source]¶: Returns a resized rectangle, shrunk (inflated for delta<0) by 2*delta in width and in height.
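The resizing rule can be checked with a few lines of arithmetic. The sketch assumes the rectangle shrinks symmetrically, moving each edge inward by delta; that symmetry, the tuple representation, and the helper names are assumptions for illustration.

```python
from typing import Tuple

Bounds = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def resized(bounds: Bounds, delta: float) -> Bounds:
    # Move every edge inward by delta (outward when delta is negative).
    min_x, min_y, max_x, max_y = bounds
    return (min_x + delta, min_y + delta, max_x - delta, max_y - delta)


def width(bounds: Bounds) -> float:
    return bounds[2] - bounds[0]


def height(bounds: Bounds) -> float:
    return bounds[3] - bounds[1]


original = (0.0, 0.0, 10.0, 6.0)
shrunk = resized(original, 1.0)
inflated = resized(original, -1.0)
```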

property x: float¶: The x-coordinate of the lower-left vertex

property y: float¶: The y-coordinate of the lower-left vertex

Module containing helper functions for graphics-related operations on webdrivers and snapshots.

crop_image(image: PIL.Image.Image, rect: webtraversallibrary.geometry.Rectangle) → PIL.Image.Image[source]¶

Crops the part of the image specified by its rect.

Rectangle specified by rect must lie inside of the image bounds.

draw_rect(image: PIL.Image.Image, rect: webtraversallibrary.geometry.Rectangle, color: webtraversallibrary.color.Color, width: int)[source]¶: Draws a bounding box around the specified rectangle on the image.

draw_text(image: PIL.Image.Image, top_left: webtraversallibrary.geometry.Point, color: webtraversallibrary.color.Color, size: int, text: str)[source]¶: Draws text on a PIL image.

Collection of helper classes used in Workflow.

class ClassifierCollection(classifiers: Iterable[webtraversallibrary.classifiers.Classifier])[source]¶: Helper class for predefined classifiers

class FrameSwitcher(identifier: str, js: webtraversallibrary.javascript.JavascriptWrapper, driver: selenium.webdriver)[source]¶: Helper class for entering and exiting iframes. Raises ElementNotFoundError if an iframe could not be found.

class MonkeyPatches(patches: Optional[Dict[webtraversallibrary.selector.Selector, str]] = None)[source]¶

Helper class for monkeypatches

check(snapshot: webtraversallibrary.snapshot.PageSnapshot, element: webtraversallibrary.snapshot.PageElement) → str[source]¶: If a rule applies for given element for given snapshot, return the most specific value

set_default(patch: str)[source]¶: Equivalent to check(Selector("*"), element) but much faster.

Wrapper functions around JavaScript code to be used from Selenium WebDriver. The main reason to store JS code in files instead of embedding it in Python code is convenience: it is more readable and has better IDE support.

class JavascriptWrapper(driver: selenium.webdriver.remote.webdriver.WebDriver, config: Optional[webtraversallibrary.config.Config] = None)[source]¶

Helper class for executing built-in javascript scripts or custom files and snippets.

annotate(location: webtraversallibrary.geometry.Point, color: webtraversallibrary.color.Color, size: int, text: str, background: webtraversallibrary.color.Color = Color(r=0, g=0, b=0, a=0), viewport: bool = False)[source]¶: Writes text with a given color on the page. Shares an HTML canvas with highlight.

classmethod assemble_script(filenames: Iterable[pathlib.Path]) → str[source]¶: Concatenates the contents of several JavaScript files into one, with caching.

Parameters

filenames – paths to the JS files, either in webtraversallibrary/js or absolute paths.

clear_highlights(viewport: bool = False)[source]¶: Removes all highlights created by highlight().

click_element(selector: webtraversallibrary.selector.Selector)[source]¶: Clicks an element found by the given selector. Note: If more elements can be found, only one will be clicked.

delete_element(selector: webtraversallibrary.selector.Selector)[source]¶: Deletes an element found by the given selector. Note: If more elements can be found, only one will be deleted.

disable_animations()[source]¶

Turns off animation on the page. Works for jQuery by setting a certain flag and for CSS animations by injecting an additional style into the page code.

Mutates the web page.

element_exists(selector: webtraversallibrary.selector.Selector) → bool[source]¶: Returns True if an element exists, otherwise False.

execute_file(filename: Union[pathlib.Path, Iterable[pathlib.Path]], *args, execute_async: bool = False) → Any[source]¶: Executes the JavaScript code in the given file(s) and returns the result.

Parameters

filename – path to the JS file(s), either in webtraversallibrary/js or an absolute path.
execute_async – if True, will wait until the JavaScript code has called arguments[arguments.length-1] and will return its input arguments.

execute_script(script: str, *args) → Any[source]¶: Executes the JavaScript code in script and returns the result.

Parameters

script – path to the JS file relative to this package

execute_script_async(script: str, *args) → Any[source]¶: Executes the JavaScript code in script asynchronously and returns the result.

Parameters

script – path to the JS file relative to this package

fill_text(selector: webtraversallibrary.selector.Selector, value: str)[source]¶: Fills an element as found by a given selector with given text. Note: If more elements can be found, only one will be used.

find_active_elements() → list[source]¶: Uses a couple of heuristics to try and find all clickable elements in the page.

find_iframe_name(identifier: str) → str[source]¶: Looks for an iframe whose name, ID, or class equals the identifier, and returns its name.

Returns: iframe name, or an empty string if no matching iframe was found.

find_viewport() → webtraversallibrary.geometry.Rectangle[source]¶: Gets the viewport, i.e. the area of the web browser window that displays content.

Returns: the viewport as a Rectangle

get_element_metadata() → List[Dict[str, Any]][source]¶

Collects metadata about web page DOM elements: their tags, some of the HTML attributes, position and size on the page, CSS styles and classes, inner text.

Each element is assigned a wtl_uid that is unique within the scope of the page, and points to its parent DOM element through the wtl_parent_uid field. These are not to be confused with the HTML id attribute, which is neither unique nor mandatory.

Returns: a list of JSON objects (in their Python dict form) with HTML attributes, additionally calculated properties and unique IDs. Refer to the script code for the keys’ names.

get_full_height() → int[source]¶: Gets the full page height, i.e. the height of the document.

Returns: document height in pixels

hide_position_fixed_elements(elements: Optional[List[str]] = None) → dict[source]¶

Hides page elements that are fixed or sticky (the ones that stay on the page when you scroll) by setting their visibility to “hidden”.

Returns a map from element ids (wtl-uid) to the old visibility values.

Mutates the web page.

highlight(selector: webtraversallibrary.selector.Selector, color: webtraversallibrary.color.Color, fill: bool = False, viewport: bool = False)[source]¶: Highlight an element as found by a given selector with arbitrary color and intensity. Note: If more elements can be found, only one will be highlighted. Shares an HTML canvas with annotate.

is_page_loaded(*_) → bool[source]¶: Applies some heuristics to check whether the page is loaded. Since this is in general a hard question to answer, the check is known to be faulty in some cases.

make_canvas()[source]¶: Create viewport and page canvases and add them to the DOM.

save_mhtml(filename: str)[source]¶: Executes the MHTML saving extension. Saves to the path specified in config.scraping.temp_path. Note: If the file already exists, it will not be overwritten.

scroll_to(x: float, y: float)[source]¶: Scroll the page to given coordinates.

select(selector: webtraversallibrary.selector.Selector, value: str)[source]¶: Select an element of a dropdown (select) element. Note: If more elements can be found, only one will be used.

show_position_fixed_elements(id_to_visibility: dict)[source]¶

Set the specified visibility to the elements with ids listed in id_to_visibility.

The id_to_visibility parameter is expected to be the (possibly accumulated) output of hide_position_fixed_elements.

Mutates the web page (hopefully undoing the changes made by hide_position_fixed_elements)

safe_selenium_method(func)[source]¶: Handles errors thrown in the browser while executing javascript and outputs information to the log. Note: This is a clumsy decorator for instance methods and assumes there is a self.driver member.

Helper function for logging.

setup_logging(log_dir: Optional[pathlib.Path] = None, logging_level: int = 20)[source]¶

Sets up logging: create a directory to write log files to, configure handlers. Sets sane default values for in-house and third-party modules.

Will remove any existing logging handlers with the name “webtraversallibrary” before proceeding.

Parameters

log_dir – directory to write log files to.
logging_level – level of logging you wish to have, accepts number or logging.LEVEL
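The steps listed for setup_logging (create the directory, drop stale handlers by name, configure a new handler) can be sketched with the standard logging module alone. The function name setup_logging_sketch, the log file name, and the format string are hypothetical; the default level of 20 corresponds to logging.INFO.

```python
import logging
from pathlib import Path
from typing import Optional

HANDLER_NAME = "webtraversallibrary"


def setup_logging_sketch(
    log_dir: Optional[Path] = None, logging_level: int = logging.INFO
) -> logging.Logger:
    root = logging.getLogger()

    # Remove handlers left over from an earlier call, identified by name.
    for old in list(root.handlers):
        if old.get_name() == HANDLER_NAME:
            root.removeHandler(old)
            old.close()

    if log_dir is not None:
        log_dir.mkdir(parents=True, exist_ok=True)
        # "sketch.log" is a made-up file name for this example.
        handler: logging.Handler = logging.FileHandler(log_dir / "sketch.log")
    else:
        handler = logging.StreamHandler()

    handler.set_name(HANDLER_NAME)
    handler.setLevel(logging_level)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    root.addHandler(handler)
    root.setLevel(logging_level)
    return root
```

Because stale handlers are removed first, calling the function twice does not produce duplicated log lines.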

Helper functions to check versions and existence of installed dependencies.

get_current_os() → webtraversallibrary.driver_check.os_functions.OS[source]¶: Gets the OS of the machine the code is running on.

get_driver_location(driver: webtraversallibrary.driver_check.os_functions.Drivers, os: Optional[webtraversallibrary.driver_check.os_functions.OS] = None) → str[source]¶: Gets the location of the driver.

get_driver_version(driver: webtraversallibrary.driver_check.os_functions.Drivers, os: Optional[webtraversallibrary.driver_check.os_functions.OS] = None) → str[source]¶: Gets the driver version.

is_driver_installed(driver: webtraversallibrary.driver_check.os_functions.Drivers, os: Optional[webtraversallibrary.driver_check.os_functions.OS] = None) → bool[source]¶: Checks if a given driver is installed on the OS.

Module collecting helper functions for common processing tasks, such as muting stdout within a context.

class Alarm(timeout)[source]¶

Helper class to run a timeout thread on Windows

This constructor should always be called with keyword arguments. Arguments are:

group should be None; reserved for future extension when a ThreadGroup class is implemented.

target is the callable object to be invoked by the run() method. Defaults to None, meaning nothing is called.

name is the thread name. By default, a unique name is constructed of the form “Thread-N” where N is a small decimal number.

args is the argument tuple for the target invocation. Defaults to ().

kwargs is a dictionary of keyword arguments for the target invocation. Defaults to {}.

If a subclass overrides the constructor, it must make sure to invoke the base class constructor (Thread.__init__()) before doing anything else to the thread.

run()[source]¶

Method representing the thread’s activity.

You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.

class TimeoutContext(n_seconds, error_class=<class 'TimeoutError'>)[source]¶

Uses signal to raise TimeoutError within the block, if execution went over a specified timeout.

raise_error(signal_num, _)[source]¶: Raises error on timeout.
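A signal-based timeout block of this kind can be sketched as follows. TimeoutSketch and slow_call are hypothetical names, and the use of SIGALRM with setitimer is an assumption about how such a context manager might be built; it only works on Unix-like systems and in the main thread.

```python
import signal
import time


class TimeoutSketch:
    """Hypothetical context manager; Unix-only, main thread only."""

    def __init__(self, n_seconds: float, error_class=TimeoutError):
        self.n_seconds = n_seconds
        self.error_class = error_class

    def raise_error(self, signal_num, _frame):
        # Invoked by the OS when the timer expires inside the block.
        raise self.error_class(f"timed out after {self.n_seconds} s")

    def __enter__(self):
        self._previous = signal.signal(signal.SIGALRM, self.raise_error)
        signal.setitimer(signal.ITIMER_REAL, self.n_seconds)
        return self

    def __exit__(self, exc_type, exc, tb):
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, self._previous)  # restore old handler
        return False  # never swallow exceptions


def slow_call() -> str:
    try:
        with TimeoutSketch(0.05):
            time.sleep(1)  # interrupted by the alarm
        return "finished"
    except TimeoutError:
        return "timed out"
```

Restoring the previous handler on exit keeps nested or subsequent uses from interfering with each other.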

class cached_property(method)[source]¶: Decorator that caches a property return value and will return it on later calls. Adapted from the Python Cookbook, 2nd edition.

Note

If you want to map different arguments to values, use functools.lru_cache!
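A caching property decorator is commonly written as a non-data descriptor that stores the computed value on the instance. The sketch below shows that general technique; cached_property_sketch and Page are hypothetical names, and this is not necessarily how the library's class is implemented.

```python
class cached_property_sketch:
    """Hypothetical non-data descriptor that caches on first access."""

    def __init__(self, method):
        self.method = method
        self.__doc__ = method.__doc__

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class, not an instance
        value = self.method(instance)
        # Store under the method's name. The instance attribute now
        # shadows the descriptor, so later lookups skip __get__ entirely.
        instance.__dict__[self.method.__name__] = value
        return value


class Page:
    def __init__(self):
        self.computations = 0

    @cached_property_sketch
    def word_count(self) -> int:
        self.computations += 1  # count how often the body really runs
        return 42


page = Page()
first, second = page.word_count, page.word_count
```

Since Python 3.8 the standard library also ships functools.cached_property, which follows the same idea.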

Abstraction layer for a screenshot of a site

class Screenshot(name: str, image: PIL.Image.Image)[source]¶

Abstraction layer for a screenshot of a site, allowing for various annotations.

annotate(top_left: webtraversallibrary.geometry.Point, color: webtraversallibrary.color.Color, size: int, text: str)[source]¶: Writes text with a given color on the screenshot.

classmethod capture(name: str, driver: selenium.webdriver.remote.webdriver.WebDriver, scale: float = 1.0, max_page_height: int = 0) → webtraversallibrary.screenshot.Screenshot[source]¶: Creates a screenshot on the given webdriver under certain conditions.

classmethod capture_viewport(name: str, driver: selenium.webdriver.remote.webdriver.WebDriver, scale: float = 1.0) → webtraversallibrary.screenshot.Screenshot[source]¶: Creates a screenshot of the current viewport of a given webdriver. Scales the image by some pixel ratio, if given. Uses PIL as a backend.

highlight(rect: webtraversallibrary.geometry.Rectangle, color: webtraversallibrary.color.Color, text: str = '', width: int = 1)[source]¶: Draws a colored rectangle on the screenshot. Can also annotate with a text below the rectangle, if given.

save(path: pathlib.Path, suffix: str = '')[source]¶: Saves screenshot to given path. Filename consists of the screenshot name and an optional suffix.

property size: webtraversallibrary.geometry.Point¶: Returns a (width, height) Point of the screenshot size in pixels

This module contains heuristics for generating selectors.

class Selector(css: str = '*', xpath: str = '/', iframe: Optional[str] = None)[source]¶

Web element selector based on CSS and XPath. If the element is located inside an iframe, you may also specify the identifier (name or ID) of that iframe. The class itself provides no guarantees on whether the selector is unique or even matches anything.

classmethod build(bs4_soup: bs4.BeautifulSoup, target: Union[bs4.element.Tag, int]) → webtraversallibrary.selector.Selector[source]¶: Compute xpath and css of a target in a bs4.BeautifulSoup. Will be verbose. Use a separate generalizer if you want reusable selectors.

Base representation of the current state of a tab.

class View(name: str, snapshot: webtraversallibrary.snapshot.PageSnapshot, actions: webtraversallibrary.actions.Actions = <factory>, tags: typing.Set[str] = <factory>, metadata: typing.Dict[typing.Any, typing.Any] = <factory>)[source]¶

Base representation of the current state of a tab. Holds a snapshot, a list of available actions, and output from the prior classifiers. Note: Arbitrary entries can be added to the metadata field, and its contents will be deeply copied to the next view, so do not store large objects there. If you need large metadata storage, use workflow.metadata instead.

copy(no_snapshot: bool = False)[source]¶: Creates a shallow copy
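The difference between a shallow copy and the deep copy applied to metadata can be illustrated with the standard copy module. ViewSketch is a hypothetical stand-in with only two fields; the sketch shows general Python copy semantics and makes no claim about which fields View.copy actually shares.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ViewSketch:
    """Hypothetical stand-in with only the fields needed here."""

    name: str
    metadata: Dict[Any, Any] = field(default_factory=dict)


current = ViewSketch("step-1", {"visited": ["/home"]})

# A shallow copy creates a new object that shares the same metadata dict.
shallow = copy.copy(current)

# A deep copy duplicates the nested contents, which costs time and memory
# proportional to their size.
carried = ViewSketch("step-2", copy.deepcopy(current.metadata))

current.metadata["visited"].append("/about")
```

After the final line, the shallow copy sees the appended entry while the deep-copied metadata does not, which is why large objects are expensive to carry from view to view.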

Module for different webdrivers considered.

send(driver: selenium.webdriver.remote.webdriver.WebDriver, cmd: str, params: Optional[dict] = None) → int[source]¶: Send command to the webdriver, return resulting status code.

setup_driver(config: webtraversallibrary.config.Config, profile_path: Optional[pathlib.Path] = None, preload_callbacks: Optional[Iterable[pathlib.Path]] = None) → selenium.webdriver.remote.webdriver.WebDriver[source]¶: Creates a WebDriver object with the given configuration.

Parameters

config – the configuration (a Config object)

Returns: a WebDriver instance
