Main classes

Web Traversal Library (WTL) provides a bottom abstraction layer for automatic web workflows.

An action is an abstraction of some website or browser interaction, often implemented as a javascript snippet or webdriver API execution applicable for a given view.

Action instances can be incomplete, i.e. they are not initialized with all fields. Call them with the missing attributes to replace those.

class Abort[source]

Stops any future progress on this tab and sets an aborted flag on it. The tab will not be snapshotted in the future. If all tabs have received an Abort call, the workflow will stop. Note! If you’re using multiple tabs, this action has highest priority.

class Action[source]

Base class for all actions. Do not use directly; refer instead to ElementAction or PageAction.

Each action must provide an execute method that performs the required logic.

class Actions(iterable=(), /)[source]

Helper class for a list of actions

by_element(element: webtraversallibrary.snapshot.PageElement) webtraversallibrary.actions.Actions[source]

Returns all actions (ElementAction) that act upon the given element.

by_raw_score(name: str, limit: float = 0.0) webtraversallibrary.actions.Actions[source]

Returns all actions with the given raw score (output from a classifier before scaling) greater than the given limit.

by_score(name: str, limit: float = 0.0) webtraversallibrary.actions.Actions[source]

Returns all actions with the given score (metadata entry) greater than the given limit.

by_selector(selector: webtraversallibrary.selector.Selector) webtraversallibrary.actions.Actions[source]

Queries the page by the given selector. If at least one element is found, returns all actions whose element is equal to one of those.

by_type(tag: type) webtraversallibrary.actions.Actions[source]

Returns all actions of the given type.

sort_by(name: Optional[str] = None, reverse: bool = False) webtraversallibrary.actions.Actions[source]

Sorts by a certain action (raw) score. If given name does not exist the element gets (raw) score 0.

unique() webtraversallibrary.actions.Action[source]

Checks if exactly one element exists, if so returns it. Throws AssertionError otherwise
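The filter methods above each return a new collection, so they can be chained and finished with unique(). The following is a minimal sketch of that pattern in plain Python. SketchAction and SketchActions are hypothetical stand-ins written for this illustration, not the library's classes, and storing scores in a plain dict is an assumption.

```python
from typing import Dict, Optional


class SketchAction:
    """Hypothetical stand-in for an action that carries score metadata."""

    def __init__(self, label: str, scores: Optional[Dict[str, float]] = None):
        self.label = label
        self.scores = scores or {}


class SketchActions(list):
    """Hypothetical list subclass that mimics the filter-and-chain style."""

    def by_score(self, name: str, limit: float = 0.0) -> "SketchActions":
        # Keep actions whose named score is strictly greater than the limit.
        return SketchActions(a for a in self if a.scores.get(name, 0.0) > limit)

    def sort_by(self, name: str, reverse: bool = False) -> "SketchActions":
        # Actions without the named score are treated as scoring 0.
        ordered = sorted(self, key=lambda a: a.scores.get(name, 0.0), reverse=reverse)
        return SketchActions(ordered)

    def unique(self) -> SketchAction:
        # Exactly one element must remain, otherwise fail loudly.
        assert len(self) == 1, f"expected exactly one action, found {len(self)}"
        return self[0]


actions = SketchActions([
    SketchAction("login", {"is_button": 0.9}),
    SketchAction("footer", {"is_button": 0.2}),
    SketchAction("logo"),
])

best = actions.by_score("is_button", limit=0.5).unique()
ranked = [a.label for a in actions.sort_by("is_button", reverse=True)]
```

Ending a chain with unique() turns a filter that matches zero or several actions into an immediate AssertionError instead of a silent wrong choice.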

class Annotate(location: webtraversallibrary.geometry.Point, color: webtraversallibrary.color.Color, size: int, text: str, background: webtraversallibrary.color.Color = Color(r=0, g=0, b=0, a=0), viewport: Optional[bool] = None)[source]

Writes text on a given page by calling workflow.js.annotate(…). viewport refers to drawing on a viewport canvas. If None, uses default value from config.

class Clear(viewport: Optional[bool] = None)[source]

Clears all highlights and annotations. viewport refers to drawing on a floating viewport-sized canvas. If None, uses default value from config.

class Click(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None)[source]

Simulates a click on the contained element. If it isn’t clickable, nothing happens.

class ElementAction(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None)[source]

Base class for all actions that execute on a specific element. Can be initialised with a PageElement or a css selector string.

transformed_to_element(elements: webtraversallibrary.snapshot.Elements) webtraversallibrary.actions.ElementAction[source]

Modifies this action with a PageElement corresponding to the stored selector

class FillText(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None, text: str = '')[source]

Fills a string by setting the text value in the contained element. If it isn’t a text field, anything can happen.

class Highlight(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None, color: webtraversallibrary.color.Color = Color(r=255, g=179, b=199, a=255), fill: bool = False, viewport: Optional[bool] = None)[source]

Highlights an element by calling workflow.js.highlight(…). viewport refers to drawing on a floating viewport-sized canvas. If None, uses default value from config.

class Navigate(url: str = '')[source]

Navigates to a new URL and waits for the page to load. If the URL is invalid, you may end up on the browser’s error page.

class PageAction[source]

Base class for all actions that do not execute on a specific element.

class Refresh[source]

Triggers a refresh of the current page. Note that the following snapshot may not have wtl_uid that map equally to the previous state.

class Remove(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None)[source]

Removes the given element from the DOM.

class Revert(view_index: int = 0)[source]

Reverts the state of the Workflow to a previous point in time. In effect, resets the underlying web driver and then replays all actions leading up to the given view_index. If 0, just perform the initial action of Navigate to the initial URL.

class ScrollTo(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None)[source]

Scrolls the current page to center the given element vertically.

class Select(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None, value: str = '')[source]

Selects the given value on a <select> dropdown element. If it isn’t a select element, anything can happen.

class Wait(seconds: float = 0)[source]

Calls time.sleep() with the given seconds argument.

class WaitForElement(selector: webtraversallibrary.selector.Selector, seconds: float = 1.0)[source]

Checks to see if element at given selector exists on the page. Keeps trying indefinitely with a given interval until it succeeds.
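The polling behaviour described above can be sketched as follows. This is an illustration in plain Python, not the library's implementation: wait_for, element_exists and the injectable sleep parameter are hypothetical names introduced here so that the loop can run without a browser.

```python
import time
from typing import Callable


def wait_for(
    exists: Callable[[], bool],
    seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until exists() is true and return the number of checks performed."""
    checks = 1
    while not exists():
        # No timeout: the loop only ends when the check succeeds.
        sleep(seconds)
        checks += 1
    return checks


# Simulated page where the element appears on the third check.
state = {"calls": 0}


def element_exists() -> bool:
    state["calls"] += 1
    return state["calls"] >= 3


checks = wait_for(element_exists, seconds=0.0, sleep=lambda _: None)
```

Because the loop has no upper bound, a selector that never matches blocks forever, so it is worth double-checking the selector before relying on this kind of wait.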

class WaitForUser[source]

Waits until the Enter key is pressed in the terminal.

Base classes for prior classifiers.

class ActiveElementFilter(name: str = 'is_active', enabled: bool = True, callback: typing.Callable = <function _active_element_filter_func>, action: typing.Optional[webtraversallibrary.actions.Action] = None, highlight: typing.Union[float, bool] = False, mode: webtraversallibrary.classifiers.ScalingMode = ScalingMode.CLAMP, highlight_color: webtraversallibrary.color.Color = Color(r=90, g=25, b=17, a=255), subset: typing.Union[str, typing.Iterable[str]] = 'all', result_type: type = <class 'bool'>)[source]

Returns all elements that are considered active, i.e. interactable in some way. Will also add a boolean is_active field to every element’s metadata.

result_type

alias of bool

class Classifier(name: str, enabled: bool = True, callback: Optional[Any] = None)[source]

Base class for all prior classifiers. Do not use directly; refer instead to ElementClassifier or ViewClassifier.

class ElementClassifier(name: str, enabled: bool = True, callback: typing.Optional[typing.Any] = None, action: typing.Optional[webtraversallibrary.actions.Action] = None, highlight: typing.Union[float, bool] = False, mode: webtraversallibrary.classifiers.ScalingMode = ScalingMode.CLAMP, highlight_color: webtraversallibrary.color.Color = Color(r=90, g=25, b=17, a=255), subset: typing.Union[str, typing.Iterable[str]] = 'all', result_type: type = <class 'float'>)[source]

Classifies a set of elements. The callback will receive a list of elements that have tags for all tags given in subset.

The callback either returns a sublist of elements, or a list of tuples mapping element to a numeric score.

If the callback is doing multi-class prediction, then the output should be a dictionary mapping class name to a sublist or list of tuples described above. The prediction results would be stored as <classifier_name>__<class name>.

If highlight is True, highlight every element returned by this classifier. If highlight is a float x, highlight every element with a score larger than x. If highlight is an int N, highlight the top N scoring elements.
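The three highlight modes can be told apart by the type of the argument. The sketch below shows one way to dispatch on it. pick_highlighted is a hypothetical helper written for this illustration, and representing results as (name, score) tuples is an assumption.

```python
from typing import List, Tuple, Union


def pick_highlighted(
    scored: List[Tuple[str, float]],
    highlight: Union[bool, int, float],
) -> List[str]:
    """Return the names of the elements to highlight for a given setting."""
    # bool must be tested first: in Python, bool is a subclass of int.
    if isinstance(highlight, bool):
        return [name for name, _ in scored] if highlight else []
    if isinstance(highlight, int):
        # An int N selects the N highest-scoring elements.
        top = sorted(scored, key=lambda pair: pair[1], reverse=True)[:highlight]
        return [name for name, _ in top]
    # Remaining case: a float threshold, strictly greater than.
    return [name for name, score in scored if score > highlight]


scored = [("a", 0.9), ("b", 0.4), ("c", 0.7)]
```

Note that 1 and 1.0 select different modes under this scheme: the int picks the single best element, while the float acts as a threshold.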

result_type

alias of float

class ScalingMode(value)[source]

CLAMP = 2

Any values less than 0 are set to 0, any values larger than 1 are set to 1

IDENTITY = 1

No scaling at all, preserve the raw score

LINEAR = 3

Map values to the [0,1] range

A further mode maps the logarithm of the values to the [0,1] range.

scale(values: Sequence[float]) List[float][source]

Scales a list of scores according to a given mode
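To make the difference between the scaling behaviours concrete, here is a small sketch of three of them as plain functions: no scaling, clamping to [0,1], and a linear map onto [0,1]. The function names are hypothetical, and reading "map values to the [0,1] range" as min-max normalisation is an assumption, not something this reference states.

```python
from typing import List, Sequence


def scale_identity(values: Sequence[float]) -> List[float]:
    # No scaling at all: raw scores pass through unchanged.
    return list(values)


def scale_clamp(values: Sequence[float]) -> List[float]:
    # Values below 0 become 0, values above 1 become 1.
    return [min(1.0, max(0.0, v)) for v in values]


def scale_linear(values: Sequence[float]) -> List[float]:
    # Assumption: min-max normalisation onto [0, 1].
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw = [-0.5, 0.25, 1.5]
```

Clamping keeps in-range scores comparable across calls, whereas a min-max map depends on the other values in the same batch.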

class ViewClassifier(name: str, enabled: bool = True, callback: Optional[Any] = None)[source]

Classifies a given view. The callback will receive a view and return an iterable of string tags.

Contains built-in simple, common goals.

FOREVER(*_, **__)[source]

Will always return False (continue).

class N_STEPS(n: int)[source]

Will return False (continue) exactly n times, and then return True.

class ONCE[source]

Will return False (continue) exactly once.
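A goal is a callable that returns False to continue and True to stop. The counting behaviour described for N_STEPS can be sketched as a small stateful callable. NSteps below is a hypothetical re-creation for illustration, not the library's code.

```python
class NSteps:
    """Returns False (continue) n times, then True (stop) on every later call."""

    def __init__(self, n: int):
        self.remaining = n

    def __call__(self, *_, **__) -> bool:
        # Accept and ignore any arguments, like the built-in goals' signatures.
        if self.remaining > 0:
            self.remaining -= 1
            return False
        return True


goal = NSteps(2)
results = [goal() for _ in range(4)]
```

Under this reading, ONCE behaves like the n = 1 case.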

Contains built-in simple, common policies and helper decorators for building custom policies.

DUMMY(_workflow, view)[source]

Never takes any action.

RANDOM(_workflow, view)[source]

Picks a completely random (available) action for every view.

multi_tab_coroutine(policy)[source]

Decorator for simplifying policies given by a generator/coroutine on multiple tabs. Allows you to yield from the policy and pass the reference to the function/coroutine as you would for a normal method, without instantiating an object.

single_tab(func)[source]

Decorator for simplifying policies or goals when just one tab is in use. Simplifies the API, the views (dict) argument is replaced by the single (first) view instance. Also, the policy can now just return action (or list of actions) without a dict mapping.
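The adaptation that a decorator of this kind performs can be sketched as below, for the policy case only. single_tab_sketch is a hypothetical stand-in, and the shape of the views argument (a dict keyed by tab name) and of the returned mapping are assumptions made for the illustration.

```python
from functools import wraps
from typing import Any, Callable, Dict


def single_tab_sketch(
    func: Callable[[Any, Any], Any],
) -> Callable[[Any, Dict[str, Any]], Dict[str, Any]]:
    """Wrap a (workflow, view) policy so it accepts and returns per-tab dicts."""

    @wraps(func)
    def wrapper(workflow: Any, views: Dict[str, Any]) -> Dict[str, Any]:
        # Take the first (by assumption, only) tab and its view.
        tab, view = next(iter(views.items()))
        # Re-wrap the bare result so the caller still gets a per-tab mapping.
        return {tab: func(workflow, view)}

    return wrapper


@single_tab_sketch
def policy(_workflow: Any, view: Any) -> str:
    # A real policy would return an action; a string keeps the sketch runnable.
    return f"click first link on {view}"


result = policy(None, {"main": "view-of-main"})
```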

single_tab_coroutine(policy)[source]

Decorator for simplifying policies given by a generator/coroutine on a single tab. Interface will be the same as when using @single_tab for normal functions. Allows you to yield from the policy and pass the reference to the function/coroutine as you would for a normal method, without instantiating an object.

Module for scraping a website and saving contents to file.

class Scraper(driver: selenium.webdriver.remote.webdriver.WebDriver, config: webtraversallibrary.config.Config, hide_sticky: bool = True, postload_callbacks: Optional[List[Callable]] = None)[source]

Used to create web page snapshots using a WebDriver instance.

capture_screenshot(name: str, max_page_height: int = 0) webtraversallibrary.screenshot.Screenshot[source]

Captures the screenshot of the current rendering in the browser window.

get_page_as_mhtml() bytes[source]

Gets an MHTML representation of the current page, returns it as a bytestring. MHTML must be enabled in the browser configuration, otherwise returns None.

navigate(url: str)[source]

Navigate the scraper’s internal webdriver to the URL. Blocks until page is loaded.

refresh()[source]

Triggers a refresh of the page. Blocks until page is loaded.

scrape_current_page() webtraversallibrary.snapshot.PageSnapshot[source]

Scrapes the page currently open in the driver.

wait_until_loaded(timeout: Optional[int] = None)[source]

Waits on the webdriver instance to finish loading a page before returning.

Note

Because of the unending imagination of javascript devs, there may be cases where this function returns before the timeout although the page hasn’t loaded.

Abstraction layer for interactions with a browser instance.

class Window(config: webtraversallibrary.config.Config, preload_callbacks: Optional[Iterable[pathlib.Path]] = None, postload_callbacks: Optional[List[Callable]] = None)[source]

Owns a webdriver, scraper and javascript wrapper. Handles browser instance logic.

close_tab()[source]

Closes the current active tab.

create_tab(name: str, url: str = 'about:blank')[source]

Opens a new tab with a given name (must be unique for this Window). If the url argument is given, saves this value so that it can be retrieved later through the navigation property. The tab will not automatically navigate to the given address. Use a Scraper or interact directly with the driver.

property driver

Checks if the driver is attached, and if so returns a reference to it.

ensure_running()[source]

Tries fetching attached window handles. Raises WindowClosedError if this fails.

is_closed(tab)[source]

Returns True if the tab has been closed

property navigation

Returns the URL to navigate to for the current active tab, if it exists. Removes the entry afterwards, so two subsequent calls need not be equal.

property open_tabs

Returns a list of all open tab names

quit()[source]

Terminates the browser and associated driver of this instance. Do not use this window afterwards.

set_tab(name: str)[source]

Sets the current active tab to the one listed under the given name. Throws AssertionError if no such tab exists.

Warning

If the tab has been closed, interactions may lead to weird behaviour.

property tabs

Returns a list of all tab names

The Workflow is the main entry point for using the Web Traversal Library.

class Workflow(url: typing.Union[str, typing.Dict[str, str], typing.Dict[str, typing.Dict[str, str]]], policy: typing.Callable, output: typing.Optional[pathlib.Path] = None, config: typing.Optional[webtraversallibrary.config.Config] = None, goal: typing.Callable = <function FOREVER>, classifiers: typing.Optional[typing.List[webtraversallibrary.classifiers.Classifier]] = None, patches: typing.Optional[typing.Dict[webtraversallibrary.selector.Selector, str]] = None)[source]

The Workflow is the main entry point for using the Web Traversal Library. It will handle setup and teardown of all helper classes and appropriate use of the configuration object. Note that you can “run a workflow” manually by creating Window and Scraper objects manually, however this is not recommended for general use.

Create a Workflow and reset it.

Parameters

url: Single string, dict mapping tab name to URL, or dict mapping window name to a dict of tab name to URL. Each tab name must be unique, even among different windows!

policy: Function taking workflow and views, returning an action or a list of actions per tab/view.

output: Path for storing local data (if needed).

config: Configuration object for this instance.

goal: Called before each policy call, will halt the workflow if it returns True.

classifiers: List of classifiers to run on every snapshot.

patches: A dictionary of selectors to monkeypatch to other destinations.

property aborted: bool

Returns True if the current tab has been closed.

create_window(name: str) webtraversallibrary.window.Window[source]

Opens a new browser window and adds it to this workflow. Returns the new window.

property current_tab: str

Returns the name of the current tab

property current_window: webtraversallibrary.window.Window

Returns the window object for the current tab.

property driver: webdriver

Returns the WebDriver instance associated to the current window.

frame(identifier: str) webtraversallibrary.helpers.FrameSwitcher[source]

Returns a context manager for entering and exiting iframes. See FrameSwitcher for more details.

property history: List[webtraversallibrary.view.View]

Returns a history of views for the current tab. The view stores previous_action and next_action in its metadata for future resurrection of the workflow.

property js: webtraversallibrary.javascript.JavascriptWrapper

Returns a JavascriptWrapper associated to the current window.

property latest_view: webtraversallibrary.view.View

Returns the latest View taken from the current tab.

property open_tabs

Returns a list of all open tab names.

property output_path: pathlib.Path

Returns path to output directory for current tab and iteration

property post_processing_output_path: pathlib.Path

Returns path to output directory for post processing output. The Workflow does not put anything here itself by default, but this is provided as convenience to the user.

quit()[source]

Cleans up all windows. Call this after you are done! Do not use again after this.

reset()[source]

Resets the workflow. Does not clear any history.

reset_to(view_index: int)[source]

Resets the Workflow and replays the first view_index actions. History in memory will be mutated. Because this resets the loop_idx variable to match, saved output will override previous output.

run()[source]

Runs the workflow loop! Initializes the tabs to the starting URLs and then calls run_once() in a loop until the goal function returns True or all tabs have stopped.

run_once()[source]

Runs a single iteration of the WTL flow, i.e. snapshots, runs classifiers, checks goal, computes actions, calls policy, executes actions. Check loop_idx member attribute for number of iterations.

Note

This does not initialize the tabs to their starting URLs. Normally, use run() instead.

Returns

The boolean output from the goal function.
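The relationship between run() and run_once() can be sketched as a loop that repeats one iteration until the goal reports True. The code below is a browser-free illustration with hypothetical names. It records the documented step order in a log. Skipping the remaining steps once the goal returns True is an assumption, and the max_iterations guard exists only to keep the sketch finite.

```python
from typing import Callable, List


def run_once_sketch(log: List[str], goal: Callable[[], bool]) -> bool:
    """One iteration in the documented order; returns the goal's output."""
    log.append("snapshot")
    log.append("classifiers")
    log.append("goal")
    if goal():
        # Assumption: a satisfied goal skips the rest of the iteration.
        return True
    log.extend(["compute actions", "policy", "execute actions"])
    return False


def run_sketch(goal: Callable[[], bool], max_iterations: int = 100) -> List[str]:
    """Repeat iterations until the goal returns True (bounded for safety)."""
    log: List[str] = []
    for _ in range(max_iterations):
        if run_once_sketch(log, goal):
            break
    return log


# A goal that continues once and then stops.
answers = iter([False, True])
log = run_sketch(lambda: next(answers))
```

With this goal the policy step runs exactly once, while the snapshot step runs twice because the goal is checked after each snapshot.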

property scraper: webtraversallibrary.scraper.Scraper

Provides a Scraper instance associated to the current window.

smart_scroll_to(element: webtraversallibrary.geometry.Rectangle)[source]

Uses js.scroll_to in a slightly smarter way to minimize required scrolls and vertically center elements of interest.

property success: bool

Returns True if there are any open (i.e. not-cancelled) tabs.

tab(name: str) webtraversallibrary.workflow.Workflow[source]

Sets the current tab to the given name.

property tabs: List[str]

Returns a list of all tab names.

view(view: webtraversallibrary.view.View) webtraversallibrary.workflow.Workflow[source]

Sets the current tab to the one where given view was taken from.

window(name: str)[source]

Returns the window instance with the name provided when created with create_window().

property windows

Returns a list of all window instances.