Main classes¶
Web Traversal Library (WTL) provides a bottom abstraction layer for automatic web workflows.
An action is an abstraction of some website or browser interaction, often implemented as a javascript snippet or webdriver API execution applicable for a given view.
Action instances can be incomplete, i.e. not initialized with all fields. Call such an instance with the missing attributes to fill them in.
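The idea of an incomplete action that is completed by calling it can be illustrated with a small stand-in. This is a minimal sketch, not the library's implementation: `FillTextSketch` is a hypothetical class, and whether WTL returns a copy or updates the instance in place is not stated here (the sketch returns a copy).

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FillTextSketch:
    """Hypothetical stand-in for an action with optional fields."""

    target: Optional[str] = None  # e.g. a CSS selector; None means "not set yet"
    text: str = ""

    def __call__(self, **missing):
        # Return a copy of this action with the supplied fields replaced.
        return replace(self, **missing)


template = FillTextSketch(text="hello")      # incomplete: no target yet
complete = template(target="input#search")   # supply the missing field
```

The template stays reusable, so the same partially specified action can be completed for several different targets.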
- class Abort[source]¶
Stops any future progress on this tab and sets an aborted flag on it. The tab will not be snapshotted in the future. If all tabs have received an Abort call, the workflow will stop. Note! If you’re using multiple tabs, this action has highest priority.
- class Action[source]¶
Base class for all actions. Do not use, refer instead to ElementAction or PageAction. Each action must provide an execute method that performs the required logic.
- class Actions(iterable=(), /)[source]¶
Helper class for a list of actions
- by_element(element: webtraversallibrary.snapshot.PageElement) webtraversallibrary.actions.Actions [source]¶
Returns all actions (ElementAction) that act upon the given element.
- by_raw_score(name: str, limit: float = 0.0) webtraversallibrary.actions.Actions [source]¶
Returns all actions with the given raw score (output from a classifier before scaling) greater than the given limit.
- by_score(name: str, limit: float = 0.0) webtraversallibrary.actions.Actions [source]¶
Returns all actions with the given score (metadata entry) greater than the given limit.
- by_selector(selector: webtraversallibrary.selector.Selector) webtraversallibrary.actions.Actions [source]¶
Queries the page by the given selector. If at least one element is found, returns all actions whose element is equal to one of those.
- by_type(tag: type) webtraversallibrary.actions.Actions [source]¶
Returns all actions of the given type.
- sort_by(name: Optional[str] = None, reverse: bool = False) webtraversallibrary.actions.Actions [source]¶
Sorts by a certain action (raw) score. If given name does not exist the element gets (raw) score 0.
- unique() webtraversallibrary.actions.Action [source]¶
Checks if exactly one element exists, if so returns it. Throws AssertionError otherwise
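Because each filter returns another Actions list, the helpers can be chained. The following is a minimal sketch of that pattern using hypothetical stand-ins (`ActionSketch`, `ClickSketch`, `ActionsSketch`); it assumes scores live in a per-action dictionary and that `by_type` matches subclasses, neither of which is confirmed by the descriptions above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ActionSketch:
    """Hypothetical action carrying a name and a table of scores."""

    name: str
    scores: Dict[str, float] = field(default_factory=dict)


class ClickSketch(ActionSketch):
    """Hypothetical subclass used to demonstrate filtering by type."""


class ActionsSketch(list):
    """Hypothetical list subclass whose filters return the same list type."""

    def by_type(self, tag: type) -> "ActionsSketch":
        return ActionsSketch(a for a in self if isinstance(a, tag))

    def by_score(self, name: str, limit: float = 0.0) -> "ActionsSketch":
        # A missing score is treated as 0, mirroring the sort_by description.
        return ActionsSketch(a for a in self if a.scores.get(name, 0.0) > limit)

    def sort_by(self, name: str, reverse: bool = False) -> "ActionsSketch":
        key = lambda a: a.scores.get(name, 0.0)
        return ActionsSketch(sorted(self, key=key, reverse=reverse))

    def unique(self) -> ActionSketch:
        assert len(self) == 1, f"expected exactly one action, found {len(self)}"
        return self[0]


actions = ActionsSketch([
    ClickSketch("login", {"is_active": 1.0}),
    ClickSketch("footer", {"is_active": 0.0}),
    ActionSketch("other", {"is_active": 1.0}),
])
chosen = actions.by_type(ClickSketch).by_score("is_active", 0.5).unique()
```

Here `chosen` is the "login" click, since it is the only `ClickSketch` with an `is_active` score above 0.5.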
- class Annotate(location: webtraversallibrary.geometry.Point, color: webtraversallibrary.color.Color, size: int, text: str, background: webtraversallibrary.color.Color = Color(r=0, g=0, b=0, a=0), viewport: Optional[bool] = None)[source]¶
Writes text on a given page by calling workflow.js.annotate(…). viewport refers to drawing on a viewport canvas. If None, uses default value from config.
- class Clear(viewport: Optional[bool] = None)[source]¶
Clears all highlights and annotations. viewport refers to drawing on a floating viewport-sized canvas. If None, uses default value from config.
- class Click(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None)[source]¶
Simulates a click on the contained element. If it isn’t clickable, nothing happens.
- class ElementAction(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None)[source]¶
Base class for all actions that execute on a specific element. Can be initialised with a PageElement or a CSS selector string.
- transformed_to_element(elements: webtraversallibrary.snapshot.Elements) webtraversallibrary.actions.ElementAction [source]¶
Modifies this action with a PageElement corresponding to the stored selector
- class FillText(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None, text: str = '')[source]¶
Fills a string by setting the text value in the contained element. If it isn’t a text field, anything can happen.
- class Highlight(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None, color: webtraversallibrary.color.Color = Color(r=255, g=179, b=199, a=255), fill: bool = False, viewport: Optional[bool] = None)[source]¶
Highlights an element by calling workflow.js.highlight(…). viewport refers to drawing on a floating viewport-sized canvas. If None, uses default value from config.
Navigates to a new URL and waits for the page to load. If the URL is invalid, you may end up on the browser’s error page.
- class Refresh[source]¶
Triggers a refresh of the current page. Note that the wtl_uid values in the following snapshot may not map to the same elements as in the previous state.
- class Remove(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None)[source]¶
Removes the given element from the DOM.
- class Revert(view_index: int = 0)[source]¶
Reverts the state of the Workflow to a previous point in time. In effect, resets the underlying web driver and then replays all actions leading up to the given view_index. If 0, just performs the initial action of Navigate to the initial URL.
- class ScrollTo(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None)[source]¶
Scrolls the current page to center the given element vertically.
- class Select(target: Optional[Union[webtraversallibrary.snapshot.PageElement, webtraversallibrary.selector.Selector]] = None, value: str = '')[source]¶
Selects the given value on a <select> dropdown element. If it isn’t a select element, anything can happen.
- class WaitForElement(selector: webtraversallibrary.selector.Selector, seconds: float = 1.0)[source]¶
Checks to see if element at given selector exists on the page. Keeps trying indefinitely with a given interval until it succeeds.
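The retry behaviour described for WaitForElement amounts to a polling loop with a fixed interval. The sketch below shows that loop in isolation; `wait_until` and the `exists` callable are hypothetical stand-ins for "does the selector match anything yet?", not the library's code.

```python
import time
from typing import Callable


def wait_until(exists: Callable[[], bool], seconds: float = 1.0) -> int:
    """Poll `exists` until it returns True; return the number of checks made."""
    attempts = 1
    while not exists():
        time.sleep(seconds)  # wait one interval before trying again
        attempts += 1
    return attempts


# Stand-in for a page where the element appears on the third check.
checks = iter([False, False, True])
attempts = wait_until(lambda: next(checks), seconds=0.01)
```

Since there is no timeout, a selector that never matches keeps the loop running forever, which is worth keeping in mind when using an action of this kind.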
Base classes for prior classifiers.
- class ActiveElementFilter(name: str = 'is_active', enabled: bool = True, callback: typing.Callable = <function _active_element_filter_func>, action: typing.Optional[webtraversallibrary.actions.Action] = None, highlight: typing.Union[float, bool] = False, mode: webtraversallibrary.classifiers.ScalingMode = ScalingMode.CLAMP, highlight_color: webtraversallibrary.color.Color = Color(r=90, g=25, b=17, a=255), subset: typing.Union[str, typing.Iterable[str]] = 'all', result_type: type = <class 'bool'>)[source]¶
Returns all elements that are considered active, i.e. interactable in some way. Will also add a boolean is_active field to every element’s metadata.
- result_type¶
alias of
bool
- class Classifier(name: str, enabled: bool = True, callback: Optional[Any] = None)[source]¶
Base class for all prior classifiers. Do not use, refer instead to ElementClassifier or ViewClassifier.
- class ElementClassifier(name: str, enabled: bool = True, callback: typing.Optional[typing.Any] = None, action: typing.Optional[webtraversallibrary.actions.Action] = None, highlight: typing.Union[float, bool] = False, mode: webtraversallibrary.classifiers.ScalingMode = ScalingMode.CLAMP, highlight_color: webtraversallibrary.color.Color = Color(r=90, g=25, b=17, a=255), subset: typing.Union[str, typing.Iterable[str]] = 'all', result_type: type = <class 'float'>)[source]¶
Classifies a set of elements. The callback will receive a list of elements that have tags for all tags given in subset.
The callback either returns a sublist of elements, or a list of tuples mapping element to a numeric score.
If the callback is doing multi-class prediction, then the output should be a dictionary mapping class name to a sublist or list of tuples described above. The prediction results are stored as <classifier_name>__<class name>.
If highlight is True, highlight every element returned by this classifier. If highlight is a float x, highlight every element with a score larger than x. If highlight is an int N, highlight the top N scoring elements.
- result_type¶
alias of
float
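The accepted callback outputs and the highlight rules can be sketched as two small functions. This is an illustration under stated assumptions, not the library's code: `normalize` and `to_highlight` are hypothetical names, elements are represented by plain strings, and a bare element in a sublist is assumed to get score 1.0.

```python
from typing import Any, Dict, List, Tuple, Union

Scored = List[Tuple[Any, float]]
Result = Union[List[Any], Scored]


def _as_scored(result: Result) -> Scored:
    # Assumption: an element returned without a score counts as score 1.0.
    return [item if isinstance(item, tuple) else (item, 1.0) for item in result]


def normalize(name: str, output: Union[Result, Dict[str, Result]]) -> Dict[str, Scored]:
    """Map a callback's output to {metadata key: [(element, score), ...]}."""
    if isinstance(output, dict):
        # Multi-class: one entry per class, keyed <classifier_name>__<class name>.
        return {f"{name}__{cls}": _as_scored(res) for cls, res in output.items()}
    return {name: _as_scored(output)}


def to_highlight(scored: Scored, highlight: Union[bool, int, float]) -> List[Any]:
    """Pick the elements to highlight following the three documented rules."""
    if isinstance(highlight, bool):  # check bool first: bool is a subclass of int
        return [element for element, _ in scored] if highlight else []
    if isinstance(highlight, int):  # top N by score
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        return [element for element, _ in ranked[:highlight]]
    return [element for element, score in scored if score > highlight]


scored = [("a", 0.2), ("b", 0.9), ("c", 0.5)]
single = normalize("is_button", ["a", "b"])
multi = normalize("kind", {"link": [("a", 0.9)], "button": ["b"]})
```

The order of the type checks matters in Python, because `True` and `False` would otherwise be treated as the integers 1 and 0.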
- class ScalingMode(value)[source]¶
- CLAMP = 2¶
Any values less than 0 are set to 0, any values larger than 1 are set to 1
- IDENTITY = 1¶
No scaling at all, preserve the raw score
- LINEAR = 3¶
Map values to the [0,1] range
Logarithmic scaling: map the logarithm of the values to the [0,1] range
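The four scaling behaviours described above (no scaling, clamping, mapping values to [0,1], and mapping logarithms to [0,1]) can be written as plain functions. This is a sketch only: the function names are hypothetical, "map to the [0,1] range" is assumed to mean min-max normalization over the batch of scores, and the logarithmic variant assumes strictly positive inputs.

```python
import math
from typing import List, Sequence


def identity(values: Sequence[float]) -> List[float]:
    """No scaling at all: keep the raw scores."""
    return list(values)


def clamp(values: Sequence[float]) -> List[float]:
    """Values below 0 become 0, values above 1 become 1."""
    return [min(1.0, max(0.0, v)) for v in values]


def linear(values: Sequence[float]) -> List[float]:
    """Min-max normalization to [0, 1] (an assumed interpretation)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid dividing by zero on constant input
    return [(v - lo) / (hi - lo) for v in values]


def log_scaled(values: Sequence[float]) -> List[float]:
    """Min-max normalization of the logarithms; inputs must be positive."""
    return linear([math.log(v) for v in values])


clamped = clamp([-0.5, 0.3, 1.7])
scaled = linear([2.0, 4.0, 6.0])
logged = log_scaled([1.0, 10.0, 100.0])
```

Clamping leaves in-range scores untouched, while the two normalizing variants depend on the other scores in the same batch.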
- class ViewClassifier(name: str, enabled: bool = True, callback: Optional[Any] = None)[source]¶
Classifies a given view. The callback will receive a view and return an iterable of string tags.
Contains built-in simple, common goals.
Contains built-in simple, common policies and helper decorators for building custom policies.
- multi_tab_coroutine(policy)[source]¶
Decorator for simplifying policies given by a generator/coroutine on multiple tabs. Allows you to yield from the policy and pass the reference to the function/coroutine as you would for a normal method, without instantiating an object.
- single_tab(func)[source]¶
Decorator for simplifying policies or goals when just one tab is in use. Simplifies the API: the views (dict) argument is replaced by the single (first) view instance. Also, the policy can now just return an action (or a list of actions) without a dict mapping.
- single_tab_coroutine(policy)[source]¶
Decorator for simplifying policies given by a generator/coroutine on a single tab. Interface will be the same as when using @single_tab for normal functions. Allows you to yield from the policy and pass the reference to the function/coroutine as you would for a normal method, without instantiating an object.
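What a single-tab decorator does can be sketched with a plain wrapper. This is a hypothetical re-creation for illustration, not the library's `single_tab`: views and actions are represented by strings, and keying the returned mapping by tab name is an assumption.

```python
import functools
from typing import Any, Callable, Dict


def single_tab_sketch(func: Callable) -> Callable:
    """Hypothetical decorator: pass only the first view, wrap the result per tab."""

    @functools.wraps(func)
    def wrapper(workflow: Any, views: Dict[str, Any]) -> Dict[str, Any]:
        tab, view = next(iter(views.items()))  # the single (first) view
        return {tab: func(workflow, view)}     # restore the dict mapping

    return wrapper


@single_tab_sketch
def policy(workflow, view):
    # The decorated policy sees one view and returns one action.
    return f"click first link in {view}"


result = policy(None, {"main": "view-of-main"})
```

The decorated function keeps the multi-tab calling convention on the outside, so the caller does not need to know that the policy was written for a single tab.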
Module for scraping a website and saving contents to file.
- class Scraper(driver: selenium.webdriver.remote.webdriver.WebDriver, config: webtraversallibrary.config.Config, hide_sticky: bool = True, postload_callbacks: Optional[List[Callable]] = None)[source]¶
Used to create web page snapshots using a WebDriver instance.
- capture_screenshot(name: str, max_page_height: int = 0) webtraversallibrary.screenshot.Screenshot [source]¶
Captures the screenshot of the current rendering in the browser window.
- get_page_as_mhtml() bytes [source]¶
Gets an MHTML representation of the current page, returns it as a bytestring. MHTML must be enabled in the browser configuration, otherwise returns None.
Navigate the scraper’s internal webdriver to the URL. Blocks until page is loaded.
Abstraction layer for interactions with a browser instance.
- class Window(config: webtraversallibrary.config.Config, preload_callbacks: Optional[Iterable[pathlib.Path]] = None, postload_callbacks: Optional[List[Callable]] = None)[source]¶
Owns a webdriver, scraper and javascript wrapper. Handles browser instance logic.
- create_tab(name: str, url: str = 'about:blank')[source]¶
Open a new tab with a given name (must be unique for this Window). If the url argument is given, its value is saved and can be retrieved later using the navigation() method. The tab will not automatically navigate to the given address. Use a Scraper or interact directly with the driver.
- property driver¶
Checks if the driver is attached, and if so returns a reference to it.
- ensure_running()[source]¶
Tries fetching attached window handles. Raises WindowClosedError if this fails.
Returns the URL to navigate to for the current active tab, if it exists. Removes the entry afterwards, so two subsequent calls need not return the same value.
- property open_tabs¶
Returns a list of all open tab names
- quit()[source]¶
Terminates the browser and associated driver of this instance. Do not use this window afterwards.
- set_tab(name: str)[source]¶
Sets the current active tab to the one listed under the given name. Throws AssertionError if no such tab exists.
Warning
If the tab has been closed, interactions may lead to weird behaviour.
- property tabs¶
Returns a list of all tab names
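The interplay between creating a tab with a URL, selecting it, and retrieving the stored URL exactly once can be sketched without a browser. `TabRegistrySketch` is a hypothetical stand-in that only tracks names and pending URLs; returning None when nothing is pending is an assumption.

```python
from typing import Dict, List, Optional


class TabRegistrySketch:
    """Hypothetical bookkeeping for tab names and their pending URLs."""

    def __init__(self) -> None:
        self.tabs: List[str] = []
        self.current: Optional[str] = None
        self._pending: Dict[str, str] = {}

    def create_tab(self, name: str, url: str = "about:blank") -> None:
        assert name not in self.tabs, f"tab name {name!r} is already in use"
        self.tabs.append(name)
        self._pending[name] = url  # stored only; no navigation happens here

    def set_tab(self, name: str) -> None:
        assert name in self.tabs, f"no tab named {name!r}"
        self.current = name

    def navigation(self) -> Optional[str]:
        # pop() removes the entry, so a second call returns None.
        return self._pending.pop(self.current, None)


registry = TabRegistrySketch()
registry.create_tab("docs", "https://example.com")
registry.set_tab("docs")
first = registry.navigation()
second = registry.navigation()
```

Consuming the entry on read means the caller is responsible for actually navigating once the URL has been fetched.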
The Workflow is the main entry point for using the Web Traversal Library.
- class Workflow(url: typing.Union[str, typing.Dict[str, str], typing.Dict[str, typing.Dict[str, str]]], policy: typing.Callable, output: typing.Optional[pathlib.Path] = None, config: typing.Optional[webtraversallibrary.config.Config] = None, goal: typing.Callable = <function FOREVER>, classifiers: typing.Optional[typing.List[webtraversallibrary.classifiers.Classifier]] = None, patches: typing.Optional[typing.Dict[webtraversallibrary.selector.Selector, str]] = None)[source]¶
The Workflow is the main entry point for using the Web Traversal Library. It will handle setup and teardown of all helper classes and appropriate use of the configuration object. Note that you can “run a workflow” manually by creating Window and Scraper objects manually, however this is not recommended for general use.
Create a Workflow and reset it.
:param url: Single string, dict mapping tab name to URL, or dict mapping window name to a tab: url dict. Each tab name must be unique, even among different windows!
:param policy: Function taking workflow and views, returning an action or a list of actions per tab/view.
:param output: Path for storing local data (if needed).
:param config: Configuration object for this instance.
:param goal: Called before each policy call, will halt the workflow if it returns True.
:param classifiers: List of classifiers to run on every snapshot.
:param patches: A dictionary of selectors to monkeypatch to other destinations.
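The three accepted shapes of the url argument can all be reduced to one nested form. The sketch below shows one way to do that and to enforce the tab-name uniqueness rule; `normalize_urls` and the default names "window" and "tab" are hypothetical, chosen only for this illustration.

```python
from typing import Dict, Union

UrlSpec = Union[str, Dict[str, str], Dict[str, Dict[str, str]]]


def normalize_urls(url: UrlSpec) -> Dict[str, Dict[str, str]]:
    """Return {window name: {tab name: url}} for any of the three shapes."""
    if isinstance(url, str):
        windows = {"window": {"tab": url}}  # default names are assumptions
    elif all(isinstance(value, str) for value in url.values()):
        windows = {"window": dict(url)}     # tab name -> url
    else:
        windows = {name: dict(tabs) for name, tabs in url.items()}

    # Tab names must be unique even across different windows.
    seen = set()
    for tabs in windows.values():
        for tab in tabs:
            if tab in seen:
                raise ValueError(f"duplicate tab name: {tab!r}")
            seen.add(tab)
    return windows


single = normalize_urls("https://example.com")
per_tab = normalize_urls({"a": "https://example.com", "b": "https://example.org"})
```

Normalizing up front keeps the rest of the setup code free of type checks on the url argument.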
- property aborted: bool¶
Returns True if the current tab has been closed.
- create_window(name: str) webtraversallibrary.window.Window [source]¶
Opens a new browser window and adds it to this workflow. Returns the new window.
- property current_tab: str¶
Returns the name of the current tab
- property current_window: webtraversallibrary.window.Window¶
Returns the window object for the current tab.
- property driver: webdriver¶
Returns the WebDriver instance associated to the current window.
- frame(identifier: str) webtraversallibrary.helpers.FrameSwitcher [source]¶
Returns a context manager for entering and exiting iframes. See FrameSwitcher for more details.
- property history: List[webtraversallibrary.view.View]¶
Returns a history of views for the current tab. The view stores previous_action and next_action in its metadata for future resurrection of the workflow.
- property js: webtraversallibrary.javascript.JavascriptWrapper¶
Returns a JavascriptWrapper associated to the current window.
- property latest_view: webtraversallibrary.view.View¶
Returns the latest View taken from the current tab.
- property open_tabs¶
Returns a list of all open tab names.
- property output_path: pathlib.Path¶
Returns path to output directory for current tab and iteration
- property post_processing_output_path: pathlib.Path¶
Returns path to output directory for post processing output. The Workflow does not put anything here itself by default, but this is provided as convenience to the user.
- reset_to(view_index: int)[source]¶
Resets the Workflow and replays the first view_index actions. History in memory will be mutated. Because this resets the loop_idx variable to match, saved output will override previous output.
- run()[source]¶
Runs the workflow loop! Initializes the tabs to the starting URLs and then calls run_once() in a loop until the goal function returns True or all tabs have stopped.
- run_once()[source]¶
Runs a single iteration of the WTL flow, i.e. snapshots, runs classifiers, checks goal, computes actions, calls policy, executes actions. Check loop_idx member attribute for number of iterations.
Note
This does not initialize the tabs to their starting URLs. Normally, use run() instead.
Returns
The boolean output from the goal function.
- property scraper: webtraversallibrary.scraper.Scraper¶
Provides a Scraper instance associated to the current window.
- smart_scroll_to(element: webtraversallibrary.geometry.Rectangle)[source]¶
Uses js.scroll_to in a slightly smarter way to minimize required scrolls and vertically center elements of interest.
- property success: bool¶
Returns True if there are any open (i.e. not-cancelled) tabs.
- tab(name: str) webtraversallibrary.workflow.Workflow [source]¶
Sets the current tab to the given name.
- property tabs: List[str]¶
Returns a list of all tab names.
- view(view: webtraversallibrary.view.View) webtraversallibrary.workflow.Workflow [source]¶
Sets the current tab to the one where given view was taken from.
- window(name: str)[source]¶
Returns the window instance with the name provided when created with create_window().
- property windows¶
Returns a list of all window instances.
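The control flow of run() and run_once() described above can be modelled without a browser. The sketch below is a toy model under stated assumptions: each tab's "page" is an integer, a "snapshot" reads it, an action is a function applied to it, `ABORT` is a hypothetical marker standing in for an Abort action, and the classifier step is left out.

```python
from typing import Callable, Dict, Set

ABORT = "abort"  # hypothetical marker standing in for an Abort action


def run_once(state: Dict[str, int], aborted: Set[str],
             policy: Callable, goal: Callable) -> bool:
    """One iteration: snapshot, check goal, call policy, execute actions."""
    # "Snapshot" every tab that has not been aborted.
    views = {tab: value for tab, value in state.items() if tab not in aborted}
    if goal(views):
        return True
    for tab, action in policy(views).items():
        if action == ABORT:
            aborted.add(tab)                 # no more snapshots for this tab
        else:
            state[tab] = action(state[tab])  # "execute" the action
    return False


def run(state: Dict[str, int], policy: Callable, goal: Callable) -> int:
    """Loop until the goal is reached or every tab has been aborted."""
    aborted: Set[str] = set()
    iterations = 0
    while len(aborted) < len(state):
        if run_once(state, aborted, policy, goal):
            break
        iterations += 1
    return iterations


def increment(n: int) -> int:
    return n + 1


state = {"main": 0}
iterations = run(
    state,
    policy=lambda views: {tab: increment for tab in views},
    goal=lambda views: all(value >= 3 for value in views.values()),
)

aborted_state = {"a": 0, "b": 0}
aborted_iterations = run(
    aborted_state,
    policy=lambda views: {tab: ABORT for tab in views},
    goal=lambda views: False,
)
```

In the first run the goal halts the loop after three completed iterations; in the second, the loop ends after one iteration because every tab has been aborted.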