Database
Note
This applies to the PostgreSQL rewrite, which has not yet been fully released.
Because everything is centralized into one tracker, and so much information is kept about jobs, the database schema is rather large.
Wow this got a lot of scope creep. Might need to remove some things later if they're never used.
SQLAlchemy Core is used (without the ORM). This is effectively just an SQL query builder.
- Pipeline: A worker. Pipelines can have multiple slots, which allows them to work on multiple pages at once.
- Job: A group of pages.
- Page: A single webpage. You may occasionally see this referred to as an "item" in the code; that term has been retired to avoid confusion.
- Relation: A link between two pages. If a page links to another page, a corresponding relation will be created.
- Attempt: An attempt to grab a page.
- Result: Information saved from an attempt, e.g. screenshots.
- Niceness: Lower nice values are considered higher priority (i.e. you are less "nice" to the other things in the queue). All priority is implemented this way in mnbot.
Every pipeline has an identifier, which should be somewhat human-readable.
Pipelines are authenticated to the tracker with an API token (note: not an HMAC-based system, simply a randomly-generated password).
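As an illustration of what a "randomly-generated password" scheme can look like, here is a minimal sketch using only the Python standard library. The function names are hypothetical, and storing a SHA-256 digest rather than the raw token is an assumption made for the example; this page does not say how mnbot stores or compares tokens.

```python
import hashlib
import secrets


def new_token() -> str:
    """Generate a random API token (hypothetical helper, not mnbot's API)."""
    return secrets.token_urlsafe(32)


def token_digest(token: str) -> str:
    """Digest to store in place of the raw token (an assumption of this sketch)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def verify_token(presented: str, stored_digest: str) -> bool:
    """Compare in constant time to avoid leaking information through timing."""
    return secrets.compare_digest(token_digest(presented), stored_digest)
```

The token is a plain shared secret: nothing is signed per request, which matches the "not HMAC-based" description above.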
Pipelines can be assigned tags, which are arbitrary identifiers. A job can be configured to only run on pipelines with a specific tag. Matchonly pipelines will only run jobs explicitly assigned to one of their tags; jobs without tags are not considered.
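The tag rules above can be read as a small predicate. The following is a sketch under assumptions: the `Pipeline` class, its field names, and `can_run` are hypothetical, and a job is modelled as having at most one tag.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Pipeline:
    identifier: str
    tags: FrozenSet[str] = frozenset()
    matchonly: bool = False  # hypothetical field name


def can_run(pipeline: Pipeline, job_tag: Optional[str]) -> bool:
    """Return True if the pipeline may take a job with the given tag."""
    if job_tag is None:
        # Untagged jobs are never considered by matchonly pipelines.
        return not pipeline.matchonly
    # Tagged jobs only run on pipelines carrying that tag.
    return job_tag in pipeline.tags
```

For example, a matchonly pipeline tagged `gpu` would accept a job tagged `gpu` but refuse both untagged jobs and jobs tagged with anything else.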
Jobs are stored in a priority FIFO queue. They have four states:
- ACTIVE: the job is available for dequeuing.
- DRAINING: the job has no pending pages, but there are still some claimed pages. The job will not be claimed.
- FINISHED: the job has no pending or claimed pages. The job will not be claimed.
- ABORTED: an operator cancelled the job. The job will not be claimed. Jobs can be "un-aborted" freely, but if the intent is to pause a job, it is instead recommended to set concurrency to 0 as that is a clearer indication.
Pipelines will "stick" to a certain job unless a better (sooner and/or lower nice value) option shows up. Consequently, if job state does not change, job distribution will not either.
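To make the "sticky" behaviour concrete, here is a minimal in-memory sketch. It assumes that jobs are ordered by nice value first and by enqueue order second; how mnbot actually combines the two is not specified here. The `Job` fields, `queued_at` counter, and `pick_job` are hypothetical, and concurrency limits, tags, and locks are ignored.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional, Tuple


class JobState(Enum):
    ACTIVE = "ACTIVE"
    DRAINING = "DRAINING"
    FINISHED = "FINISHED"
    ABORTED = "ABORTED"


@dataclass(frozen=True)
class Job:
    name: str
    nice: int
    queued_at: int  # hypothetical increasing enqueue counter (FIFO order)
    state: JobState = JobState.ACTIVE


def sort_key(job: Job) -> Tuple[int, int]:
    # Lower nice wins; ties are broken by whichever job was queued first.
    return (job.nice, job.queued_at)


def pick_job(jobs: Iterable[Job], current: Optional[Job] = None) -> Optional[Job]:
    """Choose a job for a pipeline, keeping the current one unless a better one exists."""
    candidates = [j for j in jobs if j.state is JobState.ACTIVE]
    if not candidates:
        return None
    best = min(candidates, key=sort_key)
    if (
        current is not None
        and current.state is JobState.ACTIVE
        and sort_key(current) <= sort_key(best)
    ):
        return current  # nothing strictly better showed up, so stick
    return best
```

Because the choice depends only on job state, calling `pick_job` repeatedly with unchanged jobs returns the same answer, which is the "distribution will not change" property described above.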
Claims can be given locks, either INDEFINITELY or UNTIL_FINISHED (which is cleared when the job is finished). If a claim is locked, the corresponding pipeline will not be assigned another job until the lock is cleared.
Pages have two tables: the pages table, a priority FIFO queue containing most of the information about the page, and the relations table, which stores a tree of where discovery occurs.
Each page has one entry in the pages table, but there is no limit on the number of relations.
The inclusion of one page's relations to others is largely a side effect of how depth control works. Wpull (and by extension ArchiveBot and grab-site) has a bug originating from its lack of this functionality: if items in the queue are not completed in order of depth (for example, if an error is encountered on the first try), any links extracted from that URL will be stored at a higher depth than they otherwise might be. If there is a depth limit to the crawl, that means that some URLs may never be extracted at all. mnbot works around this issue by storing "relations" that describe where each page was discovered. When dequeuing a page, the lowest depth value among its relations is retrieved, and the depth filter is applied to that value.
This design has the following constraints:
- It is impossible to conclude that a URL is out of scope before every in-scope URL has been completed.
- Any out-of-scope (due to depth) URLs that are discovered must still be added to the queue, in case a shorter path is found later.
- If a page is queued with a shorter path than currently exists, all relations with that page as a parent must be recursively updated. (The alternative is to calculate the depth during dequeuing.)
- Assuming that errors are infrequent and the priority system is not used, such recursive updates should be rare and will therefore not hurt performance. If that assumption is incorrect, a different system may need to be used.
- This solution only solves the issue of reliably determining whether or not a URL is in scope. Crawl behaviour that depends on depth will not be any more reliable.
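The constraints above can be illustrated with a small in-memory model. This is a sketch, not the actual schema: `RelationStore` and its methods are hypothetical, each relation is assumed to carry its own depth value, and a page's depth is taken as the minimum over its relations.

```python
from collections import defaultdict, deque
from typing import Dict, Optional, Set


class RelationStore:
    """In-memory stand-in for the relations table (hypothetical)."""

    def __init__(self) -> None:
        # child URL -> {parent URL (None for a seed): depth via that parent}
        self.by_child: Dict[str, Dict[Optional[str], int]] = defaultdict(dict)
        # parent URL -> child URLs discovered on it
        self.children: Dict[str, Set[str]] = defaultdict(set)

    def page_depth(self, url: str) -> Optional[int]:
        """Lowest depth over all relations pointing at this page."""
        relations = self.by_child.get(url)
        return min(relations.values()) if relations else None

    def in_scope(self, url: str, max_depth: int) -> bool:
        depth = self.page_depth(url)
        return depth is not None and depth <= max_depth

    def add_seed(self, url: str) -> None:
        self._record(None, url, 0)

    def add_link(self, parent: str, child: str) -> None:
        parent_depth = self.page_depth(parent)
        if parent_depth is None:
            raise KeyError(f"unknown parent page: {parent}")
        self.children[parent].add(child)
        self._record(parent, child, parent_depth + 1)

    def _record(self, parent: Optional[str], child: str, depth: int) -> None:
        # Write one relation, then propagate any improvement to descendants.
        pending = deque([(parent, child, depth)])
        while pending:
            p, c, d = pending.popleft()
            before = self.page_depth(c)
            old = self.by_child[c].get(p)
            if old is not None and old <= d:
                continue  # existing relation is already at least as short
            self.by_child[c][p] = d
            after = self.page_depth(c)
            if before is None or after < before:
                # A shorter path to c shortens every relation with c as parent.
                for grandchild in self.children[c]:
                    pending.append((c, grandchild, after + 1))
```

With a depth limit of 2 and a chain `s -> a -> b -> c`, page `c` starts out of scope at depth 3. If `s -> b` is discovered later, `b` drops to depth 1 and the update propagates so that `c` becomes depth 2 and in scope. This is why out-of-scope URLs still have to be kept in the queue.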
Pages can have the following states:
- READY: the page is available for dequeuing.
- CLAIMED: the page is claimed by a pipeline. It is not available for dequeuing.
- SKIPPED: the page has been skipped by a job rule. It is not available for dequeuing.
- STASHED: currently not used in mnbot. The page is not available for dequeuing.
You may notice that there is not any "FINISHED" state. A page being finished is determined by its attempt counter. Pages have two columns related to this: attempts, containing the number of attempts since the last update to attempts_remaining, and attempts_remaining, containing (surprisingly) the number of attempts remaining. When a page reaches zero attempts remaining, it is considered done and is ineligible for dequeuing.
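The attempt counters can be sketched as follows. This is a simplified model under assumptions: the helper names are hypothetical, every attempt is assumed to consume exactly one remaining attempt and return the page to READY, and how a successful attempt is recorded is not modelled because it is not described here.

```python
from dataclasses import dataclass
from enum import Enum


class PageState(Enum):
    READY = "READY"
    CLAIMED = "CLAIMED"
    SKIPPED = "SKIPPED"
    STASHED = "STASHED"


@dataclass
class Page:
    url: str
    state: PageState = PageState.READY
    attempts: int = 0
    attempts_remaining: int = 3  # hypothetical default

    def is_done(self) -> bool:
        return self.attempts_remaining <= 0

    def can_dequeue(self) -> bool:
        return self.state is PageState.READY and not self.is_done()


def claim(page: Page) -> None:
    """A pipeline takes the page (READY -> CLAIMED)."""
    if not page.can_dequeue():
        raise ValueError("page is not eligible for dequeuing")
    page.state = PageState.CLAIMED


def record_attempt(page: Page) -> None:
    """Assumed bookkeeping after an attempt: one attempt used, page released."""
    page.attempts += 1
    page.attempts_remaining -= 1
    page.state = PageState.READY


def grant_attempts(page: Page, remaining: int) -> None:
    """Explicitly updating attempts_remaining resets the attempts counter."""
    page.attempts_remaining = remaining
    page.attempts = 0
```

Note that a page with zero attempts remaining can still be in the READY state; it is the counter, not the state, that makes it ineligible.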
A job is considered complete when there are no pages in the READY or CLAIMED states.
Tbd.
Jobs can have rules attached to them - this is how most crawl behaviour is applied.
The following settings currently exist. Rules are applied in order, with the last value always having priority (multiple values are not supported).
- accept: true/false; whether to add the URL to the queue.
- skip: true/false; instead of dequeuing the URL, update its status to SKIPPED.
- custom_js: run custom JavaScript code.
- ua: user-agent behaviour.
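The "last value wins" behaviour can be sketched as a simple merge. The rule format used here is an assumption: each rule is modelled as a regular expression plus a dict of settings, and the `ua` values are placeholders. This page does not describe how rules select the URLs they apply to.

```python
import re
from typing import Any, Dict, List

Rule = Dict[str, Any]  # {"pattern": str, "settings": {...}} (hypothetical format)


def effective_settings(rules: List[Rule], url: str) -> Dict[str, Any]:
    """Apply matching rules in order; later rules overwrite earlier values."""
    settings: Dict[str, Any] = {}
    for rule in rules:
        if re.search(rule["pattern"], url):
            settings.update(rule["settings"])
    return settings


rules = [
    {"pattern": r".*", "settings": {"accept": True, "ua": "default"}},
    {"pattern": r"/logout", "settings": {"skip": True}},
    {"pattern": r"\.example\.org", "settings": {"accept": False, "ua": "mobile"}},
]
```

Each setting ends up with exactly one value per URL, so there is no need to support multiple values for the same setting.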
Job rules are stored in the job_rulesets table. The latest ruleset is selected when dequeuing a page, and that is used for all crawl logic during that particular attempt. Old rulesets are kept to preserve context for old attempts.