Compatibility class by dale-wahl · Pull Request #605 · digitalmethodsinitiative/4cat

dale-wahl · 2026-06-16T15:06:45Z

Create a `Compatibility` class for processors to use and replace/generalize `is_compatible_with`.

That is the goal of this PR. I have done my best to keep it functionally identical, except for a handful of small, intentional fixes/improvements:

split_by_thread: 4chan/8chan → fourchan/eightchan (the old ids never matched a real datasource — latent bug).
api_tool options endpoint now 422s for an undeclared processor on a non-top dataset (consistent with the UI listing).
api_standalone now passes config, so settings-gated processors are evaluated instead of uniformly excluded.
youtube_metadata uses is_collector where it used is_top_dataset (intent should be the same; see below for details).
perspective is now given an explicit top_dataset_only=True, extensions={"csv","ndjson"} (it was silently on the bare default).

is_compatible_with can still override Compatibility for advanced checks that need to inspect multiple DataSets (such as walking the genealogy). Compatibility should still be declared in those cases, as a guide for compatibility checks that do not require a DataSet object.

Right now it serves to centralize and standardize compatibility checks. Many processors do not actually require an existing DataSet in order to check compatibility. The Compatibility object will also enable a dynamic processor map, and I have attempted to separate the compatibility axes that require an existing DataSet object from those that do not. That way, we can still show a looser category of "could be compatible given X condition is met" (e.g. dataset has Y and Z columns).
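To illustrate the split between class-level and DataSet-level axes, here is a minimal sketch. It is not the code in this PR: `CompatibilitySketch`, `could_match`, `matches`, and `required_columns` are hypothetical names chosen for the example, and only `types` and `extensions` mirror keywords mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompatibilitySketch:
    """Hypothetical stand-in for the PR's Compatibility class."""
    types: Optional[frozenset] = None          # None means "any type"
    extensions: Optional[frozenset] = None     # None means "any extension"
    required_columns: frozenset = frozenset()  # can only be verified on a real dataset

    def could_match(self, dataset_type: str, extension: str) -> bool:
        """Class-level check: uses only facts known without a DataSet."""
        if self.types is not None and dataset_type not in self.types:
            return False
        if self.extensions is not None and extension not in self.extensions:
            return False
        return True

    def matches(self, dataset_type: str, extension: str, columns: set) -> bool:
        """Dataset-level check: also verifies the columns that actually exist."""
        return (self.could_match(dataset_type, extension)
                and self.required_columns <= set(columns))


spec = CompatibilitySketch(
    extensions=frozenset({"csv", "ndjson"}),
    required_columns=frozenset({"body", "timestamp"}),
)
```

With this shape, `could_match` can feed a processor map ("could be compatible if the columns exist"), while `matches` gives the strict answer once a dataset is available.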

There should be a second follow-up PR to re-examine processor compatibility needs and make processors more declarative of the DataSet shape. See follow-up below.

Notes on decisions:

The default compatibility is Compatibility(top_dataset_only=True), which was our pre-existing default (the behaviour when a hasattr check found no is_compatible_with; is_compatible_with is now a method on BasicProcessor, so that check no longer works).
For abstract BasicProcessors such as BaseFilter or TwitterStatsBase, I set Compatibility(types=set()) which is to say, not compatible with any type. We could add an abstract attribute or something if that is not clear, but I felt "not compatible with any type" was actually clearer.
I folded existing followups into Compatibility(preferred_followups=[]) with the hope to use it later for the map or recommendations/processor ordering.
Compatibility(excluded_followups=[]) replaces exclude_followup_processors being redefined in child classes. It is already used to hide processors from appearing in the UI even though they would otherwise be compatible.
We have an is_from_collector and an is_top_dataset which are, naturally, defined differently, though I am not sure there is an existing DataSet type where those do not agree. I kept them both for Compatibility but use them differently. is_from_collector is a type check (the type ends in -search or -import), while is_top_dataset requires the DataSet itself in order to check self.key_parent (we stub it in BasicProcessor to return False as the "default"). In terms of compatibility requirements, is_from_collector is just a different way of looking for types, so it is an "or"; is_top_dataset is a hard gate that requires the dataset to be a top-level one, and is thus an "and". We may wish to move relevant processors to use is_from_collector instead of the DataSet-required check, since I do not think much thought was put into which was used (and is_from_collector likely did not exist for much of 4CAT's life).
Check compatibility.py for the full list of compatibility gates (and even with all those, I still needed some overrides)
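The "or" versus "and" distinction described in the notes above can be sketched as follows. This is an assumption-laden illustration, not the logic in compatibility.py: the function `is_compatible` and its parameters are hypothetical, `example-type` and `other-type` are made-up dataset types, and a top-level dataset is assumed to be one with an empty `key_parent`.

```python
from typing import Optional


def is_from_collector(dataset_type: str) -> bool:
    # Per the notes above: collector types end in -search or -import.
    return dataset_type.endswith(("-search", "-import"))


def is_compatible(dataset_type: str, key_parent: str, *,
                  types: Optional[set] = None,
                  is_collector: bool = False,
                  top_dataset_only: bool = False) -> bool:
    # Hard gate ("and"): fail at once if a top-level dataset is
    # required but this dataset has a parent.
    if top_dataset_only and key_parent:
        return False
    # No type-like requirement declared: nothing more to check.
    if types is None and not is_collector:
        return True
    # Type-like axes ("or"): a single match is enough.
    type_match = types is not None and dataset_type in types
    collector_match = is_collector and is_from_collector(dataset_type)
    return type_match or collector_match
```

For example, `is_compatible("fourchan-search", "", types={"example-type"}, is_collector=True)` passes on the collector axis alone, while adding `top_dataset_only=True` rejects any dataset that has a parent regardless of its type.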

Potential review points/improvements:

We call shutil.which() a lot and could probably cache that result. I looked into it a bit and something like this might work:

import shutil
from functools import lru_cache

# this would also cache negative results though so needs a bit more work
@lru_cache(maxsize=None)
def _which(command):
    return shutil.which(command)

Along those lines, I have the config settings checks lumped together in one section, but I had to check some executables there. It may be better to separate quicker settings from more complex ones so that the is_compatible_with check short-circuits faster (something I realized I lost mid-development and added back).
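The short-circuit idea amounts to ordering checks from cheap to expensive and stopping at the first failure. A minimal sketch, with hypothetical check functions standing in for the real gates:

```python
from typing import Callable, Iterable, List

Check = Callable[[], bool]


def all_pass(checks: Iterable[Check]) -> bool:
    """Evaluate checks in order and stop at the first failure."""
    return all(check() for check in checks)


calls: List[str] = []  # records which checks actually ran


def cheap_type_check() -> bool:
    calls.append("type")
    return False  # pretend the dataset type does not match


def expensive_executable_check() -> bool:
    calls.append("executable")
    return True  # stands in for something slower, e.g. a PATH lookup


# Cheap checks first, expensive ones last.
result = all_pass([cheap_type_check, expensive_executable_check])
```

Because `all()` consumes the generator lazily, the expensive check never runs once the cheap one has failed.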

Improving `Compatibility`

My hope is that visualizing the processor map will show connections we have not normally identified. My expectation is that we will see gaps and potential points to clarify between connections, particularly around get_columns and is_rankable. Those are variable in that a DataSet might meet the requirements for compatibility (by having the correct columns), but we also design certain processors to produce the necessary columns. For those processors, it would be beneficial to declare is_rankable and sets_columns, which would allow processor mapping without an actual existing DataSet. We could, for example, create a Datasource base class that has sets_columns = {"timestamp", "body", "author", etc.} and a test to ensure that is true, which could then be used for mapping.
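A rough sketch of what declared columns could enable follows. Everything here is hypothetical: `sets_columns` is the attribute proposed above, but `requires_columns`, `can_follow`, `missing_columns`, and the class names are invented for the example and do not exist in 4CAT.

```python
class ProducerSketch:
    """Hypothetical base for anything that produces a dataset."""
    sets_columns = frozenset()


class DatasourceSketch(ProducerSketch):
    # Columns this producer promises to write.
    sets_columns = frozenset({"timestamp", "body", "author"})


class ConsumerSketch:
    """Hypothetical base for processors that need certain columns."""
    requires_columns = frozenset()


class NeedsBodyAndTimestamp(ConsumerSketch):
    requires_columns = frozenset({"body", "timestamp"})


class NeedsImageUrl(ConsumerSketch):
    requires_columns = frozenset({"image_url"})


def can_follow(producer, consumer) -> bool:
    """Class-level mapping: no DataSet object is needed."""
    return consumer.requires_columns <= producer.sets_columns


def missing_columns(producer, sample_row: dict) -> set:
    """Test helper: declared columns that a real output row lacks."""
    return set(producer.sets_columns) - set(sample_row)
```

A test suite could call `missing_columns` on a sample row from each producer and assert that the result is empty, which would keep the declarations honest enough to use for the map.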

We also may benefit from declaring certain attributes as "variable" in some way. For example media_type is almost always certain from the class alone except in rare cases such as import media which sets media_type based on the files received. These sorts of things make the separation of DataSet and class very complicated (not to mention all the stubs in BasicProcessor...)

Status

Conversion complete across all processor categories and the only is_compatible_with functions left are overrides of the new Compatibility class.

dale-wahl added 17 commits June 15, 2026 17:25

create me a Compatibility class

12db02b

test it on some processors

bfaa455

processor: clear default behavoir

5eb8add

convert more type processors to use Compatibility

d7cd873

update type is_compatible_with checks

97ba63a

base_twitter_stats: compatibility with abstract class, cannot overwrite is_compatible_with as it would be inherited, instead compatible w/ empty set() i.e. nothing

76533fb

is compatible w twitter stats subclasses

f19adf9

move compatibility, fold in followups

0ddfdde

fold in exclude processor followups into compatibility

500d6d1

compatibility: convert extension type checks

6004dba

base_filter: abstract class compatibility

91658ae

compatibility: convert top_dataset checks plus extension check

1168256

Merge branch 'master' into compatibility

7b29184

compatibility: multi type checks

309e132

compatibility: fix executable check (pass function) to settings check

863f085

compatibility: figure out ffmpeg -> ffprobe connection and generalize.

d71f6ce

compatibility: add a short circuit! do not check every requirement. and leave expensive last

f8d6408

dale-wahl self-assigned this Jun 16, 2026

dale-wahl added 12 commits June 17, 2026 09:05

compatibility datasources

6f8e605

compatibility: add is_rankable and handle ranking multiple items True/False

6ee0600

compatibility media_types

5384e92

compatibility: required_settings

27c8e2f

compatibility: clarify the dataset-required separation and make a helper

e18f264

compatibility: excluded_types, is_collector, child_only axes

345dd88

compatibility: base downloaders

6f6421a

compatibility: keep is_compatible_with overrides (for credentials), but add coarse map specs in compatbility

e466c3f

compatibility: couple more with overrides

7483df1

compatibility: requires ANY column (in addition to requires all columns)

3c0da2d

video_hasher: easy compatibility

20dd2ff

compatibilities w/ overrides

248622f

dale-wahl added 2 commits June 17, 2026 14:54

compatibility cleanup

3302f48

clean up hasattr is_compatible_with checks

0fc03f7

dale-wahl marked this pull request as ready for review June 17, 2026 13:34

dale-wahl requested a review from stijn-uva June 17, 2026 13:35

dale-wahl wants to merge 31 commits into master from compatibility
