Compatibility class #605

Open
dale-wahl wants to merge 31 commits into master from compatibility

@dale-wahl dale-wahl commented Jun 16, 2026

Member

Create a Compatibility class for processors to use and replace/generalize is_compatible_with.

That is the goal of this PR. I have done my best to keep it functionally identical, except for a handful of small, intentional fixes/improvements:

  • split_by_thread: 4chan/8chan → fourchan/eightchan (the old ids never matched a real datasource — latent bug).
  • api_tool options endpoint now 422s for an undeclared processor on a non-top dataset (consistent with the UI listing).
  • api_standalone now passes config, so settings-gated processors are evaluated instead of uniformly excluded.
  • youtube_metadata uses is_collector where it used is_top_dataset (intent should be the same; see below for details).
  • perspective is now given an explicit top_dataset_only=True, extensions={"csv","ndjson"} (it was silently on the bare default).

The is_compatible_with method can still override Compatibility for advanced checks that need to inspect multiple DataSets (such as walking the genealogy). Even then, Compatibility should still be declared as a guide for compatibility checks that do not require a DataSet object.

Right now it serves the purpose of centralizing and standardizing compatibility checks. Many processors do not actually require an existing DataSet in order to check compatibility. The Compatibility object will also enable a dynamic processor map, and I have attempted to separate the compatibility axes that require an existing DataSet object from those that do not. That way, we can still show a looser category of "could be compatible given X condition is met" (e.g. the dataset has Y and Z columns).
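
To illustrate the split between class-level axes and DataSet-level axes, here is a minimal sketch. It is not the code in compatibility.py: CompatibilitySketch, Verdict, required_columns, check_static and check_dataset are hypothetical names, and only types and extensions mirror arguments mentioned in this PR.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, Optional


class Verdict(Enum):
    NO = "no"
    CONDITIONAL = "could be compatible if a dataset-level condition is met"
    YES = "yes"


@dataclass(frozen=True)
class CompatibilitySketch:
    # None means "no constraint"; an empty set means "compatible with nothing"
    types: Optional[FrozenSet[str]] = None
    extensions: Optional[FrozenSet[str]] = None
    # hypothetical dataset-level axis: columns an actual DataSet must have
    required_columns: FrozenSet[str] = frozenset()

    def check_static(self, dataset_type: str, extension: str) -> Verdict:
        """Evaluate only the axes that need no DataSet object."""
        if self.types is not None and dataset_type not in self.types:
            return Verdict.NO
        if self.extensions is not None and extension not in self.extensions:
            return Verdict.NO
        # anything left undecided depends on the actual dataset
        return Verdict.CONDITIONAL if self.required_columns else Verdict.YES

    def check_dataset(self, dataset_type: str, extension: str,
                      columns: Iterable[str]) -> Verdict:
        """Resolve a conditional verdict once real columns are known."""
        verdict = self.check_static(dataset_type, extension)
        if verdict is Verdict.CONDITIONAL:
            has_all = self.required_columns <= set(columns)
            return Verdict.YES if has_all else Verdict.NO
        return verdict
```

A processor map could draw CONDITIONAL edges differently from YES edges, which is one way to surface the looser category without instantiating any DataSet.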

There should be a second follow-up PR to re-examine processor compatibility needs and make processors more declarative of the DataSet shape. See follow-up below.

Notes on decisions:

  • The default compatibility is Compatibility(top_dataset_only=True), which was our pre-existing default (the behaviour when a hasattr check found no is_compatible_with; since is_compatible_with is now a method on BasicProcessor, that check no longer works).
  • For abstract BasicProcessors such as BaseFilter or TwitterStatsBase, I set Compatibility(types=set()) which is to say, not compatible with any type. We could add an abstract attribute or something if that is not clear, but I felt "not compatible with any type" was actually clearer.
  • I folded existing followups into Compatibility(preferred_followups=[]) with the hope to use it later for the map or recommendations/processor ordering.
  • Compatibility(excluded_followups=[]) replaces exclude_followup_processors being redefined in child classes. It is already used to hide processors from appearing in the UI even though they would otherwise be compatible.
  • We have an is_from_collector and an is_top_dataset which are, naturally, defined differently, though I am not sure there is an existing DataSet type where the two do not agree. I kept them both for Compatibility but use them differently. is_from_collector is a type check (the type ends in -search or -import), while is_top_dataset requires the DataSet in order to check self.key_parent (we stub it in BasicProcessor to return False as the "default"). In terms of compatibility requirements, is_from_collector is just a different way of looking for types, so it is an "or", but is_top_dataset is a hard gate on the parent dataset and thus an "and". We may wish to move relevant processors to is_from_collector instead of the DataSet-required check, since I do not think much thought was put into which one was used (and is_from_collector likely did not exist for much of 4CAT's life).
  • Check compatibility.py for the full list of compatibility gates (and even with all those, I still needed some overrides)
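
The "or" versus "and" distinction in the notes above can be sketched as follows. This is only an illustration under assumptions: passes_gates and its parameters are hypothetical, "top dataset" is taken to mean a dataset without a key_parent, and the real gates in compatibility.py cover more axes.

```python
def is_from_collector(dataset_type):
    # type-only check as described above: collector types end in -search or -import
    return dataset_type.endswith(("-search", "-import"))


def passes_gates(dataset_type, has_parent, *, types=None,
                 collectors=False, top_dataset_only=False):
    """Hypothetical combination of the two checks.

    types=None means no type constraint; types=set() means no type matches.
    """
    # "and": a hard gate that must hold regardless of the type axis
    if top_dataset_only and has_parent:
        return False
    # nothing declared on the type axis, so it does not constrain anything
    if types is None and not collectors:
        return True
    # "or": being from a collector is just another way to match on type
    in_types = types is not None and dataset_type in types
    return in_types or (collectors and is_from_collector(dataset_type))
```

With this shape, Compatibility(types=set()) on an abstract processor rejects everything, while the bare top_dataset_only=True default accepts any dataset that has no parent.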

Potential review points/improvements:

  • We call shutil.which() a lot and could probably cache that result. I looked into it a bit and something like this might work:
import shutil
from functools import lru_cache

# this would also cache negative results though so needs a bit more work
@lru_cache(maxsize=None)
def _which(command):
    return shutil.which(command)
  • Along those lines, I have the config settings checks lumped together in one section, but I had to check some executables there too. It may be better to separate the quicker settings from the more complex ones so that the is_compatible_with check can short-circuit faster (something I realized I had lost mid-development and added back).

Improving Compatibility

My hope is that visualizing the processor map will show connections we have not normally identified. My expectation is that we will see gaps and points that need clarifying between connections, particularly around get_columns and is_rankable. Those are variable in that a DataSet might meet the requirements for compatibility (by having the correct columns), but we also design certain processors to produce the necessary columns. For those processors, it would be beneficial to declare is_rankable and sets_columns, which would allow processor mapping without an actual existing DataSet. We could, for example, create a Datasource base class that has sets_columns = {"timestamp", "body", "author", etc.} and add a test to ensure that declaration holds, which could then be used for mapping.
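
A rough sketch of what declaring sets_columns and is_rankable could enable. Everything here is hypothetical: the class names, the helper functions, and the "date"/"item"/"value" column names are placeholders chosen for illustration, not existing 4CAT code.

```python
class DatasourceSketch:
    """Hypothetical base class declaring the columns every datasource writes."""
    sets_columns = frozenset({"timestamp", "body", "author"})
    is_rankable = False


class CounterSketch:
    """Hypothetical processor designed to produce rankable output."""
    sets_columns = frozenset({"date", "item", "value"})
    is_rankable = True


def could_follow(producer, required_columns, needs_rankable=False):
    """Class-level mapping check: no DataSet instance is involved."""
    if needs_rankable and not getattr(producer, "is_rankable", False):
        return False
    declared = getattr(producer, "sets_columns", frozenset())
    return frozenset(required_columns) <= declared


def missing_declared_columns(producer, row):
    """For a test: declared columns that an actual output row lacks."""
    return producer.sets_columns - set(row)
```

A test suite could run each producer on a small fixture and assert that missing_declared_columns returns an empty set, so the declarations used for the map cannot drift from what the processors really write.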

We also may benefit from declaring certain attributes as "variable" in some way. For example media_type is almost always certain from the class alone except in rare cases such as import media which sets media_type based on the files received. These sorts of things make the separation of DataSet and class very complicated (not to mention all the stubs in BasicProcessor...)

Status

Conversion complete across all processor categories and the only is_compatible_with functions left are overrides of the new Compatibility class.

@dale-wahl dale-wahl self-assigned this Jun 16, 2026
@dale-wahl dale-wahl marked this pull request as ready for review June 17, 2026 13:34
@dale-wahl dale-wahl requested a review from stijn-uva June 17, 2026 13:35