Compatibility class#605
Open
dale-wahl wants to merge 31 commits into
Open
Conversation
…te is_compatible_with as it would be inherited, instead compatible w/ empty set() i.e. nothing
…nd leave expensive last
… but add coarse map specs in compatbility
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Create a
Compatibilityclass for processors to use and replace/generalizeis_compatible_with.That is the goal of this PR. I have done my best to keep it functionality identical except a handful of small, intentional fixes/improvements:" + a short list:
split_by_thread: 4chan/8chan → fourchan/eightchan (the old ids never matched a real datasource — latent bug).api_tooloptions endpoint now 422s for an undeclared processor on a non-top dataset (consistent with the UI listing).api_standalonenow passes config, so settings-gated processors are evaluated instead of uniformly excluded.youtube_metadatausesis_collectorwhere it usedis_top_dataset(intent should be the same; see below for details).perspectivegiven an explicit top_dataset_only=True, extensions={"csv","ndjson"} (it was silently on the bare default).The
is_compatible_withwill still be able to overrideCompatibilityfor advanced checks (such as needing to walk the genealogy) to check multiple DataSets.Compatibilityshould still be declared as a guide for compatibility checks that do not require aDataSetobject.Right now it is serving the point of centralizing and standardizing compatibility checks. Many processors do not actually require an existing DataSet in order to check compatibility. The
Compatibilityobject will also enable a dynamic processor map and I have attempted to separate compatibility axes that require an existing DataSet object with those that do not. That way, we can still show a looser category of "could be compaptible given X condition is met" (e.g. dataset has Y and Z columns).There should be a second follow-up PR to re-examine processor compatibility needs and make processors more declarative of the DataSet shape. See follow-up below.
Notes on decisions:
Compatibility(top_dataset_only=True)which was our pre-existing default (what happened whenis_compatible_withdid not exist viahas_attr;is_compatible_withis now a method onBasicProcessorso that check no longer works).BaseFilterorTwitterStatsBase, I setCompatibility(types=set())which is to say, not compatible with any type. We could add anabstractattribute or something if that is not clear, but I felt "not compatible with any type" was actually clearer.followupsintoCompatibility(preferred_followups=[])with the hope to use it later for the map or recommendations/processor ordering.Compatibility(excluded_followups=[])replacesexclude_followup_processorsbeing redefined in child classes. It is already used to hide processors from appearing in the UI even though they would otherwise be compatible.is_from_collectorandis_top_datasetwhich are, naturally, defined differently. Though I am not sure there is an existing DataSettypewhere those do not agree. I kept them both forCompatibilitybut use them differently.is_from_collectoris atypecheck (ends in-searchor-import) whileis_top_datasetrequires the DataSet to check forself.key_parent(we stub inBasicProcessorto returnFalseas the "default"). In terms of compatibility requirements,is_from_collectoris just a different way of looking fortypesso it is an "or", butis_top_datasetis a hard gate requiring a parent dataset and thus "and". We may wish to move relevant processors to useis_from_collectorinstead of the DataSet required check since I do not think much thought was put into which was used (and likelyis_from_collectordid not exist for much of 4CAT's life).compatibility.pyfor the full list of compatibility gates (and even with all those, I still needed some overrides)Potential review points/improvements:
shutil.which()a lot and could probably cache that result. I looked into a bit and something like this might work:Improving
CompatibilityMy hope is that visualizing the processor map will show connections we have not normally identified. My expectation is that we will see gaps and potential points to clarify between connections. Particularly dealing with
get_columnsandis_rankable. Those are variable in that a DataSet might hit the requirements for compatibility (by having the correct columns). But we also design certain processors to produce the necessary columns. For those processors, it would be beneficial to declareis_rankableandsets_columnswhich would allow processor mapping without an actual existingDataSet. We could, for example, create aDatasourcebase class that hassets_columns = set("timestamp", "body", "author", etc.)and create a test to ensure that is true which could then be used for mapping.We also may benefit from declaring certain attributes as "variable" in some way. For example
media_typeis almost always certain from the class alone except in rare cases such as import media which setsmedia_typebased on the files received. These sorts of things make the separation ofDataSetand class very complicated (not to mention all the stubs inBasicProcessor...)Status
Conversion complete across all processor categories and the only
is_compatible_withfunctions left are overrides of the newCompatibilityclass.