GitHub - adsabs/ClassifierPipeline: Pipeline to classify new SciX records into appropriate collections.

Classifier Pipeline

Short Summary

This pipeline assigns articles to Collections based on the following criteria: LLM prediction, journal provenance, and citations. Only the first criterion is currently implemented.

This pipeline typically receives records from the Master Pipeline and returns them to the Master Pipeline.

Required Software

- RabbitMQ and PostgreSQL

To Run

To run the Classifier Pipeline directly on a .csv input file with columns bibcode, title, and abstract:

python run.py -n -r path/to/file.csv
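For trying the command above on a small sample, an input file can be produced with a few lines of Python. This is only a sketch: the three column names come from this README, but the comma delimiter, the header row, the `write_input_csv` helper, and the placeholder record are assumptions made for illustration, so check them against the stub data in the repository before relying on it.

```python
import csv
from pathlib import Path

# Column names as listed in this README.
FIELDNAMES = ["bibcode", "title", "abstract"]


def write_input_csv(path, records):
    """Write records (dicts keyed by FIELDNAMES) to a CSV file.

    Assumption: the pipeline accepts a comma-delimited file with a header row.
    """
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        for record in records:
            # Keep only the expected columns, in the expected order.
            writer.writerow({name: record[name] for name in FIELDNAMES})
    return path


if __name__ == "__main__":
    # Placeholder record, not a real bibcode.
    sample = [{
        "bibcode": "EXAMPLE_BIBCODE_1",
        "title": "An example title",
        "abstract": "An example abstract, with a comma to exercise quoting.",
    }]
    print(write_input_csv("sample_records.csv", sample))
```

The resulting file could then be passed as path/to/file.csv in the command above.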

To manually verify a set of classified records:

python run.py -v -r path/to/file.csv

To resend a record or a set of records to the Master Pipeline, use one of the following commands.

For a bibcode:

python run.py -s -b bibcode

For a SciXID:

python run.py -s -x SciXID

For a Classification batch:

python run.py -s -i run_id

Running from the Master Pipeline

If calling the classifier from the Master Pipeline:

python run.py --classify --manual -n path/to/file.csv

When --classify is used, the classifications are indexed immediately. To let a curator inspect the results before indexing, use --classify_verify.

Benchmarking

Production-like throughput profiling is available via:

python -m ClassifierPipeline.benchmark run --dataset ClassifierPipeline/tests/stub_data/stub_new_records.csv --mode real --batch-size 100

Run the default fixed sweep (batch_sizes=[50,100,250,500], modes=real,fake):

python -m ClassifierPipeline.benchmark sweep --dataset ClassifierPipeline/tests/stub_data/stub_new_records.csv
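To make the size of the default sweep concrete, the grid of batch sizes and modes can be enumerated as individual `run` invocations. This is a rough sketch, not how the `sweep` subcommand is implemented: the `sweep_commands` helper is hypothetical, the iteration order is arbitrary, and it assumes `--mode fake` is accepted by `run` in the same way as `--mode real`.

```python
import itertools
import shlex

# Defaults quoted from the sweep description in this README.
BATCH_SIZES = [50, 100, 250, 500]
MODES = ["real", "fake"]


def sweep_commands(dataset, batch_sizes=BATCH_SIZES, modes=MODES):
    """Return one `run` command (as an argument list) per mode/batch-size pair."""
    commands = []
    for mode, batch_size in itertools.product(modes, batch_sizes):
        commands.append([
            "python", "-m", "ClassifierPipeline.benchmark", "run",
            "--dataset", dataset,
            "--mode", mode,  # assumption: "fake" is a valid value here
            "--batch-size", str(batch_size),
        ])
    return commands


if __name__ == "__main__":
    dataset = "ClassifierPipeline/tests/stub_data/stub_new_records.csv"
    for command in sweep_commands(dataset):
        # shlex.join renders the argument list as a copy-pasteable shell line.
        print(shlex.join(command))
```

With the defaults this yields eight combinations (four batch sizes times two modes).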

Compare a candidate run against a baseline:

python -m ClassifierPipeline.benchmark compare --baseline path/to/baseline.json --candidate path/to/candidate.json

Artifacts:

- Markdown report (human summary)
- JSON metrics payload (for diffing/comparisons)
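The JSON payload is meant for diffing, and the `compare` subcommand is the supported way to do that. For a quick ad hoc look, two payloads can also be compared by hand. The sketch below assumes a flat JSON object of numeric metrics; the real schema is not documented here, and `diff_metrics` and the `records_per_second` key are hypothetical names used only for illustration.

```python
import json


def diff_metrics(baseline, candidate):
    """Compare numeric values shared by two metric dicts.

    Returns {key: (baseline_value, candidate_value, relative_change)}, where
    relative_change is None if the baseline value is zero.
    Assumption: metrics are top-level numeric fields of the payload.
    """
    changes = {}
    for key in sorted(baseline.keys() & candidate.keys()):
        old, new = baseline[key], candidate[key]
        # bool is a subclass of int in Python; skip it explicitly.
        if isinstance(old, bool) or isinstance(new, bool):
            continue
        if isinstance(old, (int, float)) and isinstance(new, (int, float)):
            relative = (new - old) / old if old else None
            changes[key] = (old, new, relative)
    return changes


if __name__ == "__main__":
    # Inline payloads with made-up values; in practice these would be loaded
    # from the baseline and candidate JSON files.
    baseline = json.loads('{"records_per_second": 100.0, "mode": "real"}')
    candidate = json.loads('{"records_per_second": 125.0, "mode": "real"}')
    for key, (old, new, rel) in diff_metrics(baseline, candidate).items():
        print(key, old, new, rel)
```

Non-numeric fields are ignored, so differences in run metadata would still need to be checked separately.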

Environment flags used by workers for profiling:

PERF_METRICS_ENABLED=true
PERF_METRICS_PATH=/absolute/path/to/perf_events.jsonl
PERF_FORCE_FAKE_DATA=true|false (controls fake/real inference mode per worker process)
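When launching a worker from a script, these flags can be assembled into an environment mapping and handed to the child process. The sketch below is one way to do that; the `profiling_env` helper is hypothetical and not part of the repository, and it assumes the workers read the values as the literal strings shown above.

```python
import os
from pathlib import Path


def profiling_env(metrics_path, fake_data, base_env=None):
    """Return a copy of the environment with the profiling flags set.

    metrics_path must be absolute, matching the PERF_METRICS_PATH example.
    fake_data selects fake (True) or real (False) inference mode.
    """
    path = Path(metrics_path)
    if not path.is_absolute():
        raise ValueError("PERF_METRICS_PATH should be an absolute path")
    # Copy so the caller's (or the process's) environment is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "PERF_METRICS_ENABLED": "true",
        "PERF_METRICS_PATH": str(path),
        "PERF_FORCE_FAKE_DATA": "true" if fake_data else "false",
    })
    return env


if __name__ == "__main__":
    env = profiling_env("/tmp/perf_events.jsonl", fake_data=True)
    for name in ("PERF_METRICS_ENABLED", "PERF_METRICS_PATH", "PERF_FORCE_FAKE_DATA"):
        print(f"{name}={env[name]}")
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting the worker command.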

Attached Benchmarking

There are now two wrapper layers for benchmark execution:

an in-container runner:

/app/scripts/run-in-container-benchmark.bash \
  --dataset /app/logs/2023Sci_titles_abstracts_200.csv \
  --model-inference-batch-size 16 \
  --model-num-threads 4 \
  --model-num-interop-threads 1

an attach-only host wrapper for already-running classifier containers:

bash /Users/thomasallen/Code/ADS/ADSIngestPipelineTestEnvironment/run-attached-classifier-benchmark.bash \
  --container classifier_pipeline \
  --model-inference-batch-size 16 \
  --model-num-threads 4 \
  --model-num-interop-threads 1

The host wrapper:

- requires an explicit target container
- does not restart or tear down services
- captures host context separately from benchmark runtime metadata
- writes a host-side manifest plus translated host paths for /app artifacts when that mount is detectable
