Official implementation of:
Lim, H., Li, X., Park, S., Li, Q., & Kim, J. (2026). Reducing contextual noise in review-based recommendation via aspect term extraction and attention modeling. Information Sciences, 735, 123078.
This repository is the official implementation of ATRS (Aspect Term-aware Recommender System), published in Information Sciences (2026).
Most review-based recommendation models process entire review bodies indiscriminately, allowing aspect-relevant signal to be diluted by surrounding context. ATRS addresses this by routing review text through a dedicated Aspect Term Extraction (ATE) stage that filters out non-aspect content before downstream encoding.
The retained aspect terms are encoded with a 1D-CNN over Word2Vec embeddings, fused with user/item ID embeddings, and passed through a self-attention block to form aspect-aware user and item representations. These are concatenated and forwarded to an MLP that predicts a continuous rating score as a regression target. Quantitative comparisons against representative recommendation baselines on Amazon and Yelp datasets are reported in Experimental Results.
├── data/
│ ├── raw/ # Source datasets — place {fname}.{raw_ext} here
│ ├── processed/ # Pipeline parquet caches (preprocessed / aspects)
│ └── ate_output/ # PyABSA workspace + extraction JSON
│ └── .pyabsa/ # Contained pyabsa CWD: checkpoints/, checkpoints.json, result JSON
│
├── model/
│ ├── atrs.py # ATRS architecture, trainer, predictor
│ ├── ATRS Architecture.png # Architecture diagram
│ └── save/ # Best checkpoint per dataset (best.pth)
│
├── src/
│ ├── config.yaml # Single source of truth for all hyperparameters
│ ├── data_processing.py # DataProcessor pipeline + Dataset/DataLoader factory
│ ├── aspect_extraction.py # ATExtractor — PyABSA wrapper for aspect term extraction
│ ├── preprocessing.py # Review-text cleaning and row filters
│ ├── path.py # Project path constants (auto-creates runtime folders)
│ └── utils.py # Generic helpers — I/O, metrics, seeding
│
├── main.py # Entry point: data preparation → train → test
├── requirements.txt
└── README.md

ATRS consists of two sequential modules. Aspect extraction runs in src/aspect_extraction.py (orchestrated by src/data_processing.py); the recommender network is in model/atrs.py. The full architecture is illustrated below.
A pretrained Transformer encoder (PyABSA's English ATE checkpoint, FAST-LCF-ATEPC over DeBERTa-v3-base) reads each cleaned review and emits BIO-tagged aspect terms. Per-row aspect lists are then aggregated into per-user and per-item aspect sets, which become the inputs to the RS module.
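The hand-off from BIO tags to per-user and per-item aspect sets can be illustrated with a minimal sketch. It is not the code in src/aspect_extraction.py: the tag names B-ASP / I-ASP / O, the helper names decode_bio and aggregate_aspects, and the choice to keep duplicate terms are all assumptions made for illustration.

```python
from collections import defaultdict


def decode_bio(tokens, tags):
    """Collect aspect terms from parallel token/tag lists (assumed tags: B-ASP, I-ASP, O)."""
    terms, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-ASP":
            # A new aspect term starts; flush the one in progress, if any.
            if current:
                terms.append(" ".join(current))
            current = [token]
        elif tag == "I-ASP" and current:
            current.append(token)
        else:
            # Outside tag (or a stray I-ASP): close the open term.
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms


def aggregate_aspects(rows, key):
    """Flatten per-row aspect lists into one list per id (duplicates kept)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].extend(row["aspect"])
    return dict(grouped)


rows = [
    {"user_id": "u1", "parent_asin": "i1",
     "aspect": decode_bio(["great", "battery", "life", "and", "screen"],
                          ["O", "B-ASP", "I-ASP", "O", "B-ASP"])},
    {"user_id": "u1", "parent_asin": "i2",
     "aspect": decode_bio(["poor", "sound"], ["O", "B-ASP"])},
    {"user_id": "u2", "parent_asin": "i1",
     "aspect": decode_bio(["price"], ["B-ASP"])},
]

user_aspect_set = aggregate_aspects(rows, "user_id")
item_aspect_set = aggregate_aspects(rows, "parent_asin")
```

Here user u1 ends up with the terms from both of its reviews, and item i1 collects terms written by two different users.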
Each user and item aspect set is tokenized over a Word2Vec-trained vocabulary, encoded by a 1D-CNN (AspectEncoder), and concatenated with a learned ID embedding. The fused vector is projected and passed through a multi-head self-attention + FFN block (SelfAttentionBlock) to yield aspect-aware user and item representations. Their concatenation is fed to an MLP regressor (ATRS.regressor) that outputs the predicted rating.
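Before the 1D-CNN can consume an aspect set, the terms have to become fixed-length sequences of token ids. The sketch below shows one plausible way to do that in plain Python. It is an illustration only: the real vocabulary comes from Word2Vec training, and the reserved ids, the nearest-rank percentile rule, and every function name here are assumptions rather than the repository's implementation.

```python
import math

PAD, UNK = 0, 1  # assumed reserved ids for padding and unknown words


def build_vocab(aspect_sets):
    """Assign an integer id to every word seen in the aspect terms."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for terms in aspect_sets:
        for term in terms:
            for word in term.split():
                vocab.setdefault(word, len(vocab))
    return vocab


def encode(terms, vocab):
    """Map a list of aspect terms to a flat list of token ids."""
    return [vocab.get(word, UNK) for term in terms for word in term.split()]


def percentile_length(sequences, pct):
    """Nearest-rank percentile of the sequence lengths."""
    lengths = sorted(len(seq) for seq in sequences)
    rank = max(1, math.ceil(pct / 100 * len(lengths)))
    return lengths[rank - 1]


def pad(seq, length):
    """Right-pad with PAD, or truncate, to exactly `length` ids."""
    return (seq + [PAD] * length)[:length]


aspect_sets = [
    ["battery life", "screen"],
    ["sound"],
    ["price", "battery life", "sound quality"],
]
vocab = build_vocab(aspect_sets)
encoded = [encode(terms, vocab) for terms in aspect_sets]
max_len = percentile_length(encoded, 50)
padded = [pad(seq, max_len) for seq in encoded]
```

Capping the length at a percentile rather than the maximum keeps a few very long aspect sets from inflating every padded sequence, at the cost of truncating those outliers.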
All hyperparameters live in src/config.yaml — it is the single source of truth. Defaults reproduce the paper experiments.
A CUDA-capable GPU is recommended; main.py falls back to CPU with a warning if CUDA is unavailable. See requirements.txt for the GPU wheel and CPU-only setup.
End-to-end run:
conda create -n atrs python=3.11
conda activate atrs
pip install -r requirements.txt
python main.py

Place the dataset as data/raw/{fname}.{raw_ext} where {fname} and {raw_ext} match data.fname / data.raw_ext in config.yaml. The file is read as JSON-lines (one review object per line). Each line must carry the columns below, or the run aborts at load with a KeyError.
| Column | Role |
|---|---|
| user_id | Reviewer id: user-side aspect aggregation and ID embedding. |
| parent_asin | Product id: item-side aspect aggregation and ID embedding. |
| text | Review body: cleaned, then aspect terms are extracted from it (review_text is also accepted as an alias). |
| rating | Ground-truth rating; the regression target the model predicts. |
| verified_purchase | Boolean flag; only verified-purchase reviews are kept. |
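A raw file can be checked against these required columns before launching the pipeline. The following is a minimal sketch, not part of the repository: the helper names missing_columns and check_jsonl and the sample records are hypothetical, and the check only mirrors the column list in the table above.

```python
import json

REQUIRED = ("user_id", "parent_asin", "rating", "verified_purchase")
TEXT_ALIASES = ("text", "review_text")  # either name satisfies the review-body column


def missing_columns(record):
    """Return the required column names absent from one review object."""
    missing = [col for col in REQUIRED if col not in record]
    if not any(alias in record for alias in TEXT_ALIASES):
        missing.append("text")
    return missing


def check_jsonl(lines):
    """Return (line_number, missing_columns) for every incomplete record."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        missing = missing_columns(json.loads(line))
        if missing:
            problems.append((number, missing))
    return problems


# Hypothetical sample: the second record lacks verified_purchase.
sample = [
    '{"user_id": "u1", "parent_asin": "B01", "text": "Great strings", '
    '"rating": 5.0, "verified_purchase": true}',
    '{"user_id": "u2", "parent_asin": "B02", "review_text": "Too quiet", "rating": 2.0}',
]
problems = check_jsonl(sample)
```

Because check_jsonl accepts any iterable of lines, an open file handle can be passed directly, for example `with open(path, encoding="utf-8") as fh: check_jsonl(fh)`.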
Optional: an aspect column of pre-extracted per-row aspect lists — if present, the PyABSA extraction stage is skipped. Any other columns are ignored. The pipeline writes two cache layers under data/processed/:
- {fname}_preprocessed.parquet: written after text cleaning and the k-core filter.
  - Columns: the required columns above + clean_text (HTML/URL-stripped, lowercased, contraction-expanded, stop-word-removed, lemmatized review body). Any extra raw columns pass through untouched.
- {fname}_aspects.parquet: adds the extracted aspect terms and their per-user/item aggregation.
  - Columns: the preprocessed columns + aspect (per-row aspect-term list), user_aspect_set / item_aspect_set (each id's aspect terms flattened across all its reviews).
On every python main.py, the pipeline resumes from the most-complete cache on disk, checking newest-first (aspects → preprocessed → raw) and falling through to the next-earliest stage. The train/test split, Word2Vec, and sequence padding always run fresh in memory, so changes to test_size, seed, val_ratio, aspect_length_percentile, or w2v_* take effect on the next run. To re-trigger an upstream stage, delete its parquet.
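The newest-first lookup described above can be sketched as follows. This is an illustration of the resume order only, under the assumption that a stage counts as complete when its file exists; the function name resume_stage and the example values "reviews" and "jsonl" are hypothetical and do not come from src/data_processing.py.

```python
import tempfile
from pathlib import Path


def resume_stage(processed_dir, raw_dir, fname, raw_ext):
    """Return (stage, path) for the most complete artifact found on disk."""
    candidates = [
        ("aspects", Path(processed_dir) / f"{fname}_aspects.parquet"),
        ("preprocessed", Path(processed_dir) / f"{fname}_preprocessed.parquet"),
        ("raw", Path(raw_dir) / f"{fname}.{raw_ext}"),
    ]
    # Check newest-first and fall through to earlier stages.
    for stage, path in candidates:
        if path.exists():
            return stage, path
    raise FileNotFoundError(f"no cache or raw file found for {fname!r}")


# Demonstration in a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir, processed_dir = Path(tmp) / "raw", Path(tmp) / "processed"
    raw_dir.mkdir()
    processed_dir.mkdir()
    (raw_dir / "reviews.jsonl").touch()
    first = resume_stage(processed_dir, raw_dir, "reviews", "jsonl")[0]
    (processed_dir / "reviews_preprocessed.parquet").touch()
    second = resume_stage(processed_dir, raw_dir, "reviews", "jsonl")[0]
```

With only the raw file present the lookup starts from "raw"; once the preprocessed parquet exists it resumes from "preprocessed", which is why deleting a parquet is enough to re-trigger that stage.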
ATRS was evaluated on three real-world review datasets: Amazon Musical Instruments, Amazon Video Games, and Yelp (Pennsylvania). ATRS attains the best score on every dataset and metric except MAPE on Yelp, where MFNR is lower (33.923 vs. 34.917), achieving average improvements of 19.54% in MAE and 11.89% in RMSE.
| Model | Musical Instruments | | | | Video Games | | | | Yelp | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | MAE | MSE | RMSE | MAPE | MAE | MSE | RMSE | MAPE | MAE | MSE | RMSE | MAPE |
| PMF | 1.306 | 2.640 | 1.625 | 35.034 | 1.220 | 2.407 | 1.551 | 33.948 | 1.276 | 2.803 | 1.674 | 38.330 |
| NCF | 1.174 | 1.705 | 1.306 | 35.401 | 0.948 | 1.331 | 1.154 | 35.032 | 1.085 | 1.674 | 1.294 | 39.320 |
| DeepCoNN | 0.786 | 1.137 | 1.067 | 29.931 | 0.847 | 1.263 | 1.124 | 32.850 | 0.937 | 1.381 | 1.175 | 38.276 |
| NARRE | 0.767 | 0.993 | 0.997 | 29.459 | 0.776 | 1.173 | 1.083 | 30.518 | 0.886 | 1.212 | 1.101 | 36.724 |
| AENAR | 0.665 | 0.970 | 0.985 | 27.193 | 0.693 | 1.002 | 1.001 | 28.039 | 0.845 | 1.177 | 1.085 | 35.605 |
| SAFMR | 0.705 | 0.975 | 0.987 | 28.388 | 0.711 | 1.033 | 1.016 | 30.016 | 0.881 | 1.229 | 1.109 | 36.076 |
| MFNR | 0.708 | 0.965 | 0.982 | 26.922 | 0.730 | 0.980 | 0.990 | 27.863 | 0.855 | 1.174 | 1.084 | 33.923 |
| ATRS (Proposed) | 0.640 | 0.933 | 0.966 | 26.638 | 0.646 | 0.970 | 0.985 | 27.537 | 0.832 | 1.163 | 1.078 | 34.917 |
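For reference, the four reported metrics can be computed as in the minimal sketch below. It assumes MAPE is expressed as a percentage (consistent with the magnitudes in the table) and that ratings are non-zero; the function name regression_metrics and the sample ratings are hypothetical and are not taken from src/utils.py.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and MAPE (in percent) for paired ratings."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the true rating, so it is undefined for a rating of 0.
    mape = 100 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


# Hypothetical ratings on a 1-5 scale.
metrics = regression_metrics([5, 4, 2, 1], [4, 4, 3, 2])
```

Note that RMSE is the square root of MSE, so the two columns in the table rank models identically; MAPE weights errors on low ratings more heavily than MAE does.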
If you use this repository in your research, please cite:
@article{LIM2026123078,
title = {Reducing contextual noise in review-based recommendation via aspect term extraction and attention modeling},
author = {Heena Lim and Xinzhe Li and Seonu Park and Qinglong Li and Jaekyeong Kim},
journal = {Information Sciences},
volume = {735},
pages = {123078},
year = {2026},
doi = {10.1016/j.ins.2026.123078}
}

For research inquiries or collaborations, please contact:
Seonu Park, Ph.D. Student, Department of Big Data Analytics, Kyung Hee University. Email: sunu0087@khu.ac.kr
Qinglong Li, Assistant Professor, Division of Computer Engineering, Hansung University. Email: leecy@hansung.ac.kr
Last updated: June 2026
