Skip to content

cornshakes/scrollsurf

Repository files navigation

Scrollsurf

Visit this app on my raspberry and read the diary!

Scrollsurf lets you scroll through wikipedia article abstracts, like/dislike them, and visit the full articles on wikipedia itself. The articles that it shows you are randomly selected from these datasets:

Getting Started

npm install

Before you run the app for the first time, you have to download the datasets that you want using the provided package scripts. The downloads take a long time, but one dataset is enough to run the app: First, you have to add an .env file in scripts/, next to .env.example, and add your own User-Agent. Then you can

npm run download-vital-50000
npm run download-unusual
npm run download-good-articles
npm run download-featured-articles
npm run download-featured-pictures
npm run download-commons-featured-pictures
npm run download-quotes

Then, you can categorize the articles by running

npm run categorize

Currently, that's not very useful - it just builds a huge category tree that you can look at. After downloading at least one dataset, you can

npm run dev

and go to http://localhost:3000

Clicks, Likes & Dislikes

Different datasets come with slightly different topics per item, e.g. "Military" and "Warfare". To be able to better use topics in influencing the random feed, these topics have been manually grouped into "buckets". The topic_buckets table maps each (dataset, topic) pair to its bucket; an unmapped pair is its own bucket. The signals: The feed is random, but influenced by user activity. Three signals are tracked per bucket.

  • Like counts +1
  • Dislike counts −1
  • Following a link counts +0.5

These are averaged over seen items of that bucket, so a bucket needs a few signals before it starts to move — one stray like won't change much.

Unseen items are then drawn with weights based on the average affinity of their buckets, i.e. liked buckets show up more often, disliked buckets show up less.

Without any votes (or without the consent cookie, see below) the feed is random.

Example

Say you've scrolled for a while and your history per bucket looks like this:

Bucket Seen Likes Dislikes Clicks Affinity = (likes + 0.5·clicks − dislikes) / (seen + 5)
Vital → History 15 6 0 2 (6 + 1 − 0) / 20 = 0.35
Vital → Sports 15 0 6 0 (0 + 0 − 6) / 20 = −0.30
Vital → Arts 4 1 0 0 (1 + 0 − 0) / 9 = 0.11
anything you haven't voted on 0

The + 5 in the denominator is the smoothing: the lone Arts like only gets a third of the affinity of the six History likes, even though it's a 100% like rate.

Each bucket's affinity is clamped to ±2 (AFFINITY_CLAMP) so no single bucket can run away. Each unseen item then gets a weight of exp(2 · affinity) (the 2 is AFFINITY_STRENGTH):

Item tagged Mean affinity Weight
History 0.35 exp(0.70) ≈ 2.0
Sports −0.30 exp(−0.60) ≈ 0.55
Arts 0.11 exp(0.22) ≈ 1.25
History and Sports (0.35 − 0.30) / 2 = 0.025 exp(0.05) ≈ 1.05
no voted buckets 0 exp(0) = 1.0

The weight is the item's relative chance per feed slot: a History item is about twice as likely to appear as a neutral one, and about 3.7× as likely as a Sports one — but even Sports items keep showing up at roughly half the neutral rate. An item in both a liked and a disliked bucket lands back near neutral, because affinities are averaged across its buckets.

The SQL behind it

There are no per-topic queries and no mixing of result sets in TypeScript — the whole draw happens inside one SELECT over all three item types. Items (articles, pictures, quotes) share a unified items supertype and a single item_topics table, so the selection is type-agnostic; per-type payload columns are fetched afterwards. The statement is assembled from shared SQL fragments in src/lib/db/affinity.ts (the constants from the example are baked into the string; only $user_id and $limit are bound at query time) and chains a handful of CTEs (src/lib/db/affinity.ts) before the actual selection (src/lib/db/feed.ts):

WITH clicked AS (              -- distinct items you clicked links on
  SELECT DISTINCT item_id FROM user_clicks WHERE user_id = $user_id
),
item_buckets AS (             -- map each item's (dataset, topic) rows to buckets
  SELECT DISTINCT item_id,
         COALESCE(bucket, dataset || topic) AS bucket  -- unmapped pair → its own bucket
  FROM item_topics LEFT JOIN topic_buckets USING (dataset, topic)
),
bucket_affinity AS (           -- the first table from the example, grouped by bucket:
  SELECT bucket,               -- one GROUP BY over your seen items
         (1.0*likes + 0.5*clicks - 1.0*dislikes) / (seen + 5) AS affinity
  FROM user_items JOIN item_buckets USING (item_id) LEFT JOIN clicked ...
  WHERE user_id = $user_id
  GROUP BY bucket
),
item_affinity AS (             -- the second table: AVG over each item's buckets
  SELECT item_id, AVG(COALESCE(affinity, 0)) AS affinity
  FROM item_buckets LEFT JOIN bucket_affinity USING (bucket)
  GROUP BY item_id
),
eligible_pool AS (             -- unseen items that have at least one topic
  SELECT type, id FROM items WHERE <unseen by $user_id> AND <has a topic>
),
pool_size AS (                 -- count of eligible items per type
  SELECT type, COUNT(*) AS n FROM eligible_pool GROUP BY type
)
SELECT p.type, p.id
FROM eligible_pool p
JOIN pool_size ps ON ps.type = p.type
LEFT JOIN item_affinity ia ON ia.item_id = p.id
WHERE p.type IN ('article', 'picture', 'quote')
ORDER BY
  -ln(random_0_to_1)
  / ( exp(2 * clamp(ia.affinity, -2, 2))     -- affinity weight
      * type_share                            -- TYPE_SHARES[p.type]
      / max(ps.n, 1) )                        -- ÷ pool size of that type
LIMIT $limit

The ORDER BY line is the whole sampling trick (Efraimidis–Spirakis): every candidate row draws its own uniform random number, the weight stretches it, and taking the smallest n keys is mathematically the same as drawing n items without replacement with probability proportional to weight. So the "randomness" and the "weighting" live in the same expression — there's no second pass, no shuffle in TS.

The weight has two factors. The first is the affinity term from the example (clamped to ±2). The second is a per-type share: type_share / pool_size, where type_share is the fixed TYPE_SHARES map in feed.ts (article 0.82, picture 0.1, quote 0.08) and pool_size is the count of eligible items of that type. Dividing by pool size makes each type's expected fraction equal its share ÷ Σshares, independent of how many items each pool actually holds. A type with share 0, or absent from the map, is hard-excluded by the WHERE clause.

For anonymous (and brand-new) users $user_id is NULL, which matches nothing in the signal CTEs, so every item falls back to affinity 0 → uniform within each type, through the exact same query. The selected (type, id) rows are then hydrated into full Article / Picture / Quote payloads per type.

Cookies, Accounts & Login

Once the user agrees to use cookies, their seen items, likes and clicks are stored with the cookie as the key. So when the cookie is cleared or expires after inactivity, that data is lost. Logging in with an email binds that history to an account, which multiple browser cookies can then point at.

The tokens table maps each browser cookie to a row in users, and several cookies can point at the same account. An anonymous user is a users row with no email. Stale cookies are swept after USER_INACTIVITY_DAYS (default 14 of inactivity by cleanup_inactive_users on startup — the underlying user row and its history survive, only the cookie is dropped.

Passwordless email login

Login is code-only — there is no password:

  1. Enter your email → the server create_login_code creates a single-use 6-digit code, upserted into the login_codes table (one row per email, 15-minute expiry, 60-second resend rate-limit) and sent over SMTP. If SMTP_HOST is unset (dev/e2e) the code is just logged to the server console instead of emailed.
  2. Enter the code → verify_login_code checks it and deletes it (single use; wrong attempts don't delete it, so you can retry until it expires).
  3. On success attach_login binds your browser token to an account, and submit_login_code sets ss_uid and grants consent — logging in implies consent.

Merging / switching user histories on login

attach_login reconciles any existing votes/clicks on login. It resolves the current browser cookie and the account matching the email, then picks one of these branches:

Situation Outcome
Token already points at this account No-op, stay logged in
You're anonymous, account exists Merge — fold your anon history into the account
You're logged into another account Switch — repoint just this browser's cookie; no merge
No cookie at all, account exists Mint a fresh cookie for the account
You're anonymous, email is new Promote — your anon user becomes the account (keeps all history)
No cookie, email is new Create a brand-new empty account

(merge_anon_to_account)

When "logging in", i.e. changing from anonymous to email, the anonymous account's votes are merged into the existing account's votes with the existing one taking precedence on conflict. Append-only clicks are carried over (added, never replacing the account's), the cookie of the anonymous identity is repointed to the account, and the anon user row is deleted. When "switching" accounts, i.e. changing from one email account to another, the browser cookie is pointed to the other account without any further changes.

Consent & revoking it

Consent is recorded in the client-readable ss_consent cookie (granted / denied / unknown), surfaced through ConsentContext in CookieConsent.tsx. Voting and link-click tracking are consent-gated: without granted consent the client never fires the request and opens the consent dialog instead.

Revoking consent while logged in (unlink_email) clears the email field only — the account's history stays intact but is no longer recoverable by email re-login. Logging in again with the same email therefore lands in the "email is new" branch and creates a fresh, empty account.

Integration Testing

All e2e tests run against a small example database (e2e/.data/). That database is created from the downloaded datasets using the test:e2e:create-db script. It is committed so that you don't have to download all datasets before being able to run e2e tests.

npm run test:e2e            # run all integration tests (seeds DB automatically)
npm run playwright-ui       # same, but with Playwright's interactive UI
npm run test:e2e:update     # updates screenshots

Future inspiration

These Main topic classifications are not what I have

Wikipedia:Contents

why not reddit

Wikipedia:Categorization

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages