Scrollsurf

Visit this app on my raspberry and read the diary!

Scrollsurf lets you scroll through wikipedia article abstracts, like/dislike them, and visit the full articles on wikipedia itself. The articles that it shows you are randomly selected from these datasets:

Getting Started

npm install

Before you run the app for the first time, you have to download the datasets that you want using the provided package scripts. The downloads take a long time, but one dataset is enough to run the app: First, you have to add an .env file in scripts/, next to .env.example, and add your own User-Agent. Then you can

npm run download-vital-50000
npm run download-unusual
npm run download-good-articles
npm run download-featured-articles
npm run download-featured-pictures
npm run download-commons-featured-pictures
npm run download-quotes

Then, you can categorize the articles by running

npm run categorize

Currently, that's not very useful - it just builds a huge category tree that you can look at. After downloading at least one dataset, you can

npm run dev

and go to http://localhost:3000

Clicks, Likes & Dislikes

Different datasets come with slightly different topics per item, e.g. "Military" and "Warfare". To be able to better use topics in influencing the random feed, these topics have been manually grouped into "buckets". The topic_buckets table maps each (dataset, topic) pair to its bucket; an unmapped pair is its own bucket. The signals: The feed is random, but influenced by user activity. Three signals are tracked per bucket.

Like counts +1
Dislike counts −1
Following a link counts +0.5

These are averaged over seen items of that bucket, so a bucket needs a few signals before it starts to move — one stray like won't change much.

Unseen items are then drawn with weights based on the average affinity of their buckets, i.e. liked buckets show up more often, disliked buckets show up less.

Without any votes (or without the consent cookie, see below) the feed is random.

Example

Say you've scrolled for a while and your history per bucket looks like this:

Bucket	Seen	Likes	Dislikes	Clicks	Affinity = (likes + 0.5·clicks − dislikes) / (seen + 5)
Vital → History	15	6	0	2	(6 + 1 − 0) / 20 = 0.35
Vital → Sports	15	0	6	0	(0 + 0 − 6) / 20 = −0.30
Vital → Arts	4	1	0	0	(1 + 0 − 0) / 9 = 0.11
anything you haven't voted on					0

The + 5 in the denominator is the smoothing: the lone Arts like only gets a third of the affinity of the six History likes, even though it's a 100% like rate.

Each bucket's affinity is clamped to ±2 (AFFINITY_CLAMP) so no single bucket can run away. Each unseen item then gets a weight of exp(2 · affinity) (the 2 is AFFINITY_STRENGTH):

Item tagged	Mean affinity	Weight
History	0.35	exp(0.70) ≈ 2.0
Sports	−0.30	exp(−0.60) ≈ 0.55
Arts	0.11	exp(0.22) ≈ 1.25
History and Sports	(0.35 − 0.30) / 2 = 0.025	exp(0.05) ≈ 1.05
no voted buckets	0	exp(0) = 1.0

The weight is the item's relative chance per feed slot: a History item is about twice as likely to appear as a neutral one, and about 3.7× as likely as a Sports one — but even Sports items keep showing up at roughly half the neutral rate. An item in both a liked and a disliked bucket lands back near neutral, because affinities are averaged across its buckets.

The SQL behind it

There are no per-topic queries and no mixing of result sets in TypeScript — the whole draw happens inside one SELECT over all three item types. Items (articles, pictures, quotes) share a unified items supertype and a single item_topics table, so the selection is type-agnostic; per-type payload columns are fetched afterwards. The statement is assembled from shared SQL fragments in src/lib/db/affinity.ts (the constants from the example are baked into the string; only $user_id and $limit are bound at query time) and chains a handful of CTEs (src/lib/db/affinity.ts) before the actual selection (src/lib/db/feed.ts):

WITH clicked AS (              -- distinct items you clicked links on
  SELECT DISTINCT item_id FROM user_clicks WHERE user_id = $user_id
),
item_buckets AS (             -- map each item's (dataset, topic) rows to buckets
  SELECT DISTINCT item_id,
         COALESCE(bucket, dataset || topic) AS bucket  -- unmapped pair → its own bucket
  FROM item_topics LEFT JOIN topic_buckets USING (dataset, topic)
),
bucket_affinity AS (           -- the first table from the example, grouped by bucket:
  SELECT bucket,               -- one GROUP BY over your seen items
         (1.0*likes + 0.5*clicks - 1.0*dislikes) / (seen + 5) AS affinity
  FROM user_items JOIN item_buckets USING (item_id) LEFT JOIN clicked ...
  WHERE user_id = $user_id
  GROUP BY bucket
),
item_affinity AS (             -- the second table: AVG over each item's buckets
  SELECT item_id, AVG(COALESCE(affinity, 0)) AS affinity
  FROM item_buckets LEFT JOIN bucket_affinity USING (bucket)
  GROUP BY item_id
),
eligible_pool AS (             -- unseen items that have at least one topic
  SELECT type, id FROM items WHERE <unseen by $user_id> AND <has a topic>
),
pool_size AS (                 -- count of eligible items per type
  SELECT type, COUNT(*) AS n FROM eligible_pool GROUP BY type
)
SELECT p.type, p.id
FROM eligible_pool p
JOIN pool_size ps ON ps.type = p.type
LEFT JOIN item_affinity ia ON ia.item_id = p.id
WHERE p.type IN ('article', 'picture', 'quote')
ORDER BY
  -ln(random_0_to_1)
  / ( exp(2 * clamp(ia.affinity, -2, 2))     -- affinity weight
      * type_share                            -- TYPE_SHARES[p.type]
      / max(ps.n, 1) )                        -- ÷ pool size of that type
LIMIT $limit

The ORDER BY line is the whole sampling trick (Efraimidis–Spirakis): every candidate row draws its own uniform random number, the weight stretches it, and taking the smallest n keys is mathematically the same as drawing n items without replacement with probability proportional to weight. So the "randomness" and the "weighting" live in the same expression — there's no second pass, no shuffle in TS.

The weight has two factors. The first is the affinity term from the example (clamped to ±2). The second is a per-type share: type_share / pool_size, where type_share is the fixed TYPE_SHARES map in feed.ts (article 0.82, picture 0.1, quote 0.08) and pool_size is the count of eligible items of that type. Dividing by pool size makes each type's expected fraction equal its share ÷ Σshares, independent of how many items each pool actually holds. A type with share 0, or absent from the map, is hard-excluded by the WHERE clause.

For anonymous (and brand-new) users $user_id is NULL, which matches nothing in the signal CTEs, so every item falls back to affinity 0 → uniform within each type, through the exact same query. The selected (type, id) rows are then hydrated into full Article / Picture / Quote payloads per type.

Cookies, Accounts & Login

Once the user agrees to use cookies, their seen items, likes and clicks are stored with the cookie as the key. So when the cookie is cleared or expires after inactivity, that data is lost. Logging in with an email binds that history to an account, which multiple browser cookies can then point at.

The tokens table maps each browser cookie to a row in users, and several cookies can point at the same account. An anonymous user is a users row with no email. Stale cookies are swept after USER_INACTIVITY_DAYS (default 14 of inactivity by cleanup_inactive_users on startup — the underlying user row and its history survive, only the cookie is dropped.

Passwordless email login

Login is code-only — there is no password:

Enter your email → the server create_login_code creates a single-use 6-digit code, upserted into the login_codes table (one row per email, 15-minute expiry, 60-second resend rate-limit) and sent over SMTP. If SMTP_HOST is unset (dev/e2e) the code is just logged to the server console instead of emailed.
Enter the code → verify_login_code checks it and deletes it (single use; wrong attempts don't delete it, so you can retry until it expires).
On success attach_login binds your browser token to an account, and submit_login_code sets ss_uid and grants consent — logging in implies consent.

Merging / switching user histories on login

attach_login reconciles any existing votes/clicks on login. It resolves the current browser cookie and the account matching the email, then picks one of these branches:

Situation	Outcome
Token already points at this account	No-op, stay logged in
You're anonymous, account exists	Merge — fold your anon history into the account
You're logged into another account	Switch — repoint just this browser's cookie; no merge
No cookie at all, account exists	Mint a fresh cookie for the account
You're anonymous, email is new	Promote — your anon user becomes the account (keeps all history)
No cookie, email is new	Create a brand-new empty account

(`merge_anon_to_account`)

When "logging in", i.e. changing from anonymous to email, the anonymous account's votes are merged into the existing account's votes with the existing one taking precedence on conflict. Append-only clicks are carried over (added, never replacing the account's), the cookie of the anonymous identity is repointed to the account, and the anon user row is deleted. When "switching" accounts, i.e. changing from one email account to another, the browser cookie is pointed to the other account without any further changes.

Consent & revoking it

Consent is recorded in the client-readable ss_consent cookie (granted / denied / unknown), surfaced through ConsentContext in CookieConsent.tsx. Voting and link-click tracking are consent-gated: without granted consent the client never fires the request and opens the consent dialog instead.

Revoking consent while logged in (unlink_email) clears the email field only — the account's history stays intact but is no longer recoverable by email re-login. Logging in again with the same email therefore lands in the "email is new" branch and creates a fresh, empty account.

Integration Testing

All e2e tests run against a small example database (e2e/.data/). That database is created from the downloaded datasets using the test:e2e:create-db script. It is committed so that you don't have to download all datasets before being able to run e2e tests.

npm run test:e2e            # run all integration tests (seeds DB automatically)
npm run playwright-ui       # same, but with Playwright's interactive UI
npm run test:e2e:update     # updates screenshots

Future inspiration

These Main topic classifications are not what I have

Wikipedia:Contents

why not reddit

Wikipedia:Categorization

Name		Name	Last commit message	Last commit date
Latest commit History 118 Commits
e2e		e2e
plan		plan
public		public
scripts		scripts
src		src
tailscale		tailscale
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
DEPLOY.md		DEPLOY.md
DIARY.md		DIARY.md
Dockerfile		Dockerfile
README.md		README.md
docker-compose.prod.yml		docker-compose.prod.yml
docker-compose.test.yml		docker-compose.test.yml
eslint.config.mjs		eslint.config.mjs
jest.config.ts		jest.config.ts
next.config.ts		next.config.ts
package-lock.json		package-lock.json
package.json		package.json
playwright.config.ts		playwright.config.ts
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Scrollsurf

Getting Started

Clicks, Likes & Dislikes

Example

The SQL behind it

Cookies, Accounts & Login

Passwordless email login

Merging / switching user histories on login

(`merge_anon_to_account`)

Consent & revoking it

Integration Testing

Future inspiration

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Scrollsurf

Getting Started

Clicks, Likes & Dislikes

Example

The SQL behind it

Cookies, Accounts & Login

Passwordless email login

Merging / switching user histories on login

(merge_anon_to_account)

Consent & revoking it

Integration Testing

Future inspiration

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

(`merge_anon_to_account`)

Packages