An AI agent that lives on your Android phone.
You speak. BabyBot listens. It sees what's on your screen, drives your apps, reads your documents, remembers the things and people in your life, and brings your choice of LLM — cloud or local — to make the decisions.
"Send a WhatsApp to my brother saying I'll be 20 minutes late."
— Phone wakes. Opens WhatsApp. Finds the contact. Types the message. Sends. Reads back: "Done."
That's a real BabyBot session. Voice in, action out. You don't touch the phone.
Closed beta is open right now. Hop into the Discord, join the testers group, install from Play Store, talk to your phone.
There are a lot of AI apps on the Play Store. Most are chat wrappers. BabyBot isn't.
| Capability | ChatGPT app | Gemini | Claude app | Bixby / Assistant | BabyBot |
|---|---|---|---|---|---|
| Cloud LLM chat | ✅ | ✅ | ✅ | ✅ | ✅ |
| Run on local LLM (Ollama) | ❌ | ❌ | ❌ | ❌ | ✅ |
| Sees your screen | ❌ | partial | ❌ | partial | ✅ |
| Drives other apps for you | ❌ | partial | ❌ | partial (Google stack) | ✅ |
| Voice-driven hands-free | partial | ✅ | partial | ✅ | ✅ |
| Reads PDF / Word / Excel / PowerPoint | ❌ | partial | ❌ | ❌ | ✅ |
| Persistent visual memory of your world | ❌ | ❌ | ❌ | ❌ | ✅ |
| Hybrid web image identification | ❌ | ❌ | ❌ | ❌ | ✅ |
| No analytics, no ads, no tracking | ❌ | ❌ | ❌ | ❌ | ✅ |
The combination is the point. Lots of apps do one or two cells. BabyBot does the whole table.
A morning with BabyBot, roughly:
- "What's on my calendar today and the weather?" — answered in one breath
- "Let's go home" — turn-by-turn navigation starts in Google Maps
- "Text my wife that I'll be late" — opens Messages, types, sends
- "Send a WhatsApp to my brother saying 'on my way'" — opens WhatsApp, finds the contact, sends
- "What's in this PDF?" (attached) — summary in plain English, key points pulled out
- "What component is this?" (photo attached) — identifies the part, finds the spec sheet, summarizes
- "Is this my dog?" (photo attached) — checks against your saved photos, knows
- "Take a photo and tell me what's in it" — opens camera, you snap, it tells you
- "Set a timer for 25 minutes" — done, no taps
- "Read what's on my screen and email it to me" — accessibility-driven, opens email, types, sends
- "Remember this as my motorbike" — saved. Next month when you photograph it from a different angle, BabyBot knows it's yours.
No app switching. No copy-paste. No menus. Just talk to BabyBot.
BabyBot reads what's on screen and takes real actions in any app. Open WhatsApp, type a message, hit send. Compose an email in Gmail. Drive turn-by-turn navigation. Set alarms and timers. Take a photo. Control volume. Make a phone call. Send an SMS. Look up a contact. Copy something to your clipboard. All from voice commands, with the phone in your pocket if you want.
Powered by Android's Accessibility Service, which lets BabyBot interact with apps that don't expose a public API. This is why BabyBot can send WhatsApp messages with no Meta API access — it drives the actual WhatsApp UI the way you would.
Attach a photo and a five-stage on-device vision pipeline runs before the LLM sees the image:
- EXIF — date, location (if you allow), device that took the shot
- OCR — Google ML Kit text recognition, multi-resolution
- Content labels — ML Kit subject categorization with confidence scores
- Object detection — ML Kit object localizer with bounding boxes
- Visual embeddings — optional MobileNet V3 Large download (~22MB), enables cross-angle / cross-source identification
Results compose into a structured digest fed to the model. The effect: even small local models (Gemma 4B, Phi 4, Qwen 3B) make confident identifications because they're not working from raw pixels alone.
For online matching, BabyBot uses a hybrid perceptual hash + embedding pipeline to find products, places, components, and objects in web search results. It then pulls the source page content too — so the answer isn't just "this is X," it's "this is X, sold for Y, with features A, B, C."
This one's worth pausing on, because nobody else ships this.
Save photos of the things and people that matter to you — your pet, your car, your kids, the components in your workshop, your friends' faces. BabyBot builds a local visual memory unique to you. Show it a new photo later — could be days, months, years — and it knows whether that's your dog Max chilling on the sofa, or just a dog. Recognizes subjects across years, angles, lighting changes, growth.
No training required. No cloud round-trips. No accounts. Add a photo, it's instantly recognizable. Add 20 photos across different angles and lighting, it gets robust.
Cloud assistants forget your pet the moment the conversation ends. BabyBot remembers forever.
Attach a PDF, a Word doc, an Excel spreadsheet, a PowerPoint deck, source code in 80+ formats — BabyBot reads it natively. No copy-paste required.
"What does this contract say about termination?" — reads the PDF
"Find the highest Q3 number in this spreadsheet." — reads the Excel
"Summarize this slide deck." — reads the PowerPoint
"What's the main argument in this report?" — reads the Word doc
"Explain what this code does." — reads the source
Plus document generation in the other direction: BabyBot can write a PDF from your conversation, save it to your Downloads folder, hand it back as a tappable file chip in chat.
Beyond chat, BabyBot has a real tool layer that gets used during agentic runs:
- Multi-stage web research — query planning, parallel fetching, content extraction, ranking, deduplication. Not a toy search wrapper.
- Planner / worker mode — frontier model writes a spec, cheaper local model executes against it. Frontier-quality decisions at local-token cost.
- Continue button — long agentic tasks resume across sessions, app restarts, even days later
- Persistent context — your conversations save locally, encrypted at rest
- Built-in file editor — view, edit, save files inside BabyBot. IDE-lite on your phone.
- Image generation — synthesize new images when you ask for them
- Live map rendering — any location, inline in chat
Six backends. Configure them all. Switch between them mid-conversation.
- Anthropic — Claude family
- OpenAI — GPT-4 family, GPT-5, o-series
- Google Gemini — Pro and Flash
- OpenRouter — access dozens of models through one API
- Ollama — point at any local server (LAN or DDNS), use any model: Gemma, Llama, Qwen, Mistral, GLM, anything
- Custom — any OpenAI-compatible endpoint
Drop a frontier model for hard reasoning, switch to a local model for cheap follow-ups, keep the same conversation and tools. The agent stack doesn't care which provider answers.
If you only use Ollama with web search disabled, nothing leaves your network.
BabyBot doesn't roadmap features it hasn't built. Nothing in this README is a placeholder, a stub, or a half-finished experiment. If a capability is listed, it's because it survived the polish cycle before shipping to beta.
In a Play Store full of vibe-coded weekend AI apps where half the buttons don't work, that matters.
This README covers the headline capabilities. There's more. A lot more. Some tools and behaviours are intentionally undocumented because finding them is part of the fun.
Just ask BabyBot for things. If a capability is there, it'll use it. If it isn't, BabyBot will tell you honestly. The full surface area is bigger than this page — discover the rest by talking to your phone.
- No analytics from BabyBot itself. No third-party SDK, no crash reporter that phones home, no advertising IDs.
- No ads. Not now, not planned, not ever.
- Cloud LLM calls go directly to whichever provider you configured. There is no BabyBot server in the middle.
- Local mode with Ollama + web search off = nothing leaves your network.
- Location, contacts, photos are opt-in per-feature. Default off.
- API keys are stored encrypted on-device.
- Google ML Kit anonymous SDK telemetry is present because it ships with ML Kit's bundled libraries — same as every other Android app that uses Google's on-device ML stack. Not user data, just SDK metrics. Planned to be replaced with raw TFLite in a future version to eliminate this.
Coming next:
- Faster screen reading (the main UX bottleneck right now — multi-step automation can take 40-60s, target is under 15s)
- WorkManager-backed background ML downloads
- Per-component sub-image matching via on-device object detection
On the horizon:
- Android Auto support — voice-driven AI agent in your car. BYO LLM. Privacy-first alternative to Google Assistant in the dashboard.
- MobileCLIP integration for substantially better semantic image matching
- Improved local-model tool-call reliability across the Ollama model zoo
Long-term:
- iOS port (no committed timeline)
- Multi-modal voice (speak with feeling, not just transcribe)
Fastest way to get BabyBot on your phone:
- Hop into the Discord for chat, support, demos, and direct feedback → discord.gg/qaSrVm2cAg
- Join the testers group (required for Play Store closed-test access) → groups.google.com/g/cutitas85
- Wait ~5 minutes for the group access to propagate
- Open the Play Store link → play.google.com/store/apps/details?id=baby.bot.app
- Install, configure a backend (point at your Ollama or paste an API key), say hi in Discord
Looking for feedback specifically on:
- Real-world image identification — products, places, plants, components, anything
- Voice-driven multi-app automation — WhatsApp, Gmail, Maps, alarms, contacts
- Local Ollama setup pain points
- Battery / performance with the embedding tier active
- Any tool-call weirdness across different local models
- Any document-reading edge cases (heavily-formatted PDFs, multi-sheet Excels)
Feedback channels: the Discord (fastest, best for back-and-forth), Play Store internal-testing feedback, or open an issue on this repo.
BabyBot is built by Robert Nilges / PixCats
Other projects:
- SteamWidget — Steam stats widget for Windows
- MyLuck — also on the Play Store
BabyBot is closed-source software. This repository exists as a public face for the project — documentation, issue tracking, release notes, beta coordination. The application source code is not published.
Compiled APKs are distributed exclusively through Google Play (currently in closed testing). Sideloaded redistribution of the APK is not authorized. Reverse engineering, decompilation, and disassembly are prohibited except where expressly permitted by applicable law.
Copyright © 2025–2026 Robert Nilges / PixCats. All rights reserved.
Built in Android Studio, with too much coffee ☕