🔍 ContentLens-CLI

Lightweight command-line engine for extracting and structuring web content
Zero Dependencies | Cross-Platform | Python 3.8+

中文 | 繁體中文 | English | 日本語 | Español

中文

🎉 项目介绍

ContentLens-CLI 是一款轻量级终端智能网页内容提取与结构化引擎，专为开发者、数据分析师和AI应用场景设计。它能够从任意网页中智能提取文章正文、链接、图片、元数据等内容，并以 JSON、Markdown、CSV、纯文本等多种格式输出。

灵感来源：在AI编程助手日益普及的今天，开发者频繁需要从网页获取结构化信息。然而现有工具要么依赖重量级浏览器引擎（如Playwright），要么需要安装大量第三方库。ContentLens-CLI 诞生于这一痛点——零外部依赖，纯Python标准库实现，开箱即用。

核心差异化亮点：

🚫 零依赖：仅使用Python标准库，无需安装任何第三方包
🧠 智能提取：基于文本密度+标签权重评分算法，自动识别正文区域
🎯 多模式提取：支持文章正文、链接列表、图片列表、元数据、全量结构化5种模式
📄 多格式输出：JSON、Markdown、CSV、纯文本四种输出格式
💾 本地缓存：SHA-256哈希映射 + TTL过期机制，避免重复请求
📦 批量处理：支持从文件读取URL列表，带进度条和速率限制
🌍 跨平台：Windows、macOS、Linux 全平台兼容

✨ 核心特性

🧠 智能正文提取：基于文本密度分析和HTML标签权重评分，自动识别并提取文章正文区域，有效过滤导航栏、广告、评论区等噪声内容
🔗 链接提取：智能提取页面所有链接，自动区分内部链接与外部链接，支持按类型过滤
🖼️ 图片提取：提取页面所有图片URL，支持按最小尺寸过滤，自动识别图片类型
📋 元数据提取：全面提取页面元数据，包括title、description、keywords、Open Graph标签、Twitter Card等
🔄 本地缓存引擎：基于SHA-256哈希的文件缓存系统，支持TTL过期策略，有效减少重复网络请求
📊 批量处理：支持从文本文件批量读取URL列表，内置速率限制和进度显示
🎨 美观CLI界面：ASCII艺术Banner、ANSI彩色输出、对齐表格、进度条

🚀 快速开始

环境要求：

Python 3.8 或更高版本
无需安装任何第三方依赖

安装：

# 从源码安装
git clone https://github.com/gitstq/ContentLens-CLI.git
cd ContentLens-CLI
pip install -e .

# 或直接使用（无需安装）
cd ContentLens-CLI
python -m contentlens.cli --help

基本使用：

# 提取文章正文（JSON格式）
contentlens extract https://example.com --type article --format json

# 提取文章正文（Markdown格式）
contentlens extract https://example.com --type article --format markdown

# 提取页面所有链接
contentlens extract https://example.com --type links --format json

# 提取页面所有图片
contentlens extract https://example.com --type images --format csv

# 提取页面元数据
contentlens extract https://example.com --type metadata --format markdown

# 全量结构化提取
contentlens extract https://example.com --type full --format json

# 保存到文件
contentlens extract https://example.com --type article --format markdown --output article.md

# 批量处理
contentlens batch urls.txt --type article --format json --output ./results

# 查看缓存信息
contentlens cache --info

# 清空缓存
contentlens cache --clear

# 查看版本
contentlens version

📖 详细使用指南

Python API 使用：

from contentlens import WebExtractor

# 创建提取器实例
extractor = WebExtractor()

# 一站式URL提取
result = extractor.extract_url("https://example.com")
print(result)

# 自定义提取
html = extractor.fetch("https://example.com", timeout=30)
article = extractor.extract(html, extract_type="article")
print(article["title"])
print(article["content"])

# 提取链接
links = extractor.extract(html, extract_type="links")
for link in links["links"]:
    print(f"{link['text']}: {link['url']}")

# 提取元数据
metadata = extractor.extract(html, extract_type="metadata")
print(metadata)

输出格式说明：

格式	说明	适用场景
`json`	结构化JSON数据	API集成、数据处理
`markdown`	Markdown格式文档	文档生成、笔记
`csv`	CSV表格数据	数据分析、Excel导入
`text`	纯文本	快速预览、简单场景

提取模式说明：

模式	说明	输出字段
`article`	文章正文提取	title, author, publish_date, content, word_count, reading_time
`links`	链接列表提取	url, text, type(internal/external), is_internal
`images`	图片列表提取	url, alt, width, height, srcset
`metadata`	页面元数据提取	title, description, keywords, og:, twitter:
`full`	全量结构化提取	以上所有内容的组合

💡 设计思路与迭代规划

设计理念：

极简主义：零外部依赖，纯标准库实现，降低使用门槛
智能优先：基于算法的内容提取，而非简单的正则匹配
开发者友好：清晰的CLI接口 + 完整的Python API

技术选型原因：

使用 html.parser 而非 BeautifulSoup：零依赖，标准库内置
使用 urllib 而非 requests：零依赖，标准库内置
基于文本密度+标签权重的提取算法：比正则更智能，比NLP更轻量

后续迭代规划：

添加JavaScript渲染支持（可选依赖模式）
支持自定义提取规则配置文件
添加PDF内容提取支持
支持输出为HTML格式
添加MCP Server集成

📦 打包与部署指南

本项目为纯Python工具库，无需打包为可执行文件。

# 安装为可执行命令
pip install -e .

# 安装后即可在任意目录使用 contentlens 命令

# 卸载
pip uninstall contentlens-cli

兼容环境：

Python 3.8+
Windows / macOS / Linux
无需网络代理（直连模式）

🤝 贡献指南

欢迎社区贡献！请遵循以下规范：

Fork 本仓库
创建特性分支 (git checkout -b feature/amazing-feature)
提交更改 (git commit -m 'feat: add amazing feature')
推送到分支 (git push origin feature/amazing-feature)
创建 Pull Request

提交规范：使用 Angular 提交格式

feat: 新增功能
fix: 修复问题
docs: 文档更新
refactor: 代码重构
test: 测试相关
chore: 构建/工具链相关

📄 开源协议

本项目基于 MIT License 开源。

繁體中文

🎉 專案介紹

ContentLens-CLI 是一款輕量級終端智慧網頁內容擷取與結構化引擎，專為開發者、資料分析師和AI應用場景設計。它能夠從任意網頁中智慧擷取文章正文、連結、圖片、元資料等內容，並以 JSON、Markdown、CSV、純文字等多種格式輸出。

靈感來源：在AI程式設計助手日益普及的今天，開發者頻繁需要從網頁獲取結構化資訊。然而現有工具要麼依賴重量級瀏覽器引擎，要麼需要安裝大量第三方函式庫。ContentLens-CLI 誕生於這一痛點——零外部依賴，純Python標準函式庫實現，開箱即用。

核心差異化亮點：

🚫 零依賴：僅使用Python標準函式庫，無需安裝任何第三方套件
🧠 智慧擷取：基於文字密度+標籤權重評分演算法，自動識別正文區域
🎯 多模式擷取：支援文章正文、連結列表、圖片列表、元資料、全量結構化5種模式
📄 多格式輸出：JSON、Markdown、CSV、純文字四種輸出格式
💾 本地快取：SHA-256雜湊映射 + TTL過期機制，避免重複請求
📦 批次處理：支援從檔案讀取URL列表，帶進度條和速率限制
🌍 跨平台：Windows、macOS、Linux 全平台相容

✨ 核心特性

🧠 智慧正文擷取：基於文字密度分析和HTML標籤權重評分，自動識別並擷取文章正文區域，有效過濾導覽列、廣告、評論區等雜訊內容
🔗 連結擷取：智慧擷取頁面所有連結，自動區分內部連結與外部連結，支援按類型過濾
🖼️ 圖片擷取：擷取頁面所有圖片URL，支援按最小尺寸過濾，自動識別圖片類型
📋 元資料擷取：全面擷取頁面元資料，包括title、description、keywords、Open Graph標籤、Twitter Card等
🔄 本地快取引擎：基於SHA-256雜湊的檔案快取系統，支援TTL過期策略
📊 批次處理：支援從文字檔案批次讀取URL列表，內建速率限制和進度顯示
🎨 美觀CLI介面：ASCII藝術Banner、ANSI彩色輸出、對齊表格、進度條

🚀 快速開始

環境要求：

Python 3.8 或更高版本
無需安裝任何第三方依賴

安裝：

# 從原始碼安裝
git clone https://github.com/gitstq/ContentLens-CLI.git
cd ContentLens-CLI
pip install -e .

# 或直接使用（無需安裝）
cd ContentLens-CLI
python -m contentlens.cli --help

基本使用：

# 擷取文章正文（JSON格式）
contentlens extract https://example.com --type article --format json

# 擷取文章正文（Markdown格式）
contentlens extract https://example.com --type article --format markdown

# 擷取頁面所有連結
contentlens extract https://example.com --type links --format json

# 擷取頁面所有圖片
contentlens extract https://example.com --type images --format csv

# 擷取頁面元資料
contentlens extract https://example.com --type metadata --format markdown

# 全量結構化擷取
contentlens extract https://example.com --type full --format json

# 儲存到檔案
contentlens extract https://example.com --type article --format markdown --output article.md

# 批次處理
contentlens batch urls.txt --type article --format json --output ./results

# 查看版本
contentlens version

📖 詳細使用指南

Python API 使用：

from contentlens import WebExtractor

# 建立擷取器實例
extractor = WebExtractor()

# 一站式URL擷取
result = extractor.extract_url("https://example.com")
print(result)

# 自訂擷取
html = extractor.fetch("https://example.com", timeout=30)
article = extractor.extract(html, extract_type="article")
print(article["title"])
print(article["content"])

輸出格式說明：

格式	說明	適用場景
`json`	結構化JSON資料	API整合、資料處理
`markdown`	Markdown格式文件	文件生成、筆記
`csv`	CSV表格資料	資料分析、Excel匯入
`text`	純文字	快速預覽、簡單場景

💡 設計思路與迭代規劃

設計理念：

極簡主義：零外部依賴，純標準函式庫實現，降低使用門檻
智慧優先：基於演算法的內容擷取，而非簡單的正規表示式匹配
開發者友善：清晰的CLI介面 + 完整的Python API

後續迭代規劃：

新增JavaScript渲染支援（可選依賴模式）
支援自訂擷取規則設定檔
新增PDF內容擷取支援
新增MCP Server整合

📦 打包與部署指南

# 安裝為可執行命令
pip install -e .

# 安裝後即可在任意目錄使用 contentlens 命令

# 解安裝
pip uninstall contentlens-cli

相容環境：Python 3.8+ / Windows / macOS / Linux

🤝 貢獻指南

歡迎社群貢獻！請遵循 Angular 提交格式：

feat: 新增功能
fix: 修復問題
docs: 文件更新
refactor: 程式碼重構

📄 開源協議

本專案基於 MIT License 開源。

English

🎉 Introduction

ContentLens-CLI is a lightweight command-line engine for extracting and structuring web content, designed for developers, data analysts, and AI applications. It extracts article text, links, images, metadata, and more from web pages and outputs the results as JSON, Markdown, CSV, or plain text.

Inspiration: As AI coding assistants become ubiquitous, developers frequently need structured information from web pages. Existing tools either depend on heavyweight browser automation (such as Playwright) or require numerous third-party libraries. ContentLens-CLI was built to address this pain point: it has zero external dependencies, is implemented purely with the Python standard library, and works out of the box.

Key Differentiators:

🚫 Zero Dependencies: Uses only the Python standard library; no third-party packages required
🧠 Smart Extraction: A text-density and tag-weight scoring algorithm identifies the main content area automatically
🎯 Multi-Mode Extraction: 5 modes (article, links, images, metadata, full structured)
📄 Multi-Format Output: JSON, Markdown, CSV, plain text
💾 Local Caching: SHA-256 hash mapping with TTL expiration, to avoid repeated requests
📦 Batch Processing: Reads a URL list from a file, with a progress bar and rate limiting
🌍 Cross-Platform: Windows, macOS, Linux compatible
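
The caching idea listed above (a SHA-256 hash of the URL mapped to a file, plus a TTL check) can be sketched as follows. This is an illustration under stated assumptions, not the project's actual cache module: the class name `FileCache`, the JSON entry layout, and the one-hour default TTL are all hypothetical.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


class FileCache:
    """Hypothetical cache: one JSON file per URL, named by the URL's SHA-256 digest."""

    def __init__(self, cache_dir: Path, ttl_seconds: float = 3600.0) -> None:
        self.cache_dir = Path(cache_dir)
        self.ttl_seconds = ttl_seconds  # assumed default of one hour
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, url: str) -> Path:
        # Hashing gives a fixed-length, filesystem-safe name for any URL.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get(self, url: str, now: Optional[float] = None) -> Optional[str]:
        path = self._path_for(url)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        now = time.time() if now is None else now
        if now - entry["stored_at"] > self.ttl_seconds:
            path.unlink()  # expired: remove the stale entry
            return None
        return entry["body"]

    def set(self, url: str, body: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        entry = {"url": url, "stored_at": now, "body": body}
        self._path_for(url).write_text(json.dumps(entry), encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = FileCache(Path(tmp), ttl_seconds=60)
        cache.set("https://example.com", "<html>hello</html>", now=1000.0)
        print(cache.get("https://example.com", now=1030.0))  # still within the TTL
        print(cache.get("https://example.com", now=1100.0))  # past the TTL
```

Passing `now` explicitly keeps the expiry logic testable without waiting for real time to pass.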

✨ Core Features

🧠 Smart Article Extraction: Combines text density analysis with HTML tag weight scoring to identify and extract the article body, filtering out navigation bars, ads, comment sections, and other noise
🔗 Link Extraction: Extracts all links on the page, classifies them as internal or external, and supports filtering by type
🖼️ Image Extraction: Extracts all image URLs, supports filtering by minimum size, and detects image types
📋 Metadata Extraction: Extracts page metadata, including title, description, keywords, Open Graph tags, and Twitter Cards
🔄 Local Cache Engine: SHA-256 hash-based file caching with TTL expiration policy
📊 Batch Processing: Read URL lists from text files with built-in rate limiting and progress display
🎨 Beautiful CLI: ASCII art banner, ANSI colored output, aligned tables, progress bar
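
To make the "text density + tag weight" idea more concrete, here is a minimal sketch built on the standard library's `html.parser`. It is not the project's actual algorithm: the weights in `TAG_WEIGHTS`, the `NOISE_TAGS` set, and the scoring formula (text length, discounted by link density, multiplied by a tag weight) are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from typing import List

# Assumed weights and noise tags, chosen for illustration only.
TAG_WEIGHTS = {"article": 2.0, "main": 1.5, "section": 1.2, "div": 1.0}
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class Candidate:
    """A container element that might hold the main content."""

    def __init__(self, tag: str) -> None:
        self.tag = tag
        self.parts: List[str] = []
        self.text_len = 0  # characters of visible text
        self.link_len = 0  # characters of that text inside <a> elements

    def score(self) -> float:
        if self.text_len == 0:
            return 0.0
        link_density = self.link_len / self.text_len
        return self.text_len * (1.0 - link_density) * TAG_WEIGHTS[self.tag]


class DensityScorer(HTMLParser):
    """Credits each text node to its innermost open container.

    Assumes reasonably well-nested HTML; real pages need more care.
    """

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.stack: List[Candidate] = []
        self.done: List[Candidate] = []
        self.noise_depth = 0
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in TAG_WEIGHTS:
            self.stack.append(Candidate(tag))

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.noise_depth = max(0, self.noise_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in TAG_WEIGHTS and self.stack:
            self.done.append(self.stack.pop())

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self.noise_depth or not self.stack:
            return
        current = self.stack[-1]
        current.parts.append(text)
        current.text_len += len(text)
        if self.link_depth:
            current.link_len += len(text)


def extract_main_text(html: str) -> str:
    """Return the text of the highest-scoring container, or '' if none qualifies."""
    scorer = DensityScorer()
    scorer.feed(html)
    scorer.close()
    candidates = scorer.done + scorer.stack
    best = max(candidates, key=Candidate.score, default=None)
    return " ".join(best.parts) if best and best.score() > 0 else ""
```

In this sketch a sidebar made only of links scores zero, because its link density is 1.0, while an `<article>` with long paragraphs scores highest.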

🚀 Quick Start

Requirements:

Python 3.8+
No third-party dependencies required

Installation:

# Install from source
git clone https://github.com/gitstq/ContentLens-CLI.git
cd ContentLens-CLI
pip install -e .

# Or use directly (no installation needed)
cd ContentLens-CLI
python -m contentlens.cli --help

Basic Usage:

# Extract article content (JSON format)
contentlens extract https://example.com --type article --format json

# Extract article content (Markdown format)
contentlens extract https://example.com --type article --format markdown

# Extract all links
contentlens extract https://example.com --type links --format json

# Extract all images
contentlens extract https://example.com --type images --format csv

# Extract page metadata
contentlens extract https://example.com --type metadata --format markdown

# Full structured extraction
contentlens extract https://example.com --type full --format json

# Save to file
contentlens extract https://example.com --type article --format markdown --output article.md

# Batch processing
contentlens batch urls.txt --type article --format json --output ./results

# View version
contentlens version
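
The batch command reads URLs from a text file and applies rate limiting with progress output. The sketch below shows one way such a loop could work. The file format (one URL per line, with `#` comments skipped), the function names, and the one-second default interval are assumptions for illustration; they are not documented behavior of `contentlens batch`.

```python
import time
from typing import Callable, Dict, Iterable, List


def read_url_list(lines: Iterable[str]) -> List[str]:
    """Assumed format: one URL per line; blank lines and '#' comments are skipped."""
    urls = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def run_batch(
    urls: List[str],
    worker: Callable[[str], str],
    min_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Call worker(url) for each URL, starting requests at least min_interval seconds apart."""
    results: Dict[str, str] = {}
    last_start = None
    for index, url in enumerate(urls, start=1):
        if last_start is not None:
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)  # simple rate limit between consecutive requests
        last_start = clock()
        try:
            results[url] = worker(url)
        except Exception as exc:  # record the failure and keep going
            results[url] = f"error: {exc}"
        print(f"[{index}/{len(urls)}] {url}")  # plain-text progress line
    return results
```

Injecting `sleep` and `clock` as parameters lets the rate limiting be exercised in tests without real delays.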

📖 Detailed Usage Guide

Python API:

from contentlens import WebExtractor

# Create extractor instance
extractor = WebExtractor()

# One-stop URL extraction
result = extractor.extract_url("https://example.com")
print(result)

# Custom extraction
html = extractor.fetch("https://example.com", timeout=30)
article = extractor.extract(html, extract_type="article")
print(article["title"])
print(article["content"])

# Extract links
links = extractor.extract(html, extract_type="links")
for link in links["links"]:
    print(f"{link['text']}: {link['url']}")
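
The link records carry an internal/external classification. A minimal sketch of how such a classification could be computed with `urllib.parse` follows. The rule used here (exact hostname match against the page URL) and the function name `classify_link` are assumptions; the project may use a different rule, for example one that treats subdomains as internal.

```python
from urllib.parse import urljoin, urlsplit


def classify_link(page_url: str, href: str) -> dict:
    """Resolve href against page_url and label it internal or external.

    Assumption: a link is internal only when its hostname equals the page's hostname.
    """
    absolute = urljoin(page_url, href)  # handles relative and protocol-relative links
    page_host = (urlsplit(page_url).hostname or "").lower()
    link_host = (urlsplit(absolute).hostname or "").lower()
    is_internal = link_host == page_host
    return {
        "url": absolute,
        "type": "internal" if is_internal else "external",
        "is_internal": is_internal,
    }


if __name__ == "__main__":
    base = "https://example.com/blog/"
    for href in ["/about", "post-1", "https://other.org/x", "//cdn.example.net/lib.js"]:
        print(classify_link(base, href))
```

Links without a hostname, such as `mailto:` addresses, end up labeled external under this rule.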

Output Formats:

Format	Description	Use Case
`json`	Structured JSON data	API integration, data processing
`markdown`	Markdown document	Documentation, notes
`csv`	CSV tabular data	Data analysis, Excel import
`text`	Plain text	Quick preview, simple scenarios
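
As a rough illustration of how one set of extracted records could be rendered in each of the four formats, the sketch below serializes a list of link records using only `json`, `csv`, and `io`. The function name `render` and the exact Markdown and text layouts are assumptions, not the tool's actual output.

```python
import csv
import io
import json
from typing import Dict, List


def render(records: List[Dict[str, str]], fmt: str) -> str:
    """Render link records (each with 'text' and 'url') in the requested format."""
    if fmt == "json":
        return json.dumps(records, ensure_ascii=False, indent=2)
    if fmt == "csv":
        if not records:
            return ""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()), lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)  # the csv module quotes values that contain commas
        return buffer.getvalue()
    if fmt == "markdown":
        return "\n".join(f"- [{r['text']}]({r['url']})" for r in records)
    if fmt == "text":
        return "\n".join(f"{r['text']}: {r['url']}" for r in records)
    raise ValueError(f"unsupported format: {fmt}")


if __name__ == "__main__":
    links = [{"text": "Home", "url": "https://example.com/"}]
    for fmt in ("json", "markdown", "csv", "text"):
        print(render(links, fmt))
```

Using `csv.DictWriter` rather than joining strings by hand avoids broken rows when a link text contains commas or quotes.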

Extraction Modes:

Mode	Description	Output Fields
`article`	Article body extraction	title, author, publish_date, content, word_count, reading_time
`links`	Link list extraction	url, text, type (internal/external), is_internal
`images`	Image list extraction	url, alt, width, height, srcset
`metadata`	Page metadata extraction	title, description, keywords, Open Graph (og:) and Twitter Card (twitter:) tags
`full`	Full structured extraction	Combination of all above
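
The `article` mode lists `word_count` and `reading_time` among its fields, but how they are computed is not documented here. The sketch below shows one plausible approach. The 200 words-per-minute rate, minutes as the unit, and the choice to count each CJK character as one word are all assumptions.

```python
import math
import re

# Assumption: CJK characters are counted one per character, since such text has no spaces.
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")
WORD = re.compile(r"\w+")


def count_words(text: str) -> int:
    """Count space-delimited words plus individual CJK characters."""
    cjk_count = len(CJK_CHAR.findall(text))
    remainder = CJK_CHAR.sub(" ", text)
    return cjk_count + len(WORD.findall(remainder))


def article_stats(content: str, words_per_minute: int = 200) -> dict:
    """Return word_count and reading_time (in minutes, rounded up)."""
    word_count = count_words(content)
    reading_time = math.ceil(word_count / words_per_minute) if word_count else 0
    return {"word_count": word_count, "reading_time": reading_time}


if __name__ == "__main__":
    print(article_stats("Hello world, this is ContentLens"))
```

Rounding up means any non-empty article reports a reading time of at least one minute.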

💡 Design Philosophy & Roadmap

Design Principles:

Minimalism: Zero external dependencies, pure standard library implementation
Intelligence First: Algorithm-based content extraction over simple regex matching
Developer Friendly: Clean CLI interface + complete Python API
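
As an illustration of the standard-library, parser-over-regex approach, the sketch below collects page metadata with `html.parser` (which the project says it uses in place of BeautifulSoup). The class name, the returned dictionary layout, and the set of recognized keys are assumptions rather than the project's actual implementation.

```python
from html.parser import HTMLParser
from typing import Dict, List


class MetadataParser(HTMLParser):
    """Collects <title> plus description, keywords, og:* and twitter:* meta tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.metadata: Dict[str, str] = {}
        self._in_title = False
        self._title_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = {name: (value or "") for name, value in attrs}
            # Open Graph uses the property attribute; most other tags use name.
            key = (attr.get("property") or attr.get("name") or "").lower()
            content = attr.get("content", "").strip()
            wanted = key in {"description", "keywords"} or key.startswith(("og:", "twitter:"))
            if content and wanted:
                self.metadata.setdefault(key, content)  # keep the first occurrence

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            title = " ".join("".join(self._title_parts).split())
            if title:
                self.metadata.setdefault("title", title)

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html: str) -> Dict[str, str]:
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata
```

A real parser tolerates attribute order, quoting styles, and self-closing tags, all of which tend to trip up regex-based extraction.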

Roadmap:

JavaScript rendering support (optional dependency mode)
Custom extraction rule configuration files
PDF content extraction support
HTML output format
MCP Server integration

📦 Installation & Deployment

# Install as executable command
pip install -e .

# After installation, the contentlens command is available from any directory

# Uninstall
pip uninstall contentlens-cli

Compatible Environments: Python 3.8+ / Windows / macOS / Linux

🤝 Contributing

Contributions are welcome! Please follow the Angular commit convention:

feat: New feature
fix: Bug fix
docs: Documentation update
refactor: Code refactoring

📄 License

This project is licensed under the MIT License.

日本語

🎉 プロジェクト紹介

ContentLens-CLIは、開発者、データアナリスト、AIアプリケーションシーンのために設計された軽量端末インテリジェントWebコンテンツ抽出・構造化エンジンです。任意のWebページから記事本文、リンク、画像、メタデータなどをインテリジェントに抽出し、JSON、Markdown、CSV、プレーンテキストなどの複数の形式で出力します。

主な特徴：

🚫 ゼロ依存: Python標準ライブラリのみ使用
🧠 スマート抽出: テキスト密度+タグ重みスコアリングアルゴリズム
🎯 マルチモード: 記事、リンク、画像、メタデータ、全構造化の5つのモード
📄 マルチフォーマット: JSON、Markdown、CSV、プレーンテキスト
🌍 クロスプラットフォーム: Windows、macOS、Linux対応

🚀 クイックスタート

# インストール
git clone https://github.com/gitstq/ContentLens-CLI.git
cd ContentLens-CLI
pip install -e .

# 基本使用
contentlens extract https://example.com --type article --format json
contentlens extract https://example.com --type links --format markdown
contentlens batch urls.txt --type article --format json --output ./results
contentlens version

📄 ライセンス

MIT License

Español

🎉 Introducción

ContentLens-CLI es un motor ligero de extracción y estructuración inteligente de contenido web para terminal, diseñado para desarrolladores, analistas de datos y escenarios de aplicaciones de IA. Extrae inteligentemente contenido de artículos, enlaces, imágenes, metadatos y más de cualquier página web, con salida en múltiples formatos incluyendo JSON, Markdown, CSV y texto plano.

Características principales:

🚫 Cero dependencias: Solo biblioteca estándar de Python
🧠 Extracción inteligente: Algoritmo de densidad de texto + puntuación de peso de etiquetas
🎯 Multi-modo: 5 modos — artículo, enlaces, imágenes, metadatos, estructurado completo
📄 Multi-formato: JSON, Markdown, CSV, texto plano
🌍 Multiplataforma: Windows, macOS, Linux

🚀 Inicio Rápido

# Instalación
git clone https://github.com/gitstq/ContentLens-CLI.git
cd ContentLens-CLI
pip install -e .

# Uso básico
contentlens extract https://example.com --type article --format json
contentlens extract https://example.com --type links --format markdown
contentlens batch urls.txt --type article --format json --output ./results
contentlens version

📄 Licencia

MIT License

Made with ❤️ by ContentLens Team
Zero Dependencies · Pure Python · Open Source
