# Databank (Abstract Skeleton)

## Policy: scripts are executable, examples are stubs

- All runnable entry points live under `scripts/` and can be executed directly with `python`.
- Files under `src/databank/examples/` are stubs for guidance/docs only and will print instructions.
- Prefer running demos via scripts:
  - `python scripts/seed_leagues_mongo.py`
  - `python scripts/seed_seasons_mongo.py`
  - `python scripts/test_get_league_match_list.py`

This repository is a pure abstract skeleton intended to define stable contracts for a multi-spider data pipeline. It deliberately contains only abstract/base classes and core models, without any concrete implementations.

Key modules (all abstract-only):

- spiders: `BaseSpider` with clear lifecycle hooks and advisory attributes.
- db: `BaseDB` defining minimal persistence operations and hooks.
- reporter: `BaseReporter` defining reporting lifecycle.
- scheduler: `RunnerBase` and `SchedulerBase` for coordination/scheduling.
- analytics: `AnalyticsBase` for generic analytics pipelines.

Guidelines to extend (no code here, only how-to):

- Implementations MUST live in your own packages/modules and import these bases.
- Do NOT modify the base interfaces unless you intend a breaking change.
- Prefer composition and dependency injection over hard-coding dependencies.

Implementing a spider (outline only):

1. Subclass `BaseSpider` and implement:
   - `build_payload(url) -> Payload`
   - `fetch(url, payload) -> str`
   - `parse(url, content, payload) -> Documents`
2. Optionally override hooks:
   - `on_run_start/on_run_end`, `should_fetch`, `before_fetch/after_fetch`, `transform`, `handle_error`, `close`.
3. Optionally honor advisory attributes like `max_retries`, `request_timeout_s`.

Implementing a DB backend (outline only):

1. Subclass `BaseDB` and implement: `connect`, `ensure_indexes`, `insert_many`, `close`.
2. Optionally override hooks: `on_connect/on_close`, `before_insert/after_insert`.

Implementing a reporter (outline only):

1. Subclass `BaseReporter` and implement: `notify_start`, `notify_success`, `notify_error`, `notify_summary`.
2. Optionally override `on_session_start/on_session_end`.

Implementing a runner/scheduler (outline only):

1. Subclass `RunnerBase` to coordinate spiders -> DB -> reporters.
2. Subclass `SchedulerBase` to install/trigger schedules (e.g., via systemd/cron in your own code).

Implementing analytics (outline only):

1. Subclass `AnalyticsBase` and implement `compute(data, **kwargs)`.
2. Optional staged hooks: `prepare`, `validate`, `transform`, `finalize`.

## Operations: systemd templates

- See `ops/systemd/databank.service` and `ops/systemd/databank.timer`.
- Customize `User`, `WorkingDirectory`, and `ExecStart` for your environment.

## Optional linting (no deps enforced)

- A minimal Pylint config is included in `pyproject.toml` under `[tool.pylint.*]`.
- You can run Pylint in your environment if desired, for example:
  - `pylint src/databank` (assuming Pylint is installed in your environment)
- The config disables ABC-related false positives while keeping docstring checks.

## Optional typing and linting (ruff/mypy)

- Minimal configs for Ruff and mypy are also included in `pyproject.toml`.
- If you have them installed locally, example commands:
  - `ruff check src/databank`
  - `mypy src/databank`
- Both are optional and will not run unless you invoke them.

## License

- See `LICENSE` for details.
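
## Example: minimal spider sketch

The spider outline earlier in this README names three required methods (`build_payload`, `fetch`, `parse`). The sketch below shows how a subclass might fill them in. It is illustrative only: `BaseSpider` here is a local stand-in so the snippet runs on its own, the `Payload` and `Documents` aliases are assumed shapes, and `InMemorySpider` and `run_once` are hypothetical names that do not exist in this repository. In real code you would import the base class from `databank` and follow its actual signatures.

```python
from abc import ABC, abstractmethod
from typing import Any

# Assumed shapes; the real `Payload` and `Documents` types are defined by databank.
Payload = dict[str, Any]
Documents = list[dict[str, Any]]


class BaseSpider(ABC):
    """Local stand-in for databank's BaseSpider (three required methods only)."""

    max_retries: int = 3  # advisory attribute from the README; this default is assumed

    @abstractmethod
    def build_payload(self, url: str) -> Payload: ...

    @abstractmethod
    def fetch(self, url: str, payload: Payload) -> str: ...

    @abstractmethod
    def parse(self, url: str, content: str, payload: Payload) -> Documents: ...


class InMemorySpider(BaseSpider):
    """Hypothetical spider that reads pages from a dict instead of the network."""

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    def build_payload(self, url: str) -> Payload:
        return {"url": url}

    def fetch(self, url: str, payload: Payload) -> str:
        return self._pages[url]

    def parse(self, url: str, content: str, payload: Payload) -> Documents:
        # Emit one document per non-empty line of the fetched content.
        return [
            {"source": url, "line": line}
            for line in content.splitlines()
            if line.strip()
        ]


def run_once(spider: BaseSpider, url: str) -> Documents:
    """Hypothetical driver: build the payload, fetch, then parse."""
    payload = spider.build_payload(url)
    content = spider.fetch(url, payload)
    return spider.parse(url, content, payload)


spider = InMemorySpider({"mem://demo": "alpha\n\nbeta\n"})
docs = run_once(spider, "mem://demo")
print(docs)
```

Injecting the page source through the constructor keeps the spider testable without network access, in line with the guideline to prefer dependency injection over hard-coded dependencies.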
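## Abstract-safe initialization helpers

- `databank.config.settings`:
  - `DBSettings`: generic database settings container (not tied to any specific backend).
  - `load_db_settings(prefix="DATABANK_DB_")`: reads settings from environment variables (e.g. `DATABANK_DB_URI`, `DATABANK_DB_NAME`).
  - `settings_to_options(settings)`: converts a `DBSettings` into the dict expected by the generic `configure(**options)`.
  - `merge_options(base, extra)`: merges two option dicts (the right-hand side wins).
- `databank.bootstrap.db`:
  - `DBBootstrapOptions`: bootstrap options (whether to `connect` and `ensure_indexes`, plus `configure_options`).
  - `bootstrap_db(db, options)`: calls `configure` → `connect` → `ensure_indexes` through the abstract interface.
  - `db_session(db, options)`: context manager that yields the connected DB and closes it safely on exit.

Example (orchestration only; no concrete backend implementation):

```python
from databank import config, bootstrap

# Assume you have a custom DB implementation `MyDB(BaseDB)`; shown for illustration only.
from mypkg.db import MyDB  # your implementation, not part of this repository

settings = config.load_db_settings()
options = config.settings_to_options(settings)

db = MyDB()
boot = bootstrap.DBBootstrapOptions(
    configure_options=options,
    connect=True,
    ensure_indexes=True,
)
with bootstrap.db_session(db, boot) as conn:
    # Use abstract methods such as conn.insert_many([...]) here.
    pass
```

The orchestration layer above pulls in no concrete driver or backend. It depends only on the `BaseDB` contract, which makes it easy to reuse in your own implementations.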
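
To make the `configure` → `connect` → `ensure_indexes` → close order concrete, here is a self-contained sketch of a session helper in the spirit of `db_session`. It is not the code in `databank.bootstrap.db`: `BootOptions`, `RecordingDB`, and `session` are hypothetical stand-ins, and details such as whether `configure` is called with empty options are assumptions.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class BootOptions:
    """Hypothetical stand-in for DBBootstrapOptions."""

    configure_options: dict[str, Any] = field(default_factory=dict)
    connect: bool = True
    ensure_indexes: bool = True


class RecordingDB:
    """Fake backend that records which lifecycle methods were called."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def configure(self, **options: Any) -> None:
        self.calls.append("configure")

    def connect(self) -> None:
        self.calls.append("connect")

    def ensure_indexes(self) -> None:
        self.calls.append("ensure_indexes")

    def insert_many(self, docs: list[dict[str, Any]]) -> int:
        self.calls.append("insert_many")
        return len(docs)

    def close(self) -> None:
        self.calls.append("close")


@contextmanager
def session(db: RecordingDB, options: BootOptions) -> Iterator[RecordingDB]:
    """Assumed order: configure, then connect, then ensure_indexes; always close."""
    db.configure(**options.configure_options)
    if options.connect:
        db.connect()
    if options.ensure_indexes:
        db.ensure_indexes()
    try:
        yield db
    finally:
        db.close()  # runs even if the body of the `with` block raises


db = RecordingDB()
with session(db, BootOptions(configure_options={"uri": "memory://"})) as conn:
    inserted = conn.insert_many([{"id": 1}, {"id": 2}])

print(db.calls)
# ['configure', 'connect', 'ensure_indexes', 'insert_many', 'close']
```

Because the helper only calls methods named in the `BaseDB` contract, the same pattern should apply to any backend you implement, and a recording fake like this one can double as a test fixture.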