โ† All skills
Tencent SkillHub ยท Developer Tools

Arxiv Search Collector

Model-guided arXiv paper collection workflow that plans queries, fetches metadata, filters relevance, and merges deduplicated results by language.

skill openclawclawhub Free
0 Downloads
0 Stars
0 Installs
0 Score
High Signal

Model-guided arXiv paper collection workflow that plans queries, fetches metadata, filters relevance, and merges deduplicated results by language.

โฌ‡ 0 downloads โ˜… 0 stars Unverified but indexed

Install for OpenClaw

Quick setup
  1. Download the package from Yavira.
  2. Extract the archive and review SKILL.md first.
  3. Import or place the package into your OpenClaw setup.

Requirements

Target platform
OpenClaw
Install method
Manual import
Extraction
Extract archive
Prerequisites
OpenClaw
Primary doc
SKILL.md

Package facts

Download mode
Yavira redirect
Package format
ZIP package
Source platform
Tencent SkillHub
What's included
SKILL.md, scripts/fetch_queries_batch.py, scripts/fetch_query_metadata.py, scripts/init_collection_run.py, scripts/merge_selected_papers.py, references/io-contract.md

Validation

  • Use the Yavira download entry.
  • Review SKILL.md after the package is downloaded.
  • Confirm the extracted package contains the expected setup assets.

Install with your agent

Agent handoff

Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.

  1. Download the package from Yavira.
  2. Extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the extracted folder.
New install

I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.

Upgrade existing

I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.

Trust & source

Release facts

Source
Tencent SkillHub
Verification
Indexed source record
Version
0.1.1

Documentation

ClawHub primary doc Primary doc: SKILL.md 12 sections Open source page

ArXiv Search Collector

Use this skill when you want model-led query planning and model-led relevance filtering.

Core Principle

Scripts are tools. The model performs the reasoning and decisions: Expand the original topic into multiple focused queries. Run one fetch command per query. Read each query result list and decide keep indexes. Merge kept items and dedupe with one script.

Step 1: Initialize Run

python3 scripts/init_collection_run.py \ --output-root /path/to/data \ --topic "LLM applications in Lean 4 formalization" \ --keywords "Lean 4,LLM,formalization" \ --categories "cs.AI,cs.LO" \ --target-range 5-10 \ --lookback 30d \ --language English This creates a run directory with task_meta.json, task_meta.md, query_results/, and query_selection/.

Language Parameter

--language must be set manually for each collection run. Use the same language value across all collector scripts for consistency. If --language is non-English (for example Chinese), generated markdown files are written in that language: task_meta.md query_results/<label>.md <arxiv_id>/metadata.md papers_index.md

Query Writing Requirements

Follow these rules before running per-query fetch: Determine query count from final target range. Prefer 3 queries for small/medium targets (2-5, 5-10). Prefer 4 queries for larger targets (10-50 or above). Avoid writing too many low-quality queries. Allocate target budget to each query, then oversample. Let target_max be the upper bound in target range. Compute target_per_query = ceil(target_max / query_count). Fetch each query with max_results = target_per_query * 2 (or * 3 when recall is more important). Example: target 5-10, query count 3 -> target_per_query=4 -> each query fetches 8-12. Keep one original-theme query, then add normalized/synonym expansions. Query 1 keeps original topic wording. Remaining queries use normalized terms and close synonyms. Prefer concise noun phrases that match arXiv indexing behavior. Use OR inside the same semantic group (synonyms), and AND across groups. Same-group synonyms should be connected with OR to increase recall. Example group A (model terms): LLM OR "large language model" OR AI. Example group B (Lean terms): "Lean 4" OR Lean OR "formal language". Different semantic groups should be connected with AND to keep relevance. Example: (LLM-group) AND (Lean-group). Recommended pattern: (<domain terms with OR>) AND (<method/model terms with OR>) [AND <optional constraint terms>]

Query Examples (arXiv API-ready)

Theme A: LLM applications in Lean 4 formalization all:"LLM applications in Lean 4 formalization" (all:"Lean 4" OR all:"Lean" OR all:"formal language") AND (all:"LLM" OR all:"large language model" OR all:"AI") (all:"Lean" OR all:"formalization") AND (all:"LLM" OR all:"large language model") AND all:"theorem proving" (all:"Lean" OR all:"proof assistant") AND (all:"AI" OR all:"LLM") Theme B: agentic tool use for code generation all:"agentic tool use code generation" (all:"agentic" OR all:"autonomous agent") AND (all:"LLM" OR all:"large language model") (all:"tool use" OR all:"function calling") AND (all:"coding assistant" OR all:"code generation") Theme C: multimodal reasoning with retrieval all:"multimodal reasoning retrieval" (all:"multimodal" OR all:"vision language") AND (all:"retrieval" OR all:"RAG") (all:"multimodal model" OR all:"vision language model") AND (all:"reasoning" OR all:"tool use")

Step 2: Fetch One Query at a Time

Model defines queries manually, for example: all:"Lean 4" all:"LLM formalization" all:"AI formal verification" Recommended batch mode (safe defaults, serial execution): python3 scripts/fetch_queries_batch.py \ --run-dir /path/to/run-dir \ --plan-json /path/to/query_plan.json In batch mode, the script auto-applies: serial API calls --min-interval-sec 5 --retry-max 4 --retry-base-sec 5 --retry-max-sec 120 --retry-jitter-sec 1 per-run rate-state file (<run_dir>/.runtime/arxiv_api_state.json) for throttling auto max_results from target_range and query count (default oversample x2, cap 60) default language/categories from task_meta.json Minimal query_plan.json only needs label and query. See references/query-plan-format.md. You normally do not need to set fetch-control args manually. If you need one-by-one manual fetch, run each query: python3 scripts/fetch_query_metadata.py \ --run-dir /path/to/run-dir \ --label lean4 \ --query 'all:"Lean 4"' \ --max-results 30 \ --min-interval-sec 5 \ --retry-max 4 \ --language English Output files: query_results/<label>.json (indexed full metadata list) query_results/<label>.md (human-readable preview) Date range is applied directly in arXiv API search_query via submittedDate:[... TO ...]. No second local date-filter pass is performed. Rate-limit controls in fetch_query_metadata.py: --min-interval-sec (default 5.0) --retry-max (default 4) --retry-base-sec (default 5.0) --retry-max-sec (default 120.0) --retry-jitter-sec (default 1.0) --rate-state-path (optional override; default is <run_dir>/.runtime/arxiv_api_state.json) --force to bypass cache and re-fetch

Step 3: Model Filters Relevance

For each query list, the model reads indexed results and decides what to keep. Use keep specs by index and/or arXiv ID when merging. To explicitly drop one weak query in later iterations, set that label to an empty keep list in selection-json.

Step 4: Merge and Dedupe

python3 scripts/merge_selected_papers.py \ --run-dir /path/to/run-dir \ --keep lean4:0,2,4 \ --keep llm-formalization:1,3 \ --language English or with selection-json: { "lean4-round1": [0, 2, 4], "lean4-round2": [], "formalization-round2": [1, 3, 5] } An empty list means this query label is intentionally dropped (keep 0). This writes final outputs: <arxiv_id>/metadata.json <arxiv_id>/metadata.md papers_index.json papers_index.md

Step 5: Iterative Retry Loop (Incremental)

If relevance is weak or final count is insufficient after Step 4, iterate: Review papers_index.md and per-paper metadata quality. Adjust query plan (usually broaden with additional synonym OR terms, keep cross-group AND constraints). Fetch additional query results with new labels. Re-run merge in incremental mode: python3 scripts/merge_selected_papers.py \ --run-dir /path/to/run-dir \ --incremental \ --selection-json /path/to/updated_selection.json \ --language English Incremental behavior: Previous label selections are loaded from query_selection/selected_by_query.json. Labels provided in the new selection-json override previous selections for those labels. New labels can be added. Old labels can be dropped by setting []. Stop retrying when: relevance is acceptable, or additional broadened queries mainly add low-relevance papers. If relevant papers are genuinely scarce, it is valid to finish below the original minimum target range.

Notes

Keep API concurrency conservative by controlling query count and --max-results. Keep per-query fetch serial (no parallel API calls in Stage A). Reuse cache by default for identical query/date/request settings; only use --force when necessary. Prefer default run-local rate-state so all steps in the same run share one cooldown/throttling state. If arXiv API returns 429 Too Many Requests, retry later and/or increase --min-interval-sec. Prefer explicit, narrow queries and let the model filter aggressively. Use references/io-contract.md for exact files and schema.

Related Skills

This skill is a sub-skill of arxiv-summarizer-orchestrator. Pipeline position: Step 1 (collection): arxiv-search-collector (this skill) Step 2 (per-paper processing): arxiv-paper-processor Step 3 (batch reporting): arxiv-batch-reporter This skill produces the initial paper-set structure and metadata that Stage B and Stage C depend on.

Category context

Code helpers, APIs, CLIs, browser automation, testing, and developer operations.

Source: Tencent SkillHub

Largest current source with strong distribution and engagement signals.

Package contents

Included in package
4 Scripts2 Docs
  • SKILL.md Primary doc
  • references/io-contract.md Docs
  • scripts/fetch_queries_batch.py Scripts
  • scripts/fetch_query_metadata.py Scripts
  • scripts/init_collection_run.py Scripts
  • scripts/merge_selected_papers.py Scripts