Build a research pipeline that finds and synthesizes information on a schedule

This guide builds a research pipeline that finds, reads, and synthesizes information from the web on a schedule. You define the topics. Your agent does the searching, filters for quality, and delivers a structured summary to your inbox or Telegram. No subscriptions. No third-party services. The research runs on your server, on your schedule, at your cost.

TL;DR

A research pipeline is a cron job that searches for new information on your chosen topics, evaluates what it finds, and produces a structured report. You get fresh research delivered on a schedule without opening a browser. Setup takes about 90 minutes. Running cost: under $0.05 per day for most use cases at March 2026 pricing.

What a research pipeline actually does

The phrase gets used loosely. This guide is specific: a research pipeline is a scheduled process that performs web searches on defined topics, reads the content at the result URLs, filters for relevance and quality, synthesizes what it finds into a structured output, and delivers that output to you.

It does not hallucinate sources. It does not summarize from memory. It reads actual current pages and tells you what they say. The difference matters for anything time-sensitive: market intelligence, competitive monitoring, news tracking, technical documentation updates.

What it’s good for:

  • Monitoring a competitive landscape (new product announcements, pricing changes, blog posts)
  • Tracking a technology area (new releases, vulnerability disclosures, RFC updates)
  • Industry news synthesis (multiple sources filtered and summarized by topic)
  • Academic paper monitoring (new publications in a research area)
  • Price and availability monitoring (products, services, job postings)

What it’s not good for:

  • Real-time data (stock prices, live sports scores, anything requiring sub-minute freshness)
  • Paywalled content (the agent can’t read what it can’t access)
  • Deep investigative research requiring human judgment about source credibility

Before you build: defining your topics

The quality of the research pipeline is almost entirely determined by how well you define the topics. Vague topics produce vague research. Specific topics produce specific research.

Bad topic definition: “AI news”

Good topic definition: “New model releases from Anthropic, OpenAI, Google DeepMind, and xAI. Specifically: new model announcements, benchmark results, API pricing changes, and capability updates. Not: opinion pieces, tutorials, or business news unrelated to model capabilities.”

I want to build a research pipeline. Before we set up the cron job, help me define the topics correctly. For each topic I describe, refine it into a precise research brief with: (1) the core question I’m trying to answer, (2) the specific sources or site types to prioritize, (3) what to exclude, (4) the right search frequency (daily, weekly, or on-demand). Here are my topics: [paste your topics].

The research brief file

Save your topic definitions in a research brief file the cron job reads on every run:

Create /research/BRIEF.md with this structure for each topic: ## [Topic Name], Frequency: [daily/weekly/on-demand], Core question: [one sentence], Priority sources: [list], Exclude: [list], Output format: [bullet summary / structured report / raw notes]. I’ll use this file as the source of truth for what to research and how to format it.

The output format field matters. A bullet summary takes 30 seconds to read. A structured report takes 3 minutes. For daily monitoring topics, bullet summaries are usually right. For weekly deep dives, structured reports give you more to work with.

Setting up the search cron job

The search cron job is the core of the pipeline. It runs on the schedule defined in BRIEF.md, performs the searches, and writes results to a staging file for synthesis:

Create a cron job that runs daily at 6 AM. Task: Read /research/BRIEF.md and identify all topics with Frequency: daily. For each topic: (1) perform 3 web searches using the core question and related queries, (2) for each search result, fetch the content at the URL and read it, (3) filter out results that match the Exclude criteria, (4) write the raw findings to /research/staging/YYYY-MM-DD-[topic-name].md. Do not synthesize yet. Just collect and write. Use ollama/phi4:latest. No Telegram notification at this stage.

Why separate collection from synthesis

Running search collection and synthesis as separate steps costs less and produces better results. Collection with phi4 is cheap. Synthesis with a smarter model (deepseek-chat or Sonnet for complex topics) is more expensive but only runs on already-filtered material. Splitting also means the staging file is available for manual inspection before synthesis runs, which is useful during the first week of tuning.

The synthesis cron job

The synthesis job runs after collection, reads the staging files, and produces the final report:

Create a cron job that runs daily at 7 AM (one hour after collection). Task: For each staging file in /research/staging/ dated today: read the raw findings, read the corresponding topic brief from BRIEF.md, synthesize the findings into the output format specified in the brief. Write the synthesized report to /research/reports/YYYY-MM-DD-[topic-name].md. After all topics are synthesized, compile a daily digest at /research/reports/YYYY-MM-DD-digest.md that includes the top 3 findings across all topics. Send me the digest via Telegram. Use deepseek/deepseek-chat for synthesis.

What good synthesis looks like

The synthesis prompt inside the cron job instruction determines what you get back. Here’s the difference between a weak and a strong synthesis prompt:

Weak: “Summarize the findings.”

Strong synthesis instruction to embed in your cron job:

For each staging file, synthesize as follows: (1) Lead with the single most important finding from today’s research, stated as one concrete sentence. No hedging. If nothing significant happened, say “No significant developments today.” (2) List up to 5 additional findings in order of relevance to the core question. Each finding must include: what happened, source URL, and why it matters for the core question. (3) Flag anything that directly contradicts a previous finding or represents a significant change from the prior report. (4) Close with one sentence: the open question this research raises that I should watch next. Do not pad. Do not include findings that don’t answer the core question.

The “lead with the most important finding” instruction is the most valuable part. It forces the synthesis to make a judgment about what matters rather than listing everything equally. Over time, tuning this instruction to match what you actually care about is where the quality gains happen.

Source quality filtering

Not all search results are worth reading. The pipeline needs a filtering layer to skip low-quality sources before the synthesis stage:

Add to BRIEF.md a global ## Source Quality Rules section: Skip any result that is: (1) from a content farm or SEO spam site (check: does the URL have more than 3 directory levels? does the title match the query word-for-word with no variation?), (2) older than 30 days for daily topics and older than 90 days for weekly topics, (3) paywalled (unable to read full content after fetch), (4) a press release from the company being researched (flag it but don’t use it as the primary source). Apply these rules during the collection step before writing to the staging file.

Building a source allowlist

For specific research areas, you know which sources are authoritative. An allowlist beats filtering heuristics for those domains:

Create /research/SOURCES.md. Structure: one section per topic with a list of authoritative source domains for that topic. During collection, prioritize results from these domains over others when available. If an allowlisted source has a new article on the topic, always include it regardless of other filtering rules. If it doesn’t appear in search results, try fetching [domain]/blog, [domain]/news, and [domain]/releases directly.

Storing research history

The most valuable thing the pipeline builds over time is a history of findings. A report from today is useful. A year of reports on the same topic tells you how a landscape is shifting.

After each synthesis run, append the top finding for each topic to /research/history/[topic-name]-log.md with a date header. This log should never be overwritten, only appended. On the first of each month, read the full log for each topic and identify: (1) what changed most significantly over the past 30 days, (2) any trend that’s become consistent, (3) anything that was flagged as important in prior reports but hasn’t appeared since. Send me a monthly trend summary via Telegram.

Using history to improve queries

After 30 days of collection, the history log tells you which search queries actually surfaced useful results and which consistently found noise. Use this to refine BRIEF.md:

Read the staging files from the past 14 days for topic [name]. For each day, count how many of the collected results were actually included in the final synthesis vs. filtered out. Which search queries consistently produced useful results? Which consistently produced noise? Suggest 3 query refinements that would improve signal quality for this topic. Do not make changes yet, just report the recommendations.

SOTA model recommendations for research pipelines (March 2026)

Collection step

Use ollama/phi4:latest for the collection step. It handles web search, URL fetching, content reading, and basic quality filtering reliably. At 14.7B parameters locally, it processes a typical search result in 5-10 seconds. For 3 searches per topic and 5 results per search, collection for one topic takes roughly 2-3 minutes. Zero API cost.

Synthesis step

Use deepseek/deepseek-chat for standard synthesis. It produces well-structured reports with good judgment about what to lead with. At approximately $0.001-0.003 per synthesis run for a typical topic, a pipeline with 5 daily topics costs under $0.05/day. Synthesis quality significantly exceeds phi4 for the judgment calls: what’s most important, what changed, what the open question is.

Deep synthesis on complex topics

For topics requiring nuanced judgment (competitive intelligence where you need to read between the lines, technical analysis where accuracy is critical), route the synthesis to anthropic/claude-sonnet-4-6. Reserve this for weekly deep-dive topics where the cost is justified. At roughly $0.05-0.15 per complex synthesis, a weekly deep dive on one topic costs under $1/month.

Advanced: triggered research

Scheduled research runs whether or not anything happened. Triggered research runs when something specific happens. Both have their place:

Create a triggered research cron job that runs every 4 hours and checks for specific events: (1) any new post from the 5 company blogs in my SOURCES.md allowlist, (2) any new GitHub release for repositories listed in /research/REPOS.md, (3) any mention of my product name in the sources I monitor. If any trigger fires, immediately run collection and synthesis for that topic and send me a Telegram notification with the finding. Do not wait for the scheduled daily run.

Keyword alert integration

For specific terms you always want to know about immediately:

Create /research/ALERTS.md with a list of terms that should always trigger an immediate Telegram notification when found in any research collection: [your terms]. During every collection run, after writing to the staging file, scan the raw content for these terms. If any appear, send a Telegram message: “Alert: [term] found in today’s [topic] research. Source: [URL]. Snippet: [relevant 2-sentence excerpt].” Use ollama/llama3.1:8b for this scan since it’s simple text matching.
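Since this really is plain text matching, it helps to see how little logic is involved. A sketch of the scan, assuming the alert message format from the prompt above (the snippet window size is an arbitrary choice):

```python
def scan_for_alerts(raw_content: str, terms: list[str],
                    topic: str, source_url: str) -> list[str]:
    """Return one alert message per term found in the collected content (case-insensitive)."""
    alerts = []
    lowered = raw_content.lower()
    for term in terms:
        idx = lowered.find(term.lower())
        if idx == -1:
            continue
        # Pull a short window around the first match to use as the snippet
        snippet = raw_content[max(0, idx - 60): idx + 60].strip()
        alerts.append(f"Alert: {term} found in today's {topic} research. "
                      f"Source: {source_url}. Snippet: ...{snippet}...")
    return alerts
```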

Cost breakdown (March 2026 pricing)

  • Collection (ollama/phi4:latest, daily, 5 topics): $0.00, runs locally
  • Synthesis (deepseek-chat, daily, 5 topics): ~$0.01-0.05/day
  • Weekly deep dive (Sonnet, 1 topic): ~$0.05-0.15/week
  • Monthly trend analysis (deepseek-chat): ~$0.005/month
  • Keyword alert scanning (ollama/llama3.1:8b, runs locally): $0.00
  • Total for a 5-topic daily pipeline: roughly $2/month

Troubleshooting common issues

Search results are consistently off-topic

The core question in BRIEF.md is too vague, or the Exclude criteria are missing. Tighten the core question to include the specific entities you care about (named companies, named technologies, named people). Add 3-5 concrete exclusion terms based on what keeps showing up in the noise.

Synthesis is too long to read quickly

The synthesis instruction doesn’t have a word limit. Add “Maximum 200 words for bullet summary topics, 400 words for structured report topics. If you can’t fit within the limit, prioritize the top 3 findings and cut the rest.” Concise synthesis requires explicit constraints.

The pipeline is missing things I know happened

The search queries aren’t finding the right sources. Check the staging file for that day to see what the collection step actually found. If a known source isn’t appearing in results, add it to your SOURCES.md allowlist and fetch it directly on each run. Don’t rely on search alone for must-have sources.

Staging files are accumulating and using disk space

Add a cleanup cron job that runs weekly and deletes staging files older than 14 days. The synthesized reports in /research/reports/ and the history logs in /research/history/ should be kept indefinitely. The staging files are working files only and can be deleted after synthesis confirms completion.

Frequently asked questions

Can I run research pipelines on competitor products without getting blocked?

Fetching public web pages is generally lawful and routine. Most sites allow it. The practical issue is rate limiting: if the collection job fetches 20 URLs from the same domain in 2 minutes, some sites will block the IP or return errors. Add a delay between URL fetches for the same domain: “After fetching any URL, wait 10 seconds before fetching another URL from the same domain.” This keeps the pipeline well-behaved without slowing total collection significantly.
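If you want the delay logic to be precise rather than a blanket sleep, per-domain throttling only waits when two fetches hit the same domain within the gap. A sketch — the class name and injectable clock/sleep (handy for testing) are my own additions, not anything the pipeline mandates:

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum gap between fetches to the same domain; other domains go immediately."""
    def __init__(self, min_gap: float = 10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap
        self.clock = clock
        self.sleep = sleep
        self.last_fetch: dict[str, float] = {}  # domain -> time of last fetch

    def wait(self, url: str) -> float:
        """Block until this URL's domain is safe to hit; return how long we waited."""
        domain = urlparse(url).netloc
        now = self.clock()
        delay = 0.0
        if domain in self.last_fetch:
            delay = max(0.0, self.min_gap - (now - self.last_fetch[domain]))
        if delay:
            self.sleep(delay)
        self.last_fetch[domain] = now + delay
        return delay
```

Call `throttle.wait(url)` immediately before each fetch; alternating between domains costs nothing, while back-to-back fetches from one domain pay the full gap.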

How do I handle research topics where the web search quality is poor?

For niche topics where general web search doesn’t surface the right sources, switch to direct fetching from your allowlist. Instead of searching and then fetching, fetch directly from the 5-10 sources you know are authoritative for that topic and check for new content since the last run. This works better for technical topics (GitHub repos, specific blogs, documentation sites) where you know exactly where the information lives.

Can the pipeline read PDFs and documents, not just web pages?

Yes. The fetch step can retrieve PDFs. The agent can extract text from them. For research areas where the primary sources are PDF reports (academic papers, government reports, industry research), add a PDF fetch step to the collection cron job: “For any result URL ending in .pdf or linking to a downloadable document, fetch the PDF and extract the text before applying quality filtering.”

How do I share research reports with my team?

The simplest approach is a shared folder that the pipeline writes to, accessible via your team’s existing file sharing (Dropbox, Google Drive, a shared server path). More structured: the pipeline posts the daily digest to a dedicated Discord channel or Slack channel where your team reads it. Add to the synthesis cron: “After writing the digest to /research/reports/, also post it to Discord channel [research-updates] using the Discord message tool.”

Can I use this for academic literature monitoring?

Yes. For academic papers, add arXiv, Semantic Scholar, and Google Scholar to your SOURCES.md allowlist. The collection step should search for papers by keyword and author name, not just general web search. arXiv has an API that returns structured results, which is more reliable than web scraping. Add to the collection instruction: “For academic topics, query the arXiv API at https://export.arxiv.org/api/query?search_query=[terms]&start=0&max_results=10 and parse the XML response.”
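The arXiv API returns an Atom XML feed, so the parsing step is straightforward. A sketch of the query builder and feed parser — the `all:` field prefix is one common way to scope a keyword search, and the `(title, link)` tuple shape is just an illustrative choice:

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def arxiv_query_url(terms: str, max_results: int = 10) -> str:
    """Build an arXiv API query URL for a keyword search."""
    q = urllib.parse.quote(terms)
    return (f"https://export.arxiv.org/api/query"
            f"?search_query=all:{q}&start=0&max_results={max_results}")

def parse_arxiv_feed(xml_text: str) -> list[tuple[str, str]]:
    """Extract (title, link) pairs from an arXiv Atom response."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        title = entry.findtext(f"{ATOM}title", "").strip()
        link = entry.findtext(f"{ATOM}id", "").strip()  # the <id> element is the abstract URL
        papers.append((title, link))
    return papers
```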

What’s the right number of topics to monitor?

Start with 3. After two weeks, you’ll know which topics are producing useful output and which are noise. Add more only after the first 3 are tuned. Most operators find that 5-7 well-defined daily topics plus 2-3 weekly deep-dive topics are a sustainable load. Above 10 daily topics, collection time runs long and the digest becomes too long to read, which defeats the purpose.

Building a competitive intelligence pipeline

Competitive intelligence is one of the highest-value applications for a research pipeline. You want to know when a competitor ships something, changes pricing, posts a job that signals a strategic direction, or gets mentioned in coverage you should be aware of. Done manually, this takes 30-60 minutes per week and still misses things. Done right with a pipeline, it takes almost none of your time and misses far less.

What to monitor

For each competitor, define a monitoring brief that covers four layers:

  • Product changes: new features, pricing updates, deprecations, API changes. Source: their official blog, changelog, and release notes page fetched directly.
  • Job postings: what they’re hiring tells you what they’re building. A competitor who just posted 5 ML engineer roles is building something that will ship in 6-12 months. Source: their jobs page and LinkedIn.
  • Public mentions: what people are saying about them. Source: web search for “[competitor name] review”, “[competitor name] alternative”, “[competitor name] problem”.
  • Founder and executive activity: blog posts, conference talks, podcast appearances. What leadership talks about publicly usually precedes product direction. Source: their personal blogs and Twitter/X.

Add a competitive intelligence section to BRIEF.md for each competitor I track. For each competitor: Frequency: weekly. Core question: what changed this week that I need to know about? Priority sources: [competitor blog URL], [competitor changelog URL], [competitor jobs URL]. Exclude: press releases about funding rounds (unless over $50M), generic roundup articles, anything older than 7 days. Output format: structured report with sections: Product Changes, Hiring Signals, Public Sentiment, Executive Signals. If nothing changed in a category, write “No change this week.”

The “so what” layer

Raw competitive intelligence is only useful if it’s interpreted. Add an interpretation step to the synthesis:

After synthesizing the competitive report for each competitor, add a “So What” section. For each finding, answer: does this affect me directly? If yes, how should I respond and on what timeline? If no, is this worth tracking for future reference? The So What section should be concrete: “This is a direct feature overlap with our X capability. If they ship this in the next quarter, we need Y.” Not: “This might be relevant.” Use deepseek/deepseek-chat for this interpretation layer.

Building a technical monitoring pipeline

For developers and technical operators, a research pipeline that tracks the technical landscape is more valuable than general news. This means: new library releases, CVE disclosures, RFC changes, deprecation notices, and infrastructure updates that affect your stack.

Create a technical monitoring brief in BRIEF.md. Frequency: daily. Fetch these sources directly rather than searching: (1) https://github.com/[repo]/releases/latest for each repository in REPOS.md, (2) https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-recent.json for CVEs tagged with my technology stack keywords, (3) the changelog pages for each major dependency listed in /research/DEPS.md. For each new release or CVE that matches my stack, write a summary: what changed, severity (for CVEs), action required (yes/no), and recommended response. Send Telegram notification for any CVE with CVSS score above 7.0 or any major version bump in a critical dependency.

Dependency tracking

Create /research/DEPS.md with a list of my critical dependencies: package name, current version, ecosystem (npm/pip/cargo/etc), and criticality (critical/high/medium). During the daily technical monitoring run, check whether a newer version exists for each critical and high dependency by fetching the package registry API. If yes, note the version delta and changelog summary. If the new version is a major version bump, flag it as requiring manual review before upgrade. Send a weekly “dependency update summary” to Telegram every Monday morning.
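The classification logic the agent applies per dependency reduces to a version comparison. A minimal sketch, assuming simple `MAJOR.MINOR.PATCH` strings (real-world version schemes get messier; pre-release tags are crudely ignored here):

```python
def parse_version(v: str) -> tuple[int, ...]:
    """Parse '1.2.3' into a comparable tuple; non-numeric parts degrade to 0."""
    parts = []
    for p in v.lstrip("v").split("."):
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def classify_update(current: str, latest: str) -> str:
    """Return 'none', 'minor', or 'major-review' per the DEPS.md policy above."""
    cur, new = parse_version(current), parse_version(latest)
    if new <= cur:
        return "none"
    if new[0] > cur[0]:
        return "major-review"  # major bump: flag for manual review before upgrade
    return "minor"
```

The `latest` value would come from the registry API for the dependency's ecosystem (for example, PyPI's `https://pypi.org/pypi/<name>/json` or npm's `https://registry.npmjs.org/<name>/latest` both return JSON containing the latest version).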

Running a market intelligence pipeline

Market intelligence is broader than competitive intelligence: it’s the overall landscape your product operates in. Trends, regulatory changes, funding activity in your space, analyst reports, and macro signals that affect your market.

Add a market intelligence section to BRIEF.md. Frequency: weekly, runs Sunday at 8 PM so I have it Monday morning. Core question: what happened in [your market] this week that changes the landscape I’m operating in? Priority sources: TechCrunch, VentureBeat, relevant trade publications, and 3 industry analyst blogs. Exclude: generic how-to content, product reviews not related to the market category, anything more than 7 days old. Output format: structured report with sections: Funding Activity, Regulatory/Policy, Analyst Coverage, Macro Signals, Emerging Patterns. Under Emerging Patterns, identify any theme that appeared in 3 or more separate sources this week.

Funding activity tracking

Funding rounds in your space signal where capital is flowing and what problems investors think are worth solving. This is forward-looking intelligence that search results often miss until weeks later:

During the weekly market intelligence run, also fetch Crunchbase news and search for “[market category] funding 2026 [recent month]”. For each funding round found: company name, amount, stage, lead investor, and a one-sentence description of what they build. Flag any round over $10M as significant. Include a brief note on whether the funded company competes with, complements, or is adjacent to my product.

Integrating research output with your workflow

A research pipeline that produces reports you don’t read is a waste of infrastructure. The delivery format determines whether the research gets used:

Telegram digest format

For daily monitoring, a Telegram message is the right delivery mechanism. The constraint of Telegram’s message format (short, no complex formatting) forces the synthesis to be genuinely concise:

When sending the daily digest to Telegram, format it as: [date] Research Brief. Then for each topic with a significant finding today: bold topic name, one sentence finding, URL. For topics with no significant findings: skip entirely. Maximum 10 lines total. If there are more than 5 significant findings, include only the top 5 by importance and add “Full report: /research/reports/[date]-digest.md” at the bottom.
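To make the constraints concrete, here's what that formatting logic looks like as code. This is a sketch of the rules in the prompt above; the finding fields (`topic`, `finding`, `url`, `significant`, `importance`) are assumptions about what synthesis produces, and `*bold*` assumes Telegram's Markdown parse mode:

```python
def format_digest(day: str, findings: list[dict], max_findings: int = 5) -> str:
    """Build the Telegram digest: significant findings only, top N by importance, 10 lines max."""
    significant = [f for f in findings if f.get("significant")]
    significant.sort(key=lambda f: f["importance"], reverse=True)
    lines = [f"{day} Research Brief"]
    for f in significant[:max_findings]:
        lines.append(f"*{f['topic']}*: {f['finding']} {f['url']}")
    if len(significant) > max_findings:
        lines.append(f"Full report: /research/reports/{day}-digest.md")
    return "\n".join(lines[:10])  # hard cap at 10 lines total
```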

Weekly email digest

For weekly deep-dive reports, email is often better than Telegram. The longer format works, and you can read it properly rather than on a phone screen:

After generating the weekly research reports, compile them into a single email digest and send to [your email]. Format: plain text, no HTML. Subject line: “Weekly Research: [date]: [top finding in 8 words]”. Body: one section per topic, each starting with a bold one-sentence summary. Include source URLs. Close with “Open questions: [3 things the research raised that need watching].” Use deepseek-chat to write the subject line and open questions section.

Discord channel for team sharing

After generating each weekly competitive intelligence report, post a summary to Discord channel [research-updates]. Format for Discord: use ** for bold headers, bullet points for findings, include the full source URL for each finding. Pin reports that contain significant findings (anything that affects our product directly). Use the discord message tool with action=send, target=[channel-id].

Validating research quality

After two weeks of running, validate that the pipeline is producing useful output. Start with the first-week checklist below, then run three checks: coverage, signal-to-noise, and actionability.

First week checklist

During the first week, treat the pipeline as being in calibration mode. The reports are not yet trustworthy, but the errors are telling you exactly what to fix. Run through this checklist on day 7:

  • Read all 7 daily digest reports. Circle findings that were genuinely useful. Put an X next to findings that were noise or irrelevant.
  • Check the staging files for days where the digest felt thin. Were there actually useful results that the synthesis filtered out, or was collection itself the problem?
  • Open BRIEF.md and update the Exclude criteria based on what you marked as noise. Be specific: add exact site domains, topic phrases, or content types to exclude.
  • Check the error log if any collection runs failed. Update the SOURCES.md allowlist with any authoritative sources that weren’t showing up in search results.
  • Adjust the synthesis word limit if the reports are too long to read in one sitting. Most operators need to tighten this after week one.

After the calibration week, the pipeline should require no more than 5 minutes of tuning per month to stay useful.

Coverage check

Think back over the past 14 days: did anything happen in my research topics that I noticed through other means (colleague mention, social media, direct notification) that the pipeline didn’t surface? List those misses. For each one, trace why the pipeline missed it: wrong search query, excluded by quality filter, source not in allowlist, or happened too fast for the collection schedule. Use the misses to update BRIEF.md.

Signal-to-noise check

Read the staging files from the past 7 days. Count: total results collected vs. results included in synthesis. What percentage made it through? If more than 60% of collected results are being filtered out in synthesis, the collection step is working too broadly. Tighten the search queries or add exclusion criteria to BRIEF.md. If fewer than 20% are being filtered, either the collection is too narrow or the quality filtering is too permissive.
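Those thresholds translate directly into a small check you could run over the counts. A sketch of the classification, with the 60%-filtered and 20%-filtered boundaries from above:

```python
def signal_ratio(collected: int, synthesized: int) -> str:
    """Classify the collection-to-synthesis pass rate against the tuning thresholds."""
    if collected == 0:
        return "no-data"
    kept = synthesized / collected
    if kept < 0.4:   # more than 60% filtered out: collection is working too broadly
        return "too-broad"
    if kept > 0.8:   # fewer than 20% filtered: too narrow, or filtering too permissive
        return "too-narrow-or-permissive"
    return "healthy"
```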

Actionability check

The most important question: did anything in the past two weeks of research cause you to take a specific action you wouldn’t have taken without it? If the answer is no, the pipeline is producing interesting information but not intelligence. The fix is usually in the So What layer of the synthesis prompt, not in the collection step.

Common reasons research doesn’t drive action: the synthesis isn’t making judgments about relevance (it’s listing findings equally instead of prioritizing), the topics are too broad to produce specific findings, or the delivery format buries the important things. Fixing the synthesis prompt is almost always the highest-leverage change. The collection can be perfect and the research still useless if synthesis doesn’t tell you what to do with it.

What a well-tuned pipeline looks like after 90 days

After 90 days of running and calibrating, a well-built research pipeline produces something qualitatively different from what it produced in week one. Here’s what to expect:

  • The history log becomes genuinely useful. A topic you’ve been tracking for 90 days has a searchable history of what changed and when. That history is context no search engine gives you. When a competitor makes a move, you can read back 90 days of monitoring and understand whether it’s a new direction or a pattern they’ve been building toward.
  • The exclusion criteria get tight. After 90 days you’ve seen most of the noise patterns for your topics. The staging files are leaner. Synthesis runs faster because there’s less to filter.
  • The synthesis starts surfacing things you didn’t know to look for. The “Emerging Patterns” and “Open Questions” sections of the synthesis improve as the model has more history context to work from. Things that looked isolated in week one look like patterns in month three.
  • You stop checking manually. Early on, most operators still check sources directly because they don’t trust the pipeline yet. By month three, the pipeline has earned enough trust that the digest is the primary read. This is when the time savings become real.

The 90-day timeline is not arbitrary. Research pipelines need history to produce intelligence. The first 30 days are calibration. Days 30-60 are the pipeline producing reliable but shallow output. Days 60-90 are where the pattern recognition starts. Build it expecting to run it for 90 days before judging whether it’s working.

More frequently asked questions

How many web searches can the pipeline run per day without hitting rate limits?

The web search tool (Brave API) used by OpenClaw has a default rate limit of 1 request per second. For a pipeline with 5 topics and 3 searches per topic, that’s 15 searches per day, well within limits. If you’re running triggered research in addition to scheduled, add a 3-second delay between searches to stay safe. The collection cron job should never run searches in parallel, always in sequence.

Can the pipeline read content behind a login wall?

No. The agent can only read publicly accessible URLs. If a critical source requires login (LinkedIn, paywalled publications), you have two options: (1) find the same information through public RSS feeds or alternative URLs, or (2) manually export the relevant content to a file the pipeline reads. Some publications offer free email newsletters with the same content as their paywalled articles. Those newsletters are often better source material anyway.

How do I handle research on very fast-moving topics where daily is too slow?

Increase the collection frequency to every 4 hours for that specific topic. Keep the synthesis at daily. Running synthesis every 4 hours is expensive and produces more noise than signal unless you have very specific triggers. The triggered research cron job described earlier is the right pattern: collect every 4 hours, synthesize only when a trigger fires.

What’s the best way to search for academic papers?

arXiv’s API is the most reliable for pre-prints. For published research, Semantic Scholar has a free API (semanticscholar.org/api-docs) that supports keyword and citation search. Add these to your collection step instead of relying on general web search for academic topics. The structured API responses are easier to parse and filter than scraped web pages. For citation tracking (knowing when a specific paper gets cited by new work), Semantic Scholar’s citations endpoint is the right call: GET https://api.semanticscholar.org/graph/v1/paper/[paper-id]/citations. You can track 5-10 papers this way on a weekly schedule without hitting rate limits.

Can I have the pipeline monitor social media?

Twitter/X API access requires a developer account and is now paid. Reddit has a public API with reasonable rate limits. For monitoring discussions about your topic on Reddit, use the search endpoint: https://www.reddit.com/search.json?q=[terms]&sort=new&t=week. This doesn’t require authentication for read access and returns JSON. Add it to collection for topics where community discussion is a signal you care about.
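The Reddit listing format nests posts under `data.children`, so extracting what the pipeline needs is a short parse. A sketch, returning `(title, permalink, score)` tuples as an illustrative shape:

```python
import json
import urllib.parse

def reddit_search_url(terms: str) -> str:
    """Build the unauthenticated Reddit search URL, newest posts from the past week."""
    q = urllib.parse.quote(terms)
    return f"https://www.reddit.com/search.json?q={q}&sort=new&t=week"

def parse_reddit_results(payload: str) -> list[tuple[str, str, int]]:
    """Extract (title, full permalink, score) for each post in a search response."""
    body = json.loads(payload)
    posts = []
    for child in body.get("data", {}).get("children", []):
        d = child.get("data", {})
        posts.append((d.get("title", ""),
                      "https://www.reddit.com" + d.get("permalink", ""),
                      d.get("score", 0)))
    return posts
```

One practical note: set a descriptive User-Agent header when fetching, since Reddit throttles generic clients aggressively.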

How do I know when to add a new topic vs. refine an existing one?

Add a new topic when you have a distinct question that isn’t answered by existing topics. Refine an existing topic when the output isn’t matching what you actually need. The most common mistake is adding new topics when the real problem is that existing topics are defined too broadly. Before adding a topic, check: does any current topic partially cover this? If yes, refine that one first.

What happens if a source URL changes or goes down?

The collection step will log an error for that URL and continue with other sources. After the run, check /research/logs/ for fetch errors. For allowlist sources that fail more than twice, update the URL in SOURCES.md. For search-based collection, a failed URL fetch just means that result is skipped, which is acceptable. The other results still go through. Add to your cron job: “After collection, count URLs that returned errors. If more than 30% of fetches failed, send a Telegram alert: ‘Research collection degraded: [X]% fetch failure rate today.’”
