ByteKit

RAG ingestion

Stop embedding page chrome

ByteKit turns docs, blogs, help centers, changelogs, and sitemaps into clean markdown with useful metadata. Build a corpus from the page content, not the cookie banner and footer.

50 MB free. No credit card.

Proof points

Why teams pick ByteKit for ingestion

Fast page capture

Straightforward pages can return in under a second.

Works on protected public sites

Same API shape when pages get harder.

Cost-aware rebuilds

Cache hits cost half. Failed zero-byte runs cost $0.

The problem

Your vector store is full of navbars

Page chrome everywhere

Nav, cookie banners, footers, ads, popups. Your retriever now has opinions about menus.

Stale docs

Changelogs move. Docs update. Old embeddings keep answering like nothing happened.

Fragile ingestion

One template change breaks the corpus build.

Code example

Two calls. One corpus.

1. Discover public URLs

POST /v1/sitemap
{
  "url": "https://stg.bytekit.com",
  "strategy": "standard",
  "webhook_url": "https://api.acme.com/bytekit-hook"
}

2. Capture discovered URLs in bulk

POST /v1/scrape/bulk
{
  "items": [{ "url": "...", "formats": ["markdown"] }],
  "webhook_url": "https://api.acme.com/bytekit-hook"
}
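
As a rough illustration of how the two request bodies above might be assembled in application code, here is a minimal Python sketch. It only builds the JSON payloads and sends nothing, because this page does not state the API host or how the key is passed. The function names are hypothetical, the de-duplication step is an added convenience rather than anything ByteKit requires, and the example.com URLs are placeholders.

```python
import json
from typing import Any


def sitemap_request(site_url: str, webhook_url: str, strategy: str = "standard") -> dict[str, Any]:
    """Body for POST /v1/sitemap, using only the fields shown on this page."""
    return {"url": site_url, "strategy": strategy, "webhook_url": webhook_url}


def bulk_scrape_request(urls: list[str], webhook_url: str) -> dict[str, Any]:
    """Body for POST /v1/scrape/bulk: one markdown item per distinct URL."""
    seen: set[str] = set()
    items = []
    for url in urls:
        if url in seen:
            continue  # assumption: sending a URL twice is never useful
        seen.add(url)
        items.append({"url": url, "formats": ["markdown"]})
    return {"items": items, "webhook_url": webhook_url}


if __name__ == "__main__":
    hook = "https://api.acme.com/bytekit-hook"
    print(json.dumps(sitemap_request("https://stg.bytekit.com", hook), indent=2))

    # Placeholder URLs; in practice these would come from the sitemap results.
    discovered = ["https://example.com/docs/a", "https://example.com/docs/b"]
    print(json.dumps(bulk_scrape_request(discovered, hook), indent=2))
```

How discovered URLs are delivered to your webhook is not described here, so the hand-off between the two calls is left to the caller.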

Result envelope: status, scrapeId, finalUrl, contentLength, formats
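
A webhook receiver will usually want to validate that envelope before storing anything. The sketch below assumes the five field names listed above arrive as top-level JSON keys; the value types, the shape of `formats`, and the `ScrapeResult` and `parse_envelope` names are assumptions made for illustration, not documented ByteKit behavior.

```python
from dataclasses import dataclass
from typing import Any

# Field names taken from the result envelope listed above.
REQUIRED_FIELDS = ("status", "scrapeId", "finalUrl", "contentLength", "formats")


@dataclass
class ScrapeResult:
    status: Any          # type not stated on this page; kept as received
    scrape_id: str
    final_url: str
    content_length: int  # assumed to be a byte count
    formats: Any         # shape not stated on this page; kept as received


def parse_envelope(payload: dict[str, Any]) -> ScrapeResult:
    """Validate a webhook payload and convert it to a typed record."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"envelope is missing fields: {', '.join(missing)}")
    return ScrapeResult(
        status=payload["status"],
        scrape_id=str(payload["scrapeId"]),
        final_url=str(payload["finalUrl"]),
        content_length=int(payload["contentLength"]),
        formats=payload["formats"],
    )


def is_empty(result: ScrapeResult) -> bool:
    """True for zero-byte results, which are not worth embedding."""
    return result.content_length == 0
```

Rejecting incomplete envelopes early keeps half-formed records out of the corpus, and the zero-byte check gives the ingestion job a simple way to skip failed captures.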

Ingestion flow

From website to corpus.

Cleaner inputs do not fix every retrieval problem. They do remove a lot of garbage before it reaches your vector store.

ByteKit:
01 Discover sitemap or URL list
02 Capture pages as markdown with metadata
03 Store source markdown and metadata

Your app:
04 Chunk + embed in your app
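
Step 04 happens on your side, so the approach is up to you. One common option, sketched here under the assumption that captured pages arrive as ordinary markdown with `#` headings, is to split at headings and cap each section by length. The `chunk_markdown` name and the 1500-character default are illustrative choices, not ByteKit recommendations.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then text.
HEADING = re.compile(r"^#{1,6}\s+\S")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def chunk_markdown(markdown: str, max_chars: int = 1500) -> list[str]:
    """Split markdown at headings, then cap each section at max_chars."""
    sections: list[list[str]] = [[]]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' lines inside code are not headings
        if HEADING.match(line) and not in_fence and sections[-1]:
            sections.append([])  # a heading starts a new section
        sections[-1].append(line)

    chunks: list[str] = []
    for lines in sections:
        text = "\n".join(lines).strip()
        # Oversized sections are cut by character count, not by sentence.
        for start in range(0, len(text), max_chars):
            piece = text[start:start + max_chars].strip()
            if piece:
                chunks.append(piece)
    return chunks


if __name__ == "__main__":
    page = "# Install\nRun the installer.\n\n## Configure\nSet the API key.\n"
    for chunk in chunk_markdown(page):
        print(repr(chunk))
```

Splitting on headings keeps each chunk tied to one topic. The character cap is a blunt fallback, and a token-aware or sentence-aware splitter may suit your embedding model better.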

What you get

Source material, not page furniture.

Cleaner markdown

Headings, links, text, and useful structure without the worst page furniture.

Source metadata

Title, final URL, content type, status, byte count, redirects.

Results as they finish

Submit known URLs and receive a webhook as each result finishes.

Public URL discovery

Walk a site's sitemap instead of hand-rolling discovery.

Cache-aware rebuilds

Trade freshness for cost when rebuilding known corpora.

Screenshots when needed

Capture visual context for QA.
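
To make the cache-aware rebuild trade-off concrete, the sketch below estimates a rebuild's cost relative to capturing every page fresh. It relies only on the "cache hits cost half" figure from this page and assumes cost scales with page count and that pages are similar in size; the function name is hypothetical and no actual prices are implied.

```python
def relative_rebuild_cost(cache_hit_fraction: float) -> float:
    """Cost of a rebuild as a fraction of an all-fresh capture.

    Fresh captures count as 1.0 each and cache hits as 0.5 each.
    """
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    fresh_fraction = 1.0 - cache_hit_fraction
    return fresh_fraction + 0.5 * cache_hit_fraction


if __name__ == "__main__":
    for hits in (0.0, 0.5, 1.0):
        print(f"{hits:.0%} cache hits: {relative_rebuild_cost(hits):.2f}x of a fresh rebuild")
```

Under these assumptions the saving tops out at half the fresh cost, reached only when every page is served from cache.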

Good fits

Where ByteKit fits.

ByteKit fits when your source is the public web and your downstream system needs text it can actually use.

  • Developer docs and API references
  • Product docs, help centers, changelogs
  • Public knowledge bases and policy pages
  • Blog archives and research libraries
  • Evaluation sets that need reproducible source snapshots

ByteKit is not your vector database, chunking framework, or answer ranking system. It gives you cleaner source data so the rest of the pipeline has less junk to fight.

Build the corpus from the content, not the layout.
