Resources

Everything you need to get started and build with Arkus — documentation, templates, and insights to support your journey.

URL

What it is and how it can be used:

The URL component fetches content from one or more web pages and can follow links recursively, up to the configured Depth. It acts as a web scraper: it downloads page content, extracts data, and moves through multiple pages by following hyperlinks. This enables automated web content retrieval and site crawling within workflows.
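
The recursive link following described above can be pictured with a minimal Python sketch. It is not the component's actual implementation: the names crawl, LinkCollector, fetch, depth, and prevent_outside are hypothetical, and the sketch assumes that a depth of 1 means only the starting pages are fetched. The fetch function is passed in, so the example runs against an in-memory site with no network access.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_urls, fetch, depth=1, prevent_outside=True):
    """Return {url: html} for pages reached within `depth` levels.

    Assumption: depth=1 fetches only the starting pages, depth=2 also
    fetches the pages they link to, and so on.
    """
    start_urls = list(start_urls)
    allowed_hosts = {urlparse(u).netloc for u in start_urls}
    pages = {}
    frontier = start_urls
    for _level in range(depth):
        next_frontier = []
        for url in frontier:
            if url in pages:
                continue  # already fetched via another link
            html = fetch(url)
            pages[url] = html
            collector = LinkCollector()
            collector.feed(html)
            for href in collector.links:
                # Resolve relative links and drop "#fragment" suffixes.
                link, _fragment = urldefrag(urljoin(url, href))
                parts = urlparse(link)
                if parts.scheme not in ("http", "https"):
                    continue  # skip mailto:, javascript:, etc.
                if prevent_outside and parts.netloc not in allowed_hosts:
                    continue  # stay on the starting site(s)
                if link not in pages:
                    next_frontier.append(link)
        frontier = next_frontier
    return pages


# Hypothetical in-memory site used instead of real HTTP requests.
demo_site = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.org/x">X</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": "last page",
    "https://other.org/x": "outside page",
}
pages = crawl(["https://example.com/"], demo_site.__getitem__, depth=2)
print(sorted(pages))
```

In this sketch, raising depth widens the crawl by one link level at a time, and prevent_outside keeps it on the hosts of the starting URLs.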

When/how the component should be used:

  • Use when you need to fetch content from one or more URLs, process it, and return it in the format selected under Output Format.
  • Ideal for web scraping and data extraction from websites.
  • Use Chat Input to supply a valid URL pointing to the desired content.
  • The component fetches the page and extracts its content.
  • If the content is long, pass the output to Split Text for chunking.
  • Send the chunks to the Embedding Model or Language Model for processing.
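
To illustrate the chunking step in the workflow above, here is a minimal Python sketch of splitting long fetched text into overlapping chunks. It is only an approximation of what a component like Split Text might do: the function name split_text and the parameters chunk_size and overlap are assumptions for illustration, not documented Arkus settings.

```python
def split_text(text, chunk_size=1000, overlap=200):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that content cut at a
    chunk boundary still appears with some surrounding context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Hypothetical usage: `page_text` stands in for the URL component's output.
page_text = "word " * 500
chunks = split_text(page_text, chunk_size=400, overlap=50)
print(len(chunks), "chunks ready for the Embedding Model or Language Model")
```

A character-based split is the simplest option. Splitting on sentences or tokens is a common alternative when chunk boundaries matter for the downstream model.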

Connections with other components:

  • Chat Output
  • Text Input
  • Agent Core
  • API Request
  • Directory
  • News Search
  • RSS Reader
  • SQL Database
  • Web Search
  • Language Model
  • If-Else
  • Batch Run
  • DataFrame Operations
  • LLM Router
  • Parser
  • Python Interpreter
  • Save File
  • Smart Function
  • Split Text
  • Structured Output
  • Type Convert
  • Listen
  • Loop
  • Notify
  • Smart Router
  • Calculator
  • Anonymization
  • Guardrail
  • Human-in-the-loop
  • Bing Search API
  • ChromaDB

In tool mode:
  • Agent Core
  • Human-in-the-loop

Configurable settings:

  • URLs (write the URLs)
  • Depth

Default settings:

  • URLs (write the URLs)
  • Depth

Control Section:

  • URLs
  • Depth
  • Prevent Outside
  • Use Async
  • Output Format
  • Timeout
  • Headers
  • Filter Text/HTML
  • Continue on Failure
  • Check Response Status
  • Autoset Encoding
  • Actions in tool mode

Default values:
  • Depth = 1
  • Prevent Outside = on
  • Use Async = on
  • Output Format = Text
  • Timeout = 30
  • Filter Text/HTML = on
  • Continue on Failure = on
  • Autoset Encoding = on

In tool mode:
  • Actions = FETCH_CONTENT
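
For readers who think in code, the controls and default values listed above can be summarized as a small Python settings object. This is only an illustrative sketch. The class name UrlSettings, the field names, and the validate method are hypothetical and do not represent an Arkus API. Settings with no default listed above (Headers, Check Response Status) are left empty or unset rather than guessed, and the unit of Timeout is not stated in the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class UrlSettings:
    """Hypothetical mirror of the URL component's Control Section."""

    urls: List[str] = field(default_factory=list)
    depth: int = 1                       # listed default: 1
    prevent_outside: bool = True         # listed default: on
    use_async: bool = True               # listed default: on
    output_format: str = "Text"          # listed default: Text
    timeout: int = 30                    # listed default: 30 (unit not stated)
    headers: Dict[str, str] = field(default_factory=dict)  # no default listed
    filter_text_html: bool = True        # listed default: on
    continue_on_failure: bool = True     # listed default: on
    check_response_status: Optional[bool] = None  # no default listed
    autoset_encoding: bool = True        # listed default: on

    def validate(self) -> None:
        """Basic sanity checks; these rules are assumptions for illustration."""
        if not self.urls:
            raise ValueError("at least one URL is required")
        if self.depth < 1:
            raise ValueError("depth must be at least 1")
        if self.timeout <= 0:
            raise ValueError("timeout must be positive")


# Only the URLs need to be supplied; everything else falls back to the defaults.
settings = UrlSettings(urls=["https://example.com"])
settings.validate()
print(settings)
```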

Desired Behaviour:

  • Load content.
  • Clearly show the source.