# Crawler

## Overview

The **Crawler Integration** enables automated data extraction from specified web sources by connecting your platform with external URLs. It allows organizations to fetch, index, and manage publicly available or authorized content for **Content Management** purposes. This integration is particularly useful for aggregating knowledge base articles, documentation, or website content into a centralized system for easy access and analysis.

### Prerequisites

Ensure the following requirements are fulfilled before initiating the integration:

* The **Base URL(s)** of the sites to be crawled.
* Permission to access the target web content (public or authorized sources).
* Access to the platform (**TheLoops**) with **Admin role** privileges.
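
One way to sanity-check crawl permission for a public site is to look at its `robots.txt` before adding the URL. The sketch below is an illustration only: this page does not state whether the Crawler integration consults `robots.txt` or which user agent it sends, so the helper name `is_crawl_allowed`, the `"*"` user agent, and the `docs.example.com` URLs are all assumptions. It uses Python's standard `urllib.robotparser` on a `robots.txt` body you have already downloaded.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper: the user agent used by the real crawler is not
    documented here, so "*" (rules for all agents) is assumed.
    """
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network call is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder robots.txt content for an imaginary documentation site.
sample_robots = """\
User-agent: *
Disallow: /private/
"""

public_ok = is_crawl_allowed(sample_robots, "https://docs.example.com/guides/setup")
private_ok = is_crawl_allowed(sample_robots, "https://docs.example.com/private/notes")
print(public_ok, private_ok)
```

A `robots.txt` rule is only a hint about the site owner's wishes. Content behind a login or covered by terms of use still needs explicit authorization.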

### Best practices

* **Provide Valid and Accessible URLs:**\
  Ensure all Base URLs are correct, reachable, and authorized for crawling.
* **Use Structured Content Sources:**\
  Prefer well-structured websites or documentation portals for better data extraction quality.
* **Avoid Restricted or Dynamic Pages:**\
  Confirm that the crawler is permitted to access the content, and avoid pages that require frequent authentication or rely on dynamic rendering.
* **Validate Integration Post Setup:**\
  Always perform a **Test Connection** to confirm successful configuration.
* **Optimize URL List:**\
  Add only relevant URLs to avoid unnecessary data processing and improve performance.
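
The "valid URLs" and "optimized list" practices above can be partly automated before anything is typed into the platform. The following is a minimal sketch, not part of TheLoops: `clean_base_urls` is a hypothetical helper, the `example.com` addresses are placeholders, and the rule that only `http`/`https` URLs are accepted is an assumption. It checks syntax and removes duplicates only; it does not test whether a URL is reachable.

```python
from urllib.parse import urlsplit


def clean_base_urls(candidates: list[str]) -> tuple[list[str], list[str]]:
    """Split candidate URLs into (accepted, rejected), dropping duplicates.

    Hypothetical pre-check: accepts only absolute http/https URLs
    (an assumption, not a documented platform rule).
    """
    accepted: list[str] = []
    rejected: list[str] = []
    seen: set[str] = set()
    for raw in candidates:
        url = raw.strip()
        try:
            parts = urlsplit(url)
        except ValueError:
            # urlsplit raises ValueError for malformed hosts such as "http://[bad".
            rejected.append(raw)
            continue
        if parts.scheme not in ("http", "https") or not parts.netloc:
            rejected.append(raw)
            continue
        # Treat host case and a trailing slash as insignificant when deduplicating.
        key = f"{parts.scheme}://{parts.netloc.lower()}{parts.path.rstrip('/')}"
        if parts.query:
            key += "?" + parts.query
        if key in seen:
            continue
        seen.add(key)
        accepted.append(url)
    return accepted, rejected


accepted, rejected = clean_base_urls([
    "https://docs.example.com/kb/",
    "https://docs.example.com/kb",        # duplicate without trailing slash
    " https://DOCS.example.com/kb ",      # duplicate with different case and spaces
    "ftp://files.example.com",            # unsupported scheme
    "docs.example.com/faq",               # missing scheme
    "https://help.example.com/articles",
])
print(accepted)
print(rejected)
```

Reviewing the `rejected` list by hand is still worthwhile, since an entry such as `docs.example.com/faq` is usually just missing its `https://` prefix.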

## Setup instructions <a href="#setup-instructions" id="setup-instructions"></a>

### Initiate Integration <a href="#step-1-initiate-integration" id="step-1-initiate-integration"></a>

{% stepper %}
{% step %}
Log in to **TheLoops** using an **Admin** account.
{% endstep %}

{% step %}
Navigate to the **Settings → Integrations** module.
{% endstep %}

{% step %}
Click on the **“+ Add Integration”** button.
{% endstep %}

{% step %}
Search for the **Crawler** integration and open it.
{% endstep %}
{% endstepper %}

### Configuration

{% stepper %}
{% step %}
In the **Configuration** tab, enter a suitable **Integration Name**.
{% endstep %}

{% step %}
The **Content Management** capability is selected by default for this integration.
{% endstep %}

{% step %}
In the **Base URL** field:

1. Enter the URL(s) you want to crawl.
2. To crawl several sources, add multiple URLs as a list.
{% endstep %}

{% step %}
Click on the **Connect** button.
{% endstep %}

{% step %}
Upon successful configuration:

* The integration is added.
* A success message is displayed.
{% endstep %}
{% endstepper %}

### Verification

{% stepper %}
{% step %}
Navigate back to the **Integrations** module.
{% endstep %}

{% step %}
Verify that the newly added integration appears at the top of the list.
{% endstep %}

{% step %}
Click on the **Test Connection** icon (beside the delete icon).
{% endstep %}

{% step %}
If the test succeeds, a confirmation message is displayed, indicating that the integration is configured correctly.
{% endstep %}
{% endstepper %}

{% hint style="info" %}
Webhook configuration is **not required** for the Crawler integration because it is pull-based: the crawler fetches content from the configured Base URLs rather than waiting for events to be pushed to it.
{% endhint %}

