SearchBlox

The SearchBlox HTTP-API enables you to index and search web content using simple HTTP POST and GET requests. With it you can add and delete HTTP collections, update their paths and settings, schedule indexing, and start or stop indexing a collection. All methods are REST requests with JSON payloads.

Adding a Collection

Create a new collection. You can create HTTP, file and database collections using this API.

Index URL

http://localhost:8080/searchblox/rest/collection/add

Method

POST

Media Type

application/json

Headers

content-type: application/json
accept: application/json

Document Syntax

Create HTTP Collection

{
      "apikey" : "61282E82E5D6D8D409EFC87E8415CDAA",
      "colname":"httpcollection",
      "coltype":"http",
      "language":"en"
}

Create File Collection

{
      "apikey" : "61282E82E5D6D8D409EFC87E8415CDAA",
      "colname":"filecollection",
      "coltype":"file",
      "language":"fr"
}

Create Database Collection

{
      "apikey" : "61282E82E5D6D8D409EFC87E8415CDAA",
      "colname":"dbcollection",
      "coltype":"db",
      "language":"de"
}

Document Description

| JSON Fields | Value |
| --- | --- |
| apikey | API key accessible in the SearchBlox Admin Console. It is also present in the config.xml file. |
| colname | Name of the collection. |
| coltype | Type of the collection. Use http, file or db for an HTTP, file or database collection respectively. |
| language | Language of the collection, specified as a two-letter code. See https://developer.searchblox.com/v8.6.5/docs/supported-languages#section-language-codes |

Response Codes

| Response Code | Message |
| --- | --- |
| 601 | Invalid API Key |
| 50001 | Collection Already Exists / Collection Type not found |
| 5000 | Collection created successfully |
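
As a rough illustration of the request described above, the following Python sketch builds the POST request with the standard library only. The helper names (`build_add_request`, `send`) and the `YOUR_API_KEY` placeholder are illustrative assumptions and not part of SearchBlox; the URL, headers and JSON fields are taken from this section. The response is returned as raw text because the body format is not described here.

```python
import json
import urllib.request

# Endpoint as given in this section; adjust host and port for your install.
ADD_URL = "http://localhost:8080/searchblox/rest/collection/add"

# Collection types used in the sample payloads above.
VALID_TYPES = {"http", "file", "db"}


def build_add_request(apikey, colname, coltype, language):
    """Return an unsent POST request that creates a collection."""
    if coltype not in VALID_TYPES:
        raise ValueError(f"coltype must be one of {sorted(VALID_TYPES)}")
    payload = {
        "apikey": apikey,
        "colname": colname,
        "coltype": coltype,
        "language": language,
    }
    return urllib.request.Request(
        ADD_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the response body as unparsed text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


req = build_add_request("YOUR_API_KEY", "httpcollection", "http", "en")
# print(send(req))  # uncomment to run against a live SearchBlox instance
```

Keeping the request construction separate from `send` makes it possible to inspect or log the payload before anything reaches the server.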

Deleting a Collection

You can delete collections using this API.

Index URL

http://localhost:8080/searchblox/rest/collection/delete

Method

POST

Media Type

application/json

Headers

content-type: application/json
accept: application/json

Document Syntax

{
      "apikey" : "61282E82E5D6D8D409EFC87E8415CDAA",
      "colname":"httpcollection"
}

Document Description

| JSON Fields | Value |
| --- | --- |
| apikey | API key accessible in the SearchBlox Admin Console. It is also present in the config.xml file. |
| colname | Name of the collection. |
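
A minimal Python sketch of the delete call, assuming the same headers as above. The function name `build_delete_request` and the `YOUR_API_KEY` placeholder are hypothetical; only the URL and the two JSON fields come from this section. Because deletion is destructive, the sketch builds the request without sending it.

```python
import json
import urllib.request

# Endpoint as given in this section; adjust host and port for your install.
DELETE_URL = "http://localhost:8080/searchblox/rest/collection/delete"


def build_delete_request(apikey, colname):
    """Return an unsent POST request that deletes the named collection."""
    body = json.dumps({"apikey": apikey, "colname": colname}).encode("utf-8")
    return urllib.request.Request(
        DELETE_URL,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


req = build_delete_request("YOUR_API_KEY", "httpcollection")
# To actually send it: urllib.request.urlopen(req).read()
```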

Update the Collection Path

You can update HTTP collection path settings to configure rooturls, allowpaths, disallowpaths and formats for indexing.

Index URL

http://localhost:8080/searchblox/rest/collection/updatePath

Method

POST

Media Type

application/json

Headers

content-type: application/json
accept: application/json

Document Syntax

{
  "apikey": "61282E82E5D6D8D409EFC87E8415CDAA",
  "colname": "httpcollection",
  "rooturls": [
    "http://www.google.co.in",
    "http://www.bing.com"
  ],
  "allowpaths": [
    ".*"
  ],
  "disallowpaths": [
    "http://www.google.co.in/test/bingo"
  ],
  "allowformat": [
    "HTML",
    "text"
  ]
}

Document Description

| JSON Fields | Value |
| --- | --- |
| apikey | API key accessible in the SearchBlox Admin Console. It is also present in the config.xml file. |
| colname | Name of the collection. |
| rooturls | The root URL is the starting URL for the spider. The spider requests this URL, indexes the content, and follows links from it. Make sure the root URL has regular HTML HREF links that the spider can follow. |
| allowpaths | Limits the spider to the given path or list of paths. Example: http://www.searchblox.com/ (keeps the spider within the searchblox.com site). |
| disallowpaths | The path or list of paths that you do not want the spider to crawl or index. |
| allowformats | The file types that are to be indexed. File types other than those specified here will not be indexed. |
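
The path payload can be assembled programmatically. The sketch below is one possible approach in Python; `build_path_payload` and `as_list` are hypothetical helper names, and the format key is spelled `allowformat` because that is how it appears in the sample payload above. The check against bare strings is a Python-side safeguard, not a documented API rule.

```python
import json


def as_list(name, value):
    """Reject a bare string, which list() would split into single characters."""
    if isinstance(value, str):
        raise TypeError(f"{name} must be a list of strings, not a single string")
    return list(value)


def build_path_payload(apikey, colname, rooturls, allowpaths=(".*",),
                       disallowpaths=(), allowformat=("HTML", "text")):
    """Build the JSON body for the updatePath request."""
    return {
        "apikey": apikey,
        "colname": colname,
        "rooturls": as_list("rooturls", rooturls),
        "allowpaths": as_list("allowpaths", allowpaths),
        "disallowpaths": as_list("disallowpaths", disallowpaths),
        # Key spelled as in the sample payload.
        "allowformat": as_list("allowformat", allowformat),
    }


payload = build_path_payload(
    "YOUR_API_KEY",
    "httpcollection",
    rooturls=["http://www.searchblox.com/"],
    disallowpaths=["http://www.searchblox.com/test"],
)
print(json.dumps(payload, indent=2))
```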

Update the Collection Settings

Update the settings of an HTTP collection. These settings control relevance, spider behavior, authentication and proxy access during indexing.

Index URL

http://localhost:8080/searchblox/rest/collection/updateSettings

Method

POST

Media Type

application/json

Headers

content-type: application/json
accept: application/json

Document Syntax

{
    "apikey": "61282E82E5D6D8D409EFC87E8415CDAA",
    "colname": "httpcollection",
    "keyword-in-context": "false",
    "remove-duplicates": "false",
    "boost": "100",
    "stemming": "false",
    "spelling": "true",
    "logging": "true",
    "html-settings": {
        "description": "meta",
        "max-doc-age": "-1",
        "max-doc-size": "-1",
        "spider-max-depth": "6",
        "spider-max-delay": "1",
        "user-agent": "SearchBlox",
        "referer": "Google",
        "ignore-robots": "false",
        "follow-sitemap": "false",
        "follow-redirect": "true"
    },
    "basic-auth-settings": {
        "username": "searchblox",
        "password": "testing"
    },
    "form-auth-settings": {
        "form-url": "http://www.google.co.in",
        "form-action": "post",
        "form": [{
            "name": "httpcollection",
            "value": "google"
        }, {
            "name": "httpcollection1",
            "value": "searchblox"
        }]
    },
    "proxy-settings": {
        "server-url": "http://searchblox.com/proxy",
        "username": "proxy",
        "password": "adasd"
    }
}

Document Description

| JSON Fields | Attributes | Value |
| --- | --- | --- |
| apikey | | API key accessible in the SearchBlox Admin Console. It is also present in the config.xml file. |
| colname | | Name of the collection. |
| keyword-in-context | | Set to Yes or No to enable or disable keyword-in-context display. When enabled, search results show a description taken from the content areas where the search term occurs. |
| remove-duplicates | | Set to Yes to remove duplicate documents while indexing, or No to allow them. |
| boost | | Boosts search terms for the collection when set to a value greater than 1 (maximum value 9999). |
| stemming | | Set to Yes or No to enable or disable stemming. When stemming is enabled, inflected words are reduced to their root form. For example, "running", "runs" and "ran" are inflected forms of "run". |
| logging | | Set to Yes or No to enable or disable logging. |
| html-settings | description | Configures which part of the HTML the parser reads as the description of a document. You can specify any one of the following: Description, h1, h2, h3, h4, h5, h6. |
| | max-doc-age | Maximum allowable age, in days, of a document in the collection. A value of -1 means no maximum age. |
| | max-doc-size | Maximum allowable size, in kilobytes, of a document in the collection. A value of -1 means no maximum size. |
| | spider-max-depth | Maximum depth to which the spider may proceed when indexing documents. The value can be from 1 to 10. |
| | spider-max-delay | Wait time, in milliseconds, between the spider's HTTP requests to a web server. A value of 0 means no delay. |
| | user-agent | Name under which the spider requests documents from a web server. |
| | referer | URL sent in the request headers to indicate where the user agent previously visited. |
| | ignore-robots | Set to Yes to have the spider ignore robots rules, or No to obey them. |
| | follow-sitemap | Set to Yes to index only the URLs listed in sitemaps, or No to index all URLs. |
| | follow-redirect | Set to Yes or No to have the spider follow redirects automatically or not. |
| basic-auth-settings | username | Username for HTTP Basic authentication. The basic-auth-settings are used to index content secured by HTTP Basic authentication. |
| | password | Password for HTTP Basic authentication. |
| form-auth-settings | form-url | ACTION URL of the authentication HTML form. The form-auth-settings are used to index content protected by form-based authentication. |
| | form-action | Specifies whether the form action is a POST or a GET. |
| | form (name, value) | The set of name/value pairs that the form requires, such as the username and password used for authentication. Example: Web User = myself, Password = abc123, Login = true. |
| proxy-settings | server-url | URL used to access the proxy server. The proxy-settings are used to index content through proxy servers. |
| | username | Username, when the proxy server requires authentication. |
| | password | Password for the proxy server. |
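
To show how the documented ranges could be checked before a request is sent, here is a small Python sketch that builds part of the settings payload. The function names and the `YOUR_API_KEY` placeholder are illustrative. Three things are assumptions: flags are rendered as "true"/"false" strings as in the sample payload (the field descriptions speak of Yes/No, so check which form your version expects), a boost of 1 is accepted as the lower bound, and the endpoint accepts a payload carrying only a subset of the settings. If it does not, add the remaining fields from the sample.

```python
import json


def flag(value):
    """Render a bool as "true"/"false", the form used in the sample payload."""
    return "true" if value else "false"


def build_settings_payload(apikey, colname, *, boost=100, stemming=False,
                           remove_duplicates=False, max_depth=6, delay=1):
    """Build an updateSettings body containing a few of the described fields."""
    if not 1 <= boost <= 9999:
        raise ValueError("boost must be between 1 and 9999")
    if not 1 <= max_depth <= 10:
        raise ValueError("spider-max-depth must be between 1 and 10")
    if delay < 0:
        raise ValueError("spider-max-delay cannot be negative")
    # Numbers are sent as strings, matching the sample payload.
    return {
        "apikey": apikey,
        "colname": colname,
        "remove-duplicates": flag(remove_duplicates),
        "boost": str(boost),
        "stemming": flag(stemming),
        "html-settings": {
            "spider-max-depth": str(max_depth),
            "spider-max-delay": str(delay),
        },
    }


payload = build_settings_payload(
    "YOUR_API_KEY", "httpcollection", stemming=True, max_depth=3
)
print(json.dumps(payload, indent=2))
```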

Update the Scheduler Settings

Index URL

http://localhost:8080/searchblox/rest/collection/updateScheduler

Method

POST

Media Type

application/json

Headers

content-type: application/json
accept: application/json

Document Syntax

{
    "apikey": "61282E82E5D6D8D409EFC87E8415CDAA",
    "colname": "httpcollection",
    "index": {
        "frequency": "ONCE",
        "timestamp": "21-01-2016 19:05:00"
    },
    "clear": {
        "frequency": "MINUTELY",
        "timestamp": "21-01-2016 18:05:00"
    },
    "refresh": {
        "frequency": "WEEKLY",
        "timestamp": "25-01-2016 20:05:00"
    }
}

Document Description

| JSON Fields | Attribute | Value |
| --- | --- | --- |
| apikey | | API key accessible in the SearchBlox Admin Console. It is also present in the config.xml file. |
| colname | | Name of the collection. |
| index | frequency | Frequency of indexing web documents. The value can be ONCE, DAILY, MINUTELY, WEEKLY or MONTHLY. |
| | timestamp | Time at which indexing has to start. Example: 21-01-2016 19:05:00. |
| clear | frequency | Frequency of clearing indexed documents. The value can be ONCE, DAILY, MINUTELY, WEEKLY or MONTHLY. |
| | timestamp | Time at which clearing has to occur. Example: 21-01-2016 19:05:00. |
| refresh | frequency | Frequency of refreshing web documents. The value can be ONCE, DAILY, MINUTELY, WEEKLY or MONTHLY. |
| | timestamp | Time at which refreshing has to start. Example: 21-01-2016 19:05:00. |
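
The timestamp examples suggest a day-month-year format with a 24-hour time. The following Python sketch formats timestamps that way and validates the frequency value. The format string is inferred from the example "21-01-2016 19:05:00" rather than stated in this section, the helper names are hypothetical, and leaving out one of the three schedule entries is an assumption about what the endpoint accepts.

```python
import json
from datetime import datetime

# Frequency values listed in the field descriptions above.
FREQUENCIES = {"ONCE", "DAILY", "MINUTELY", "WEEKLY", "MONTHLY"}

# Inferred from the example "21-01-2016 19:05:00" (day-month-year, 24-hour clock).
TIMESTAMP_FORMAT = "%d-%m-%Y %H:%M:%S"


def schedule_entry(frequency, when):
    """Return one schedule object with a validated frequency."""
    if frequency not in FREQUENCIES:
        raise ValueError(f"frequency must be one of {sorted(FREQUENCIES)}")
    return {"frequency": frequency, "timestamp": when.strftime(TIMESTAMP_FORMAT)}


def build_scheduler_payload(apikey, colname, index=None, clear=None, refresh=None):
    """Build the updateScheduler body, including only the schedules supplied."""
    payload = {"apikey": apikey, "colname": colname}
    for name, entry in (("index", index), ("clear", clear), ("refresh", refresh)):
        if entry is not None:
            payload[name] = entry
    return payload


payload = build_scheduler_payload(
    "YOUR_API_KEY",
    "httpcollection",
    index=schedule_entry("ONCE", datetime(2016, 1, 21, 19, 5, 0)),
    refresh=schedule_entry("WEEKLY", datetime(2016, 1, 25, 20, 5, 0)),
)
print(json.dumps(payload, indent=2))
```

Building timestamps from `datetime` objects rather than hand-written strings rules out values such as an hour above 23.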

Index, Refresh or Stop Indexing the Collection

Index URL

http://localhost:8080/searchblox/rest/collection/actions

Method

POST

Media Type

application/json

Headers

content-type: application/json
accept: application/json

Document Syntax

Indexing collection

{
    "apikey": "61282E82E5D6D8D409EFC87E8415CDAA",
    "colname": "httpcollection",
    "action":"index"
}

Refreshing Collection

{
    "apikey": "61282E82E5D6D8D409EFC87E8415CDAA",
    "colname": "httpcollection",
    "action": "refresh"
}

Stop Indexing Collection

{
    "apikey": "61282E82E5D6D8D409EFC87E8415CDAA",
    "colname": "httpcollection",
    "action":"stop"
}

Document Description

| JSON Fields | Value |
| --- | --- |
| apikey | API key accessible in the SearchBlox Admin Console. It is also present in the config.xml file. |
| colname | Name of the collection. |
| action | Type of action to be performed. Specify index to start indexing the collection, refresh to start refreshing the collection, or stop to stop indexing while the indexing process is running. |
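
The three actions differ only in the `action` value, so a single helper can cover them. This Python sketch is illustrative: `build_action_request` and `YOUR_API_KEY` are made-up names, while the URL, headers and action values come from this section. The requests are built but not sent.

```python
import json
import urllib.request

# Endpoint as given in this section; adjust host and port for your install.
ACTIONS_URL = "http://localhost:8080/searchblox/rest/collection/actions"

# Action values shown in the sample payloads above.
ACTIONS = ("index", "refresh", "stop")


def build_action_request(apikey, colname, action):
    """Return an unsent POST request for an index, refresh or stop action."""
    if action not in ACTIONS:
        raise ValueError(f"action must be one of {ACTIONS}")
    body = json.dumps({"apikey": apikey, "colname": colname, "action": action})
    return urllib.request.Request(
        ACTIONS_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


start = build_action_request("YOUR_API_KEY", "httpcollection", "index")
stop = build_action_request("YOUR_API_KEY", "httpcollection", "stop")
# To send either one: urllib.request.urlopen(start).read()
```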

Response Code Description

| Response Code | Description |
| --- | --- |
| 5000 | Collection Created Successfully |
| 50001 | Collection Exists / Collection Type Not Found |
| 50002 | Invalid JSON |
| 50003 | Collection Deleted Successfully |
| 50005 | Collection Path Saved Successfully |
| 50006 | Specified collection is not a CUSTOM collection |
| 50007 | Invalid Request / Collection Not Found |
| 50008 | Collection Indexing / Collection Indexing has been stopped / Collection schedule saved successfully |
| 50009 | Invalid Request / Collection Not Found |
| 601 | API Key Not Valid |
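
For client code, the codes above could be kept in a lookup table. The Python sketch below assumes the numeric code has already been extracted from the response, since the response body format is not described here. Grouping codes into successes is an assumption based only on the wording of the messages, and the name `describe` is hypothetical.

```python
# Codes and messages copied from the response code list above.
RESPONSE_MESSAGES = {
    5000: "Collection Created Successfully",
    50001: "Collection Exists/Collection Type Not Found",
    50002: "Invalid JSON",
    50003: "Collection Deleted Successfully",
    50005: "Collection Path Saved Successfully",
    50006: "Specified collection is not a CUSTOM collection",
    50007: "Invalid Request/Collection Not Found",
    50008: "Collection Indexing/Collection Indexing has been stopped/"
           "Collection schedule saved successfully",
    50009: "Invalid Request/Collection Not Found",
    601: "API Key Not Valid",
}

# Assumption: treated as success purely because of how the messages read.
SUCCESS_CODES = {5000, 50003, 50005, 50008}


def describe(code):
    """Return a one-line summary for a response code given as int or str."""
    code = int(code)
    message = RESPONSE_MESSAGES.get(code, "Unknown response code")
    status = "ok" if code in SUCCESS_CODES else "error"
    return f"{status}: {code} {message}"


print(describe(50003))
print(describe("601"))
```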
