Introduction

Data on websites is usually heterogeneous in format, which makes it difficult to pull together into a coherent picture of the underlying content. Each reader may interpret the content of a page differently, and the data can be so diverse that traditional data ingestion mechanisms struggle to ingest it.

PreText NLP addresses this by making metadata richer and more insightful, so that searches can be performed from different perspectives. The pipeline draws on otherwise untapped content and uses AI models to enrich “data” into “knowledge”.

Architecture

The SearchBlox PreText NLP endpoint can generate AI-based ML fields such as ml_title, ml_description, ml_topic, ml_sentimentLabel, ml_sentimentScore, ml_entity_org, ml_entity_product, ml_entity_person, ml_entity_loc, and ml_entity_gpe.

Create PreText NLP Endpoint

You can sign up for a free account for the SearchBlox-developed PreText NLP service at: https://pretext.searchblox.com/

You can also find the PreText account signup link on the PreText screen:

Provide the PreText account details as shown in the screenshot below:

PreText login screen:

Once you log in to your free PreText account, you will see the PreText Dashboard screen, where you can click Create Pipeline to create the PreText pipeline and integrate it with the SearchBlox Enterprise Application.

Create SearchBlox PreText Pipeline

After you click Create Pipeline, choose SearchBlox as the provider.

Provide a name for the SearchBlox PreText pipeline and select the tasks that you want to integrate with the SearchBlox collection. By default, all tasks are selected.

Integration with Hugging Face API using PreText Pipeline

You can configure the Hugging Face API within a PreText pipeline and integrate it with the SearchBlox collection. To do this, first choose Hugging Face as the provider.

This configuration requires the Hugging Face API for integration and a unique name that identifies the pipeline in SearchBlox. You can choose the required tasks. By default, all tasks are selected.

Using the Hugging Face pipeline, you can also create custom tasks and use them with the SearchBlox integration. When creating a custom task, provide a unique name, choose a Task Type from the list, and select the Model Name (matching Hugging Face models pop up as you type). Once the custom task is created, make sure you select it on the pipeline screen.

Once you create the pipeline, you can see the pipeline in the PreText Dashboard.

Copy the pipeline endpoint to integrate it with SearchBlox Collection.

📘

NOTE:

The free PreText NLP account has limitations. To upgrade to the premium services, please contact SearchBlox Support.

Configure PreText Pipeline

  1. Create a SearchBlox collection before configuring the PreText pipeline. PreText works with all SearchBlox collection types, and this applies to both SearchBlox and Hugging Face pipeline endpoints.
  1. On the menu bar, click Search AI to see options such as PreText and SmartSuggest, then click PreText to use the pipeline.
  1. Provide the following details to apply the pipeline’s functionality to your collection:
    a. Enter the PreText endpoint that you copied from your existing PreText account.
    b. Select the collection to which you want to apply the AI pipeline.
    c. For the first run, turn off the 'Replace ML Values' option. If you enable this option, SearchBlox automatically maps the ML title and ML description to the Search Title and Search Description.
    d. Add 'Topic Classification' labels if you want to generate the ml_topic meta field.
    e. Enable OCR to allow PreText to process text from image-based documents such as JPG, PNG, and PDF files.

Finally, click the “Create” button to create your PreText pipeline.

🚧

Note:

The OCR feature depends on Tesseract. To install or update Tesseract on your machine, see the OCR reference: https://developer.searchblox.com/v9.2.3/docs/filesystem-collection#ocr-recognition
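
Before enabling OCR, it can help to confirm that Tesseract is actually reachable from the machine that runs SearchBlox. The following is a minimal sketch that only checks the system PATH; the function names are illustrative, and it assumes the executable is called `tesseract`, which may differ if it was installed under another name or location.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable: str) -> str:
    """Return the first line printed by `<executable> --version`."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some older Tesseract releases print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; install it before enabling OCR.")
    else:
        print(f"Found {path}: {tesseract_version(path)}")
```

If the check reports that Tesseract is missing, follow the OCR reference above before turning on the OCR option.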

  1. After you configure the PreText endpoint for the collection, reindex the collection.
  1. When you search in debug mode, the AI-generated fields and their values appear as ml_ fields. To view the AI-based fields generated for the search results, add &debug=true as a URL parameter; the PreText-generated search response is then shown in JSON format.

Below is the screenshot when the PreText setting 'Replace Original Values with ML Values' is set to OFF.

Below is the screenshot when the PreText setting 'Replace Original Values with ML Values' is set to ON. In the debug mode search response, the original document title and description are replaced by the ML Title and ML Description values. The remaining ML fields are displayed as they are.
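
To inspect the ML fields in a debug-mode JSON response without relying on its exact layout, the response can be walked recursively and every key that starts with `ml_` collected. The sketch below is illustrative only: the sample payload and its nesting are hypothetical, not the actual SearchBlox response schema.

```python
import json
from typing import Any, Dict, Iterator, List


def iter_ml_fields(node: Any) -> Iterator[Dict[str, Any]]:
    """Yield one dict of ml_* fields for every JSON object that contains any."""
    if isinstance(node, dict):
        ml_fields = {key: value for key, value in node.items() if key.startswith("ml_")}
        if ml_fields:
            yield ml_fields
        # Keep descending, because results may be nested at any depth.
        for value in node.values():
            yield from iter_ml_fields(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_ml_fields(item)


# Hypothetical response fragment; the real structure may differ.
SAMPLE = json.loads("""
{"results": {"result": [
  {"title": "report_final_v2.pdf",
   "ml_title": "Quarterly Sales Report",
   "ml_sentimentLabel": "positive"},
  {"title": "Untitled"}
]}}
""")

found: List[Dict[str, Any]] = list(iter_ml_fields(SAMPLE))
print(found)
```

Because the walker ignores structure and matches only on the `ml_` prefix, it works whether Replace ML Values is ON or OFF; documents without ML fields are simply skipped.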

PreText Search UI

You can access the PreText search UI at https://localhost:8443/pretext/index.html

Select the PreText UI Template to view ML values:

  1. PreText UI with Replace ML Values ON: The PreText UI shows all documents with AI-generated titles and descriptions, along with ML facets for Topic, Sentiment, and Entities. The Original Title and Original Description fields show the actual document titles and descriptions.

📘

Note

  1. To turn off the display of Original Title and Original Description in the PreText UI, set original_title and original_desc to false in the .../webapps/ROOT/pretext/index.html file.

  2. You can also configure this in facet.js using the configuration properties 'originalTitle' and 'originalDesc'. The index.html settings take priority over the facet.js configuration.

  3. For temporary access, you can use the URL parameters &originalTitle and &originalDesc. Set them to true to display the Original Title and Original Description in the PreText Search UI, or to false to hide them.

  1. PreText UI with Replace ML Values OFF: When you turn OFF the Replace ML Values setting, the PreText UI shows all documents with AI-generated ML Titles and ML Descriptions, along with ML facets for Topic, Sentiment, and Entities. The actual document titles and descriptions are shown so that you can compare them with the ML Titles and ML Descriptions on the PreText Search page.
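
The originalTitle and originalDesc URL parameters from the note above can also be set programmatically when building links to the PreText search UI. This is a minimal sketch using only the Python standard library; the helper name and the `query=invoice` search parameter in the usage line are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PRETEXT_UI = "https://localhost:8443/pretext/index.html"


def with_original_fields(url: str, show_title: bool, show_desc: bool) -> str:
    """Return url with originalTitle and originalDesc set to true or false."""
    parts = urlsplit(url)
    # Preserve any parameters that are already present on the URL.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params["originalTitle"] = "true" if show_title else "false"
    params["originalDesc"] = "true" if show_desc else "false"
    return urlunsplit(parts._replace(query=urlencode(params)))


# "query=invoice" is a hypothetical search parameter used only for this example.
link = with_original_fields(PRETEXT_UI + "?query=invoice", show_title=False, show_desc=True)
print(link)
```

Parsing and re-encoding the query string, rather than concatenating text, avoids duplicated parameters when the link already carries originalTitle or originalDesc.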

PreText Debug Log

You can find pretext.log under the path /webapps/ROOT/logs/

PreText Basic Log:
The basic log information in pretext.log gives the PreText pipeline details along with each PreText request and its status code. Below is the reference screenshot:

PreText Debug Mode Log:
When configuring the PreText endpoint, append &debug=true to your PreText NLP endpoint. This gives a detailed processing response for each PreText request, as shown in the screenshot below:

ERROR CODES

| Status Code | Status Message |
| --- | --- |
| 400 | Error parsing the body. |
| 500 | Timeout Error. |
| 504 | Gateway Timeout Error. |
| 524 | Server Down. |
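
When reading status codes out of pretext.log, a small lookup can translate them into the messages listed above. The sketch below copies the codes and messages from the table; the split into retryable and non-retryable codes is an assumption made for illustration, not documented PreText behavior.

```python
from typing import Dict

# Status codes and messages exactly as listed in the table above.
PRETEXT_ERRORS: Dict[int, str] = {
    400: "Error parsing the body.",
    500: "Timeout Error.",
    504: "Gateway Timeout Error.",
    524: "Server Down.",
}

# Assumption: timeouts and outages are transient, while a malformed body is not.
RETRYABLE = frozenset({500, 504, 524})


def describe_status(code: int) -> str:
    """Return a one-line description of a PreText status code."""
    message = PRETEXT_ERRORS.get(code)
    if message is None:
        return f"{code}: not a documented PreText error code"
    hint = "retry later" if code in RETRYABLE else "fix the request before retrying"
    return f"{code}: {message} ({hint})"


for code in (400, 504, 200):
    print(describe_status(code))
```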

Conclusion

SearchBlox uses AI models to generate ML fields such as ml_title, ml_description, ml_topic, ml_sentimentLabel, ml_sentimentScore, ml_entity_org, ml_entity_product, ml_entity_person, ml_entity_loc, and ml_entity_gpe for your collection documents, allowing you to uncover hidden insights and make your content searchable with minimal effort.

All ML values are generated from the document content. The SearchBlox AI model decides the ML field values.

| ML Field | ML Field Description |
| --- | --- |
| ml_title | AI-generated title for your document. This helps to avoid duplicate titles, filenames used as titles, or inaccurate titles in your original documents. |
| ml_description | AI-generated description for your document, which improves search and makes your documents easier for users to find with popular searches. |
| ml_topic | Custom labels assigned to your document based on its content. Topics can be configured, and this field can be further used to filter the search results. |
| ml_sentimentLabel | The sentiment label reflects the opinion expressed in the document, based on the words or sentences in its content. Sentiment is the subjective information in an expression, that is, the opinions, appraisals, emotions, or attitudes towards a topic, person, or entity. Expressions can be classified as either positive or negative. For example, “I really like the new design of your website!” gives a positive impression. This field can be further used to filter the search results. |
| ml_sentimentScore | Sentiment score produced by algorithms that assess the tone of the text on a spectrum from positive to negative. The sentiment label is decided based on this score. |
| ml_entity_org | If the SearchBlox AI model finds any organization name, it is taken as an 'ml_entity_org' value for that document. This field can be further used to filter the search results. |
| ml_entity_product | If the SearchBlox AI model finds any product information, it is taken as an 'ml_entity_product' value for that document. This field can be further used to filter the search results. |
| ml_entity_person | If the SearchBlox AI model finds any person name, it is taken as an 'ml_entity_person' value for that document. This field can be further used to filter the search results. |
| ml_entity_loc | If the SearchBlox AI model finds any location information, it is taken as an 'ml_entity_loc' value for that document. This field can be further used to filter the search results. |
| ml_entity_gpe | If the SearchBlox AI model finds any geopolitical entity (GPE) information, it is taken as an 'ml_entity_gpe' value for that document. This field can be further used to filter the search results. |
| ml_time_taken | Total time taken by the PreText service to process the document and generate the ML fields. |
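
As an illustration of how the ML fields can narrow down results, the sketch below filters a list of already-retrieved documents on the client side. The sample documents, their values, and the assumption that entity fields arrive as lists are all hypothetical; in SearchBlox itself, filtering would normally happen in the search request rather than in client code.

```python
from typing import Any, Dict, Iterable, List


def filter_by_ml_field(
    docs: Iterable[Dict[str, Any]], field: str, value: str
) -> List[Dict[str, Any]]:
    """Keep documents whose ML field equals value or, for list fields, contains it."""
    matches = []
    for doc in docs:
        actual = doc.get(field)
        if isinstance(actual, list):
            if value in actual:
                matches.append(doc)
        elif actual == value:
            matches.append(doc)
    return matches


# Hypothetical documents carrying a few of the ML fields described above.
DOCS = [
    {"ml_title": "Product Launch Recap", "ml_sentimentLabel": "positive",
     "ml_entity_org": ["SearchBlox"]},
    {"ml_title": "Outage Postmortem", "ml_sentimentLabel": "negative",
     "ml_entity_org": ["SearchBlox", "Hugging Face"]},
    {"ml_title": "Office Directions"},
]

positive = filter_by_ml_field(DOCS, "ml_sentimentLabel", "positive")
print([doc["ml_title"] for doc in positive])
```

Documents that lack the requested field, such as the third sample document, never match, so missing ML values do not raise errors.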