Introduction

Data on websites is usually heterogeneous in format, which makes it hard to pull together into a coherent picture of the underlying content. Different readers may also interpret the content of a page differently. Sometimes the data is so diverse that traditional data ingestion mechanisms struggle to ingest it.

PreText NLP addresses this by making metadata richer and more insightful, so that content can be searched from different perspectives. The pipeline draws on otherwise untapped resources and uses AI models to enrich “data” into “knowledge”.

Architecture

The SearchBlox PreText NLP endpoint can generate AI-based ML fields such as ml_title, ml_description, ml_topic, ml_sentimentLabel, ml_sentimentScore, ml_entity_org, ml_entity_product, ml_entity_person, ml_entity_loc, ml_entity_gpe.

Create PreText NLP Endpoint

You can sign up for a FREE account for the SearchBlox-developed PreText NLP using the link: https://pretext.searchblox.com/

You can also find the PreText account signup link on the PreText screen:

Provide the PreText account details as shown in the screenshot below:

PreText login screen:

Once you log in to your FREE PreText account, you will see the PreText Dashboard screen, from which you can create a PreText pipeline (Create Pipeline) to integrate with the SearchBlox Enterprise Application.

Create SearchBlox PreText Pipeline

After you click Create Pipeline, choose SearchBlox as the provider.

Provide the SearchBlox PreText Pipeline name and select the tasks that you want to integrate with the SearchBlox collection. By default, all the tasks are selected.

Integration with Hugging Face API using PreText Pipeline

You can configure the Hugging Face API within a PreText Pipeline and integrate it with a SearchBlox collection. To do this, first choose Hugging Face as the provider.

This configuration requires the Hugging Face API for integration and a unique name for SearchBlox identification. You can choose the required tasks. By default, all the tasks are selected.

Using the Hugging Face pipeline, you can also create custom tasks and use them with the SearchBlox integration. To create a custom task, provide a unique name, choose a Task Type from the list, and select the Model Name (matching Hugging Face models appear as you type). Make sure you select the custom task in the pipeline screen once it is created.

Once you create the pipeline, you can see the pipeline in the PreText Dashboard.

Copy the pipeline endpoint to integrate it with SearchBlox Collection.

📘

NOTE:

The PreText NLP FREE account has limitations. To upgrade to the premium services, please contact SearchBlox Support.

Configure PreText Pipeline

  1. Create a SearchBlox collection before configuring the PreText pipeline. PreText works with all SearchBlox collection types, and this applies to both SearchBlox and Hugging Face pipeline endpoints.
  1. On the menu bar, click Search AI to see options such as PreText and SmartSuggest, then click PreText to use the pipeline.
  1. Provide the following details to apply the pipeline’s functionality to your collection:
    a. Enter the PreText endpoint that was copied from your existing PreText account.
    b. Select the collection to which you want to apply the AI pipeline.
    c. For the first run, turn off the 'Replace ML Values' button. If you enable this option, SearchBlox will map the ML title and ML description to Search Title and Search Description automatically.
    d. Add 'Topic Classification' labels if you want to generate the ml_topic meta field.
    e. Enable OCR to allow PreText to process the text from image-like documents such as JPG, PNG, PDF etc.

Finally, click the “Create” button to create your PreText pipeline.
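The settings entered on this screen can be pictured as a small record. The following Python sketch is only an illustration of the form fields described above: the class name, attribute names, and placeholder endpoint are hypothetical and are not part of any SearchBlox API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PreTextPipelineSettings:
    """Hypothetical record mirroring the PreText form fields (not a SearchBlox API)."""
    endpoint: str                     # PreText endpoint copied from the PreText Dashboard
    collection: str                   # SearchBlox collection the pipeline is applied to
    replace_ml_values: bool = False   # recommended OFF for the first run
    topic_labels: List[str] = field(default_factory=list)  # needed for ml_topic
    enable_ocr: bool = False          # OCR depends on Tesseract being installed

    def notes(self) -> List[str]:
        """Return reminders about settings that change what gets indexed."""
        reminders = []
        if not self.endpoint.startswith(("http://", "https://")):
            reminders.append("endpoint does not look like a URL")
        if not self.topic_labels:
            reminders.append("no topic labels: ml_topic will not be generated")
        if self.replace_ml_values:
            reminders.append("search title and description will use ML values")
        return reminders


settings = PreTextPipelineSettings(
    endpoint="https://example.invalid/pipeline/abc",  # placeholder, not a real endpoint
    collection="docs",
)
print(settings.notes())
```

Keeping the reminders next to the settings makes it easier to review the choices before reindexing, since some of them only show their effect after the collection is indexed again.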

🚧

Note:

The OCR feature depends on the Tesseract software. See the OCR reference at https://developer.searchblox.com/v9.2.3/docs/filesystem-collection#ocr-recognition to install or update Tesseract on your machine.

  1. Start reindexing your collection after you configure the PreText endpoint for the collection.
  1. When you search in debug mode, you will see the AI-generated values as ml_ fields. To view the AI-based fields generated for the search results, add &debug=true as a URL parameter; the PreText-generated search response is returned in JSON format.
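As a rough illustration of the debug step, the Python sketch below builds a search URL with debug=true and picks the ml_ fields out of a single result. The search path, the query parameter name, and the shape of the sample result are assumptions made for this example; check them against your own installation and its JSON response.

```python
from urllib.parse import urlencode

# Assumption: adjust this to the search URL of your SearchBlox installation.
SEARCH_URL = "https://localhost:8443/rest/v2/api/search"


def debug_search_url(base_url: str, query: str) -> str:
    """Build a search URL with debug=true added as a URL parameter."""
    params = {"query": query, "debug": "true"}  # "query" is an assumed parameter name
    return f"{base_url}?{urlencode(params)}"


def ml_fields(result: dict) -> dict:
    """Keep only the keys of one search result that start with 'ml_'."""
    return {key: value for key, value in result.items() if key.startswith("ml_")}


# Hypothetical result, shaped only for illustration.
sample_result = {
    "title": "report.pdf",
    "ml_title": "Quarterly Sales Report",
    "ml_sentimentLabel": "positive",
}

print(debug_search_url(SEARCH_URL, "annual report"))
print(ml_fields(sample_result))
```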

Below is the screenshot when the PreText setting 'Replace Original Values with ML Values' is set to OFF.

Below is the screenshot when the PreText setting 'Replace Original Values with ML Values' is set to ON. You can see that the original document title and description fields are replaced by the ML Title and ML Description values in the debug mode search response. The remaining ML fields are displayed as they are.
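The effect of the setting can be sketched in a few lines of Python. This is not SearchBlox's implementation; it only models the behavior described here, and the plain field names title and description are assumptions.

```python
def apply_replace_ml_values(doc: dict, replace: bool) -> dict:
    """Return a copy of doc as it might appear in results for the given setting."""
    shown = dict(doc)
    if replace:
        # Only title and description are swapped; other ml_ fields stay as they are.
        for target, source in (("title", "ml_title"), ("description", "ml_description")):
            if source in doc:
                shown[target] = doc[source]
    return shown


# Hypothetical indexed document, for illustration only.
doc = {
    "title": "scan_0042.pdf",
    "description": "",
    "ml_title": "Invoice for Office Supplies",
    "ml_description": "An invoice listing office supplies.",
    "ml_topic": "finance",
}

print(apply_replace_ml_values(doc, replace=False)["title"])
print(apply_replace_ml_values(doc, replace=True)["title"])
```

With the setting OFF the document is returned unchanged, which is why the first run is a convenient way to compare original and ML values side by side.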

PreText Search UI

You can access PreText search UI using https://localhost:8443/pretext/index.html

Select the PreText UI Template to view ML values:

  1. PreText UI with Replace ML Values ON: PreText UI shows all documents with AI-generated titles and descriptions, along with ML facets related to Topic, Sentiment and Entities. The Original Title and Original Description fields show the actual document titles and descriptions.

📘

Note

  1. We can turn off PreText UI display for Original Title and Original Description by setting the values of original_title and original_desc to false in .../webapps/ROOT/pretext/index.html file.

  2. We can also configure this through facet.js using the configuration properties 'originalTitle' and 'originalDesc'. The index.html settings take priority over the facet.js configuration.

  3. We can also use the URL parameters &originalDesc and &originalTitle for temporary access. Set the values to true to enable the Original Title and Original Description display. The value false turns off the Original Title and Original Description display in the PreText Search UI.

  1. PreText UI with Replace ML Values OFF: PreText UI shows all documents with AI-generated ML Titles and ML Descriptions, along with ML facets related to Topic, Sentiment and Entities, when you turn OFF the ML Replace setting. You can see the actual document titles and descriptions and compare them with the ML Titles and ML Descriptions on the PreText Search Page.
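The temporary URL parameters mentioned in the note can be attached to the PreText UI address as in the Python sketch below. The helper name is hypothetical; the parameter names originalTitle and originalDesc and the UI address come from this page.

```python
from urllib.parse import urlencode

PRETEXT_UI = "https://localhost:8443/pretext/index.html"


def pretext_ui_url(original_title: bool, original_desc: bool) -> str:
    """Build a PreText UI URL that toggles the original title and description display."""
    params = {
        "originalTitle": str(original_title).lower(),  # "true" or "false"
        "originalDesc": str(original_desc).lower(),
    }
    return f"{PRETEXT_UI}?{urlencode(params)}"


# Hide the original title but keep the original description for this visit.
print(pretext_ui_url(original_title=False, original_desc=True))
```

Because these parameters only affect the current URL, they are suited to a quick comparison; a lasting change belongs in index.html or facet.js as described in the note.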

PreText Debug Log

You can find pretext.log under the path /webapps/ROOT/logs/

PreText Basic Log:
The basic log information in pretext.log gives the PreText pipeline details along with each PreText request and its status code. Below is the reference screenshot:

PreText Debug Mode Log:
While configuring the PreText endpoint, append &debug=true to your PreText NLP endpoint. This gives a detailed processing response for each PreText request, as shown in the screenshot below:
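Appending the parameter by hand is easy to get wrong when the endpoint may or may not already carry a query string. A minimal Python sketch, assuming the endpoint is an ordinary URL, is shown below; the example endpoints are placeholders rather than real PreText endpoints.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_debug(endpoint: str) -> str:
    """Return the endpoint with debug=true in its query string, added only once."""
    parts = urlsplit(endpoint)
    # Drop any existing debug parameter so the result stays the same on repeat calls.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "debug"]
    query.append(("debug", "true"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_debug("https://example.invalid/pipeline?key=abc"))
print(with_debug("https://example.invalid/pipeline"))
```

Note that the query string is parsed and re-encoded, so unusual characters in existing parameters may come back in an equivalent but differently escaped form.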

ERROR CODES

Status Code | Status Message
400 | Error parsing the body.
500 | Timeout Error.
504 | Gateway Timeout Error.
524 | Server Down.
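When reading pretext.log or handling responses in a script, the error codes above can be kept in a small lookup. The Python sketch below copies the codes and messages from this page; the split into retryable and non-retryable codes is an assumption for illustration, not documented PreText behavior.

```python
PRETEXT_STATUS_MESSAGES = {
    400: "Error parsing the body.",
    500: "Timeout Error.",
    504: "Gateway Timeout Error.",
    524: "Server Down.",
}

# Assumption: timeouts and outages may succeed later, a malformed body will not.
RETRYABLE = {500, 504, 524}


def describe_status(code: int) -> str:
    """Return a one-line description for a PreText status code."""
    if code not in PRETEXT_STATUS_MESSAGES:
        return f"{code}: not listed among the PreText error codes"
    hint = "retry later" if code in RETRYABLE else "check the request"
    return f"{code}: {PRETEXT_STATUS_MESSAGES[code]} ({hint})"


for status in (400, 504, 200):
    print(describe_status(status))
```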

Conclusion

SearchBlox provides ML fields such as ml_title, ml_description, ml_topic, ml_sentimentLabel, ml_sentimentScore, ml_entity_org, ml_entity_product, ml_entity_person, ml_entity_loc, ml_entity_gpe for your collection documents using AI models, allowing you to explore hidden insights and make your content searchable with the least effort.

All ML values are generated based on the document content. The SearchBlox AI model decides the ML field values.

ML Field | ML Field Description
ml_title | AI-generated title for your document. This helps to avoid duplicate titles, filenames used as titles, or inaccurate original document titles.
ml_description | AI-generated description for your document, which improves search and makes your documents findable by users through popular searches.
ml_topic | Custom labels assigned to your document based on its content. Topics can be configured and used to filter the search results.
ml_sentimentLabel | The sentiment label captures the opinion expressed in the document based on the words or sentences in its content, that is, subjective information such as opinions, appraisals, emotions, or attitudes towards a topic, person or entity. Expressions can be classified as either positive or negative. For example, “I really like the new design of your website!” gives a positive impression. This field can be further used to filter the search results.
ml_sentimentScore | A score produced by algorithms that assess the tone of the text on a spectrum from positive to negative. The sentiment label is decided based on this score.
ml_entity_org | If the SearchBlox AI Model finds any Organization name, it is taken as an 'ml_entity_org' value for that document. This field can be further used to filter the search results.
ml_entity_product | If the SearchBlox AI Model finds any Product information, it is taken as an 'ml_entity_product' value for that document. This field can be further used to filter the search results.
ml_entity_person | If the SearchBlox AI Model finds any Person name, it is taken as an 'ml_entity_person' value for that document. This field can be further used to filter the search results.
ml_entity_loc | If the SearchBlox AI Model finds any Location information, it is taken as an 'ml_entity_loc' value for that document. This field can be further used to filter the search results.
ml_entity_gpe | If the SearchBlox AI Model finds any Geo-Political Entity information, it is taken as an 'ml_entity_gpe' value for that document. This field can be further used to filter the search results.
ml_time_taken | Total time taken by the PreText service to process the document and generate the ML fields.
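To make the idea of filtering on ML fields concrete, the Python sketch below filters a list of already retrieved results on the client side. It is an illustration only: SearchBlox applies filters within search itself, and the result shape, including entity fields holding lists of values, is an assumption.

```python
from typing import Iterable, List


def filter_by_ml_field(results: Iterable[dict], field: str, value: str) -> List[dict]:
    """Keep results whose ML field equals value or, for list-valued fields, contains it."""
    kept = []
    for doc in results:
        current = doc.get(field)
        if isinstance(current, list):
            if value in current:
                kept.append(doc)
        elif current == value:
            kept.append(doc)
    return kept


# Hypothetical results, shaped only for illustration.
results = [
    {"ml_title": "Product Launch Recap", "ml_sentimentLabel": "positive", "ml_entity_org": ["SearchBlox"]},
    {"ml_title": "Outage Report", "ml_sentimentLabel": "negative", "ml_entity_org": []},
]

positive = filter_by_ml_field(results, "ml_sentimentLabel", "positive")
print([doc["ml_title"] for doc in positive])
```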

