PreText
Introduction
Most website content is heterogeneous in format, which makes it hard to pull together into a coherent picture of what the content is about. Different readers may also interpret the same page differently. In some cases the data is so diverse that traditional data ingestion mechanisms struggle to ingest it.
PreText NLP addresses this by making document metadata richer and more insightful, so that searches can be performed from different perspectives. The pipeline uses AI models to enrich otherwise untapped content, turning “data” into “knowledge”.
Architecture
The SearchBlox PreText NLP endpoint can generate AI-based ML fields such as ml_title, ml_description, ml_topic, ml_sentimentLabel, ml_sentimentScore, ml_entity_org, ml_entity_product, ml_entity_person, ml_entity_loc, and ml_entity_gpe.
Create PreText NLP Endpoint
You can sign up for a FREE account for the SearchBlox-developed PreText NLP using the link: https://pretext.searchblox.com/
You can also find the PreText account signup link on the PreText screen:
Provide the PreText account details as shown in the screenshot below:
Once you click the Create an account now button, an activation email is sent. Open the email and click the activation link. After successful activation, the Login Page is displayed.
Important:
If account activation is not successful, you will still be able to log in, but endpoint creation will fail.
PreText login screen:
Once you log in to your FREE PreText account, you will see the PreText Dashboard screen. From there you can create the PreText pipeline to integrate with the SearchBlox Enterprise Application by clicking Create Pipeline.
Create SearchBlox PreText Pipeline
Choose the provider as SearchBlox once you click on Create Pipeline.
Provide a name for the SearchBlox PreText pipeline and select the tasks that you want to integrate with your SearchBlox collection. (By default, all tasks are selected.)
Integration with Hugging Face API using PreText Pipeline
You can configure the Hugging Face API within a PreText pipeline and integrate it with a SearchBlox collection. To do this, first choose the provider as Hugging Face.
This configuration requires the Hugging Face API for integration and a unique name for SearchBlox identification. You can choose the required tasks. By default, all tasks are selected.
With the Hugging Face pipeline, you can also create custom tasks and use them in the SearchBlox integration. When creating a custom task, provide a unique name, choose a Task Type from the list, and select the Model Name (matching Hugging Face models appear as you type). Once the custom task is created, make sure you select it on the pipeline screen.
Once you create the pipeline, you can see the pipeline in the PreText Dashboard.
Copy the pipeline endpoint to integrate it with SearchBlox Collection.
NOTE:
The PreText NLP FREE account has limitations. To upgrade to the premium services, please contact SearchBlox Support.
Configure PreText Pipeline
- Create a SearchBlox collection before configuring the PreText pipeline. PreText works with all SearchBlox collection types, and this applies to both SearchBlox and Hugging Face pipeline endpoints.
- On the menu bar, click Search AI to see options such as PreText and SmartSuggest, then click PreText to use the pipeline.
- Provide the following details to apply the pipeline’s functionality to your collection:
a. PreText endpoint that has been copied from your existing PreText account
b. Select the collection for which you need to apply the AI pipeline
c. The first time, turn off the 'Replace ML Values' button. If you enable this option, SearchBlox will automatically map the ML title and ML description to the Search Title and Search Description.
e. Add 'Topic Classification' labels if you want to generate the ml_topic meta field.
f. Enable OCR to allow PreText to process text from image-like documents such as JPG, PNG, and PDF files.
Finally, click the “Create” button to create your PreText pipeline.
Note:
The OCR feature depends on Tesseract. To install or update Tesseract on your machine, see the OCR reference: https://developer.searchblox.com/v9.2.3/docs/filesystem-collection#ocr-recognition
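Before enabling OCR, it can help to confirm that Tesseract is installed on the machine. The following is a minimal sketch that looks for a `tesseract` executable on the system PATH and prints its version banner. It assumes that a PATH lookup is a reasonable proxy for what SearchBlox needs; how SearchBlox itself locates Tesseract is not covered here, and the helper names are illustrative only.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(path):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the banner to stderr rather than stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; install it before enabling OCR")
    else:
        print(f"found {path}: {tesseract_version(path)}")
```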
- Start reindexing your collection after you configure the PreText endpoint for the collection.
- When you search in debug mode, you will see the AI-generated values, such as the title, as ml_ fields. To view the AI-based fields generated for the search results, add &debug=true as a URL parameter; the PreText-generated search response is then shown in JSON format.
Below is the screenshot when the PreText setting 'Replace Original Values with ML Values' is set to OFF.
Below is the screenshot when the PreText setting 'Replace Original Values with ML Values' is set to ON. The original document title and description are replaced by the ML Title and ML Description values in the debug mode search response. The remaining ML fields are displayed as they are.
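To inspect the ML fields in a debug-mode JSON response programmatically, one option is to walk the parsed JSON and collect every key that starts with `ml_`. The sketch below does this without assuming a particular response layout. The `sample` document is hypothetical and only stands in for a real response, whose structure may differ.

```python
import json


def collect_ml_fields(node, found=None):
    """Recursively gather the values of every key starting with 'ml_'."""
    if found is None:
        found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key.startswith("ml_"):
                found.setdefault(key, []).append(value)
            else:
                collect_ml_fields(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_ml_fields(item, found)
    return found


# Hypothetical response fragment; the real debug response layout may differ.
sample = json.loads("""
{"results": [
  {"title": "report.pdf", "ml_title": "Quarterly Sales Report",
   "ml_sentimentLabel": "positive"},
  {"title": "memo.docx", "ml_title": "Office Relocation Memo",
   "ml_sentimentLabel": "negative"}
]}
""")

for name, values in collect_ml_fields(sample).items():
    print(name, values)
```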
PreText Search UI
You can access PreText search UI using https://localhost:8443/pretext/index.html
Select the PreText UI Template to view ML values:
- PreText UI with Replace ML Values ON: The PreText UI shows all documents with AI-generated titles and descriptions, along with ML facets for Topic, Sentiment and Entities. The Original Title and Original Description shown alongside are the actual document titles and descriptions.
Note
You can turn off the display of Original Title and Original Description in the PreText UI by setting original_title and original_desc to false in the .../webapps/ROOT/pretext/index.html file.
You can also configure this in facet.js using the configuration properties 'originalTitle' and 'originalDesc'. The index.html settings take priority over the facet.js configuration.
For temporary access, you can also use the URL parameters &originalDesc and &originalTitle. Setting a value to true enables the corresponding Original Title or Original Description display, and false turns it off in the PreText Search UI.
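As an illustration of the temporary URL parameters, the sketch below builds a PreText Search UI link with `originalTitle` and `originalDesc` set explicitly. The helper name `with_original_fields` is hypothetical, and the base URL is the local address given earlier on this page; adjust it for your own host.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_original_fields(ui_url, show_title, show_desc):
    """Return ui_url with originalTitle/originalDesc set to 'true' or 'false'."""
    parts = urlsplit(ui_url)
    # Keep any parameters that are already present on the URL.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params["originalTitle"] = "true" if show_title else "false"
    params["originalDesc"] = "true" if show_desc else "false"
    return urlunsplit(parts._replace(query=urlencode(params)))


print(with_original_fields(
    "https://localhost:8443/pretext/index.html", show_title=False, show_desc=False
))
# https://localhost:8443/pretext/index.html?originalTitle=false&originalDesc=false
```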
- PreText UI with Replace ML Values OFF: The PreText UI shows all documents with AI-generated ML Titles and ML Descriptions, along with ML facets for Topic, Sentiment and Entities. The actual document titles and descriptions are also shown, so you can compare them with the ML Titles and ML Descriptions on the PreText Search Page.
PreText Debug Log
You can find the pretext.log under path /webapps/ROOT/logs/
PreText Basic Log:
The basic log in pretext.log records the PreText pipeline info along with each PreText request and its status code. Below is a reference screenshot:
PreText Debug Mode Log:
While configuring the PreText endpoint, append &debug=true to your PreText NLP endpoint. This gives a detailed processing response for each PreText request, as shown in the screenshot below:
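Appending the debug flag by hand is easy to get wrong if the endpoint already carries it. A small helper like the one below can add or remove `debug=true` without duplicating it. This is only a sketch: the endpoint shown is a placeholder on example.com, not a real PreText endpoint, and `with_debug` is an illustrative name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_debug(endpoint, enabled=True):
    """Return the endpoint with debug=true added (or removed when disabled)."""
    parts = urlsplit(endpoint)
    # Drop any existing debug parameter so the flag is never duplicated.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "debug"
    ]
    if enabled:
        params.append(("debug", "true"))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Placeholder endpoint for illustration; copy the real one from the PreText Dashboard.
endpoint = "https://pretext.example.com/pipeline?key=abc123"
print(with_debug(endpoint))
# https://pretext.example.com/pipeline?key=abc123&debug=true
```

Calling `with_debug(endpoint, enabled=False)` returns the endpoint without the flag, which is useful once detailed logging is no longer needed.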
Error Codes
Status Code | Status Message |
---|---|
400 | Error parsing the body. |
500 | Timeout Error. |
504 | Gateway Timeout Error. |
524 | Server Down. |
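When reviewing pretext.log, it can be convenient to translate a status code into its message. The lookup below simply mirrors the table above; the function name `describe_status` is illustrative, and any code not in the table is reported as unlisted rather than guessed.

```python
# Status codes and messages copied from the error code table above.
PRETEXT_STATUS_MESSAGES = {
    400: "Error parsing the body.",
    500: "Timeout Error.",
    504: "Gateway Timeout Error.",
    524: "Server Down.",
}


def describe_status(code):
    """Return a readable description for a PreText status code."""
    message = PRETEXT_STATUS_MESSAGES.get(code)
    if message is None:
        return f"{code}: not listed in the PreText error table"
    return f"{code}: {message}"


for code in (400, 504, 404):
    print(describe_status(code))
```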
Conclusion
SearchBlox uses AI models to provide ML fields such as ml_title, ml_description, ml_topic, ml_sentimentLabel, ml_sentimentScore, ml_entity_org, ml_entity_product, ml_entity_person, ml_entity_loc, and ml_entity_gpe for the documents in your collection, allowing you to uncover hidden insights and make your content searchable with minimal effort.
All ML values are generated from the document content. The SearchBlox AI Model decides the ML field values.
ML Field | ML Field Description |
---|---|
ml_title | AI-generated title for your document. This helps avoid duplicate titles, filenames used as titles, or inaccurate titles in your original documents. |
ml_description | AI-generated description for your document, which improves search and makes your documents easier to find through popular searches. |
ml_topic | Custom labels assigned to your document based on its content. Topics can be configured, and this field can be used to filter the search results. |
ml_sentimentLabel | The sentiment label captures the opinion expressed in the document, based on the words or sentences in its content: the subjective information in an expression, that is, the opinions, appraisals, emotions, or attitudes towards a topic, person or entity. Expressions are classified as either positive or negative. For example, “I really like the new design of your website!” gives a positive impression. This field can be used to filter the search results. |
ml_sentimentScore | A score produced by algorithms that assess the tone of the content on a spectrum from positive to negative. The sentiment label is decided based on this score. |
ml_entity_org | If the SearchBlox AI Model finds any Organization name, it is taken as an 'ml_entity_org' value for that document. This field can be used to filter the search results. |
ml_entity_product | If the SearchBlox AI Model finds any Product information, it is taken as an 'ml_entity_product' value for that document. This field can be used to filter the search results. |
ml_entity_person | If the SearchBlox AI Model finds any Person's Name, it is taken as an 'ml_entity_person' value for that document. This field can be used to filter the search results. |
ml_entity_loc | If the SearchBlox AI Model finds any Location information, it is taken as an 'ml_entity_loc' value for that document. This field can be used to filter the search results. |
ml_entity_gpe | If the SearchBlox AI Model finds any Geo-Political Entity information, it is taken as an 'ml_entity_gpe' value for that document. This field can be used to filter the search results. |
ml_time_taken | Total time taken by the PreText service to process the document and generate the ML fields. |
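To make the idea of filtering on ML fields concrete, the sketch below filters a list of already-retrieved documents on the client side. It is an illustration under stated assumptions only: the documents are hypothetical, entity fields are assumed to hold lists of values, and the way SearchBlox itself applies filters in a search request is not shown here.

```python
def filter_by_field(documents, field, wanted):
    """Keep documents whose field equals wanted or, for list values, contains it."""
    kept = []
    for doc in documents:
        value = doc.get(field)
        if isinstance(value, list):
            if wanted in value:
                kept.append(doc)
        elif value == wanted:
            kept.append(doc)
    return kept


# Hypothetical documents carrying a few of the ML fields described above.
documents = [
    {"ml_title": "Quarterly Sales Report", "ml_sentimentLabel": "positive",
     "ml_entity_org": ["Acme Corp"]},
    {"ml_title": "Office Relocation Memo", "ml_sentimentLabel": "negative",
     "ml_entity_org": []},
    {"ml_title": "Product Launch Notes", "ml_sentimentLabel": "positive",
     "ml_entity_org": ["Acme Corp", "Globex"]},
]

for doc in filter_by_field(documents, "ml_sentimentLabel", "positive"):
    print(doc["ml_title"])
```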