File Formats

SearchBlox automatically detects and extracts text and metadata from 40+ file types during the crawling and indexing process. This means users can search the actual content inside documents, spreadsheets, presentations, emails, and more, not just file names or metadata.

How File Indexing Works

When SearchBlox encounters a supported file, it:

  • Extracts text content from the file body (e.g., paragraphs in a Word doc, cells in a spreadsheet, slides in a presentation)
  • Extracts metadata such as title, author, creation date, and last modified date where available
  • Processes attachments in email and archive formats, indexing the contents of each attachment individually
  • Applies OCR to image-based files (such as scanned PDFs or TIFFs) where OCR is configured, making their text content searchable

Note: Archive formats (.zip, .tar, .tar.bz2, .tar.gz) are automatically extracted, and the individual files inside are indexed based on the supported formats listed below. Nested archives are also supported.
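
To make the idea of nested archive extraction concrete, here is a minimal Python sketch. It is not SearchBlox's implementation: the function names (is_archive, iter_archive_files) and the "!/" path separator are hypothetical, and only the standard-library zipfile and tarfile modules are used. It walks an archive held in memory and yields every non-archive file, descending into archives found inside other archives.

```python
import io
import tarfile
import zipfile

# Archive extensions named in the note above.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.bz2", ".tar.gz")


def is_archive(name: str) -> bool:
    """Return True if the file name ends with one of the archive extensions."""
    return name.lower().endswith(ARCHIVE_SUFFIXES)


def iter_archive_files(name: str, data: bytes, prefix: str = ""):
    """Yield (path, content) for every non-archive file, recursing into nested archives.

    Hypothetical helper for illustration only. Paths use "!/" to show nesting.
    """
    path = f"{prefix}{name}"
    if not is_archive(name):
        # A regular file: this is what would be handed to text extraction.
        yield path, data
        return

    if name.lower().endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                yield from iter_archive_files(info.filename, zf.read(info), f"{path}!/")
    else:
        # Mode "r:*" lets tarfile detect gzip, bzip2, or no compression.
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tf:
            for member in tf.getmembers():
                if not member.isfile():
                    continue
                content = tf.extractfile(member).read()
                yield from iter_archive_files(member.name, content, f"{path}!/")
```

Each yielded file could then be matched against the supported formats by its extension. A production crawler would also need limits on nesting depth and extracted size, which this sketch omits.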

Supported File Formats

The table below lists the file types SearchBlox supports and their extensions:

File Type | Extensions
HTML | .html, .htm, .aspx, .asp
XML | .xml
Excel | .xls, .xlsx, .xlsm, .xlt, .xltx, .xla, .xlam, .xlb, .xlsb, .xll
PowerPoint | .ppt, .pptx, .pptm, .ppsm, .pps, .ppsx, .ppa, .pot, .potx, .potm
RTF | .rtf
Word | .doc, .docx, .docm, .dotx, .dotm, .xps
Text | .txt, .csv
PDF | .pdf
Visio | .vst, .vsd
EPUB | .epub
AutoCAD | .dwg
OpenOffice | .ods, .odt, .odp
iWork | .pages, .numbers, .key
WordPerfect | .wp, .wpd
Images | .jpg, .tiff, .gif, .png, .svg, .psd, .bmp
Audio | .aif, .mp3, .mid, .wav
Video | .mpg, .flv, .mp4
Outlook PST email archive files (32-bit and 64-bit, including attachments) | .pst
Email | .eml, .msg
Archive | .zip, .tar, .tar.bz2, .tar.gz
HTTP Archive files | .har
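
As an illustration of how a client-side script might check a file name against this list before submitting it for indexing, the Python sketch below builds an extension lookup. It is an assumption-laden example, not part of SearchBlox: the names EXTENSIONS_BY_TYPE and file_type_for are hypothetical, and only a few rows of the table are copied in.

```python
from typing import Optional

# A few rows copied from the table above (not the full list).
EXTENSIONS_BY_TYPE = {
    "HTML": (".html", ".htm", ".aspx", ".asp"),
    "PDF": (".pdf",),
    "Text": (".txt", ".csv"),
    "Email": (".eml", ".msg"),
    "Archive": (".zip", ".tar", ".tar.bz2", ".tar.gz"),
}


def file_type_for(filename: str) -> Optional[str]:
    """Return the file type whose extension matches the name, or None if unlisted.

    Hypothetical helper. Uses str.endswith rather than os.path.splitext so that
    multi-part extensions such as ".tar.gz" are matched as a whole.
    """
    name = filename.lower()
    for file_type, extensions in EXTENSIONS_BY_TYPE.items():
        if name.endswith(extensions):
            return file_type
    return None
```

For example, file_type_for("backup.tar.gz") returns "Archive", while a name with an extension that is not in the dictionary returns None. Matching is case-insensitive because the name is lowercased first.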

Notes:

  • Outlook PST files are supported in both 32-bit and 64-bit formats. Attachments within PST archives are also indexed.
  • Archive formats (.zip, .tar, .tar.bz2, .tar.gz) are extracted and their contents indexed based on the file types listed above.
  • Audio and Video files are indexed for metadata (filename, title, tags) rather than full audio/video content transcription.
  • Image-based files that contain text (e.g., scanned documents saved as .tiff or .pdf) can be processed with OCR, where OCR is configured, so that their text becomes searchable.
