File Formats
SearchBlox automatically detects and extracts text and metadata from 40+ file types during the crawling and indexing process. This means users can search the actual content inside documents, spreadsheets, presentations, emails, and more, not just file names or metadata.
How File Indexing Works
When SearchBlox encounters a supported file, it:
- Extracts text content from the file body (e.g., paragraphs in a Word doc, cells in a spreadsheet, slides in a presentation)
- Extracts metadata such as title, author, creation date, and last modified date where available
- Processes attachments in email and archive formats, indexing the contents of each attachment individually
- Applies OCR to image-based files (such as scanned PDFs or TIFFs) where OCR is configured, making their text content searchable
Note: Archive formats (.zip, .tar, .tar.bz2, .tar.gz) are automatically extracted, and the individual files inside are indexed based on the supported formats listed below. Nested archives are also supported.
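The nested-archive behavior described above can be pictured with a short Python sketch. This is not SearchBlox's implementation. It assumes archives are held in memory, uses a small hypothetical `SUPPORTED` set in place of the full format list, and the helper names (`iter_members`, `walk`) and the `!/` path separator are invented for illustration.

```python
import io
import tarfile
import zipfile

# Hypothetical subset of indexable extensions, for illustration only.
SUPPORTED = (".txt", ".csv", ".docx", ".xlsx", ".html")
ARCHIVES = (".zip", ".tar", ".tar.gz", ".tar.bz2")


def iter_members(name, data):
    """Yield (member_name, member_bytes) for each regular file in an archive."""
    if name.lower().endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)
    else:
        # Mode "r:*" lets tarfile detect gzip or bzip2 compression itself.
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tf:
            for member in tf.getmembers():
                if member.isfile():
                    yield member.name, tf.extractfile(member).read()


def walk(name, data, prefix=""):
    """Yield (path, bytes) for indexable files, descending into nested archives."""
    path = prefix + name
    lowered = name.lower()
    if lowered.endswith(ARCHIVES):
        for inner_name, inner_data in iter_members(name, data):
            yield from walk(inner_name, inner_data, prefix=path + "!/")
    elif lowered.endswith(SUPPORTED):
        yield path, data
```

Calling `walk("outer.zip", zip_bytes)` yields one entry per supported file, including files found inside archives that are themselves stored in `outer.zip`. A production crawler would typically also cap recursion depth and total extracted size; that is omitted here for brevity.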
Supported File Formats
SearchBlox supports the following file types and extensions:
| File Type | Extensions |
|---|---|
| HTML | .html, .htm, .aspx, .asp |
| XML | .xml |
| Excel | .xls, .xlsx, .xlsm, .xlt, .xltx, .xla, .xlam, .xlb, .xlsb, .xll |
| PowerPoint | .ppt, .pptx, .pptm, .ppsm, .pps, .ppsx, .ppa, .pot, .potx, .potm |
| RTF | .rtf |
| Word | .doc, .docx, .docm, .dotx, .dotm, .xps |
| Text | .txt, .csv |
| Visio | .vst, .vsd |
| EPUB | .epub |
| AutoCAD | .dwg |
| OpenOffice | .ods, .odt, .odp |
| iWork | .pages, .numbers, .key |
| WordPerfect | .wp, .wpd |
| Images | .jpg, .tiff, .gif, .png, .svg, .psd, .bmp |
| Audio | .aif, .mp3, .mid, .wav |
| Video | .mpg, .flv, .mp4 |
| Outlook PST email archive files (32-bit and 64-bit, including attachments) | .pst |
| Email | .eml, .msg |
| Archive | .zip, .tar, .tar.bz2, .tar.gz |
| HTTP Archive files | .har |
Notes:
- Outlook PST files are supported in both 32-bit and 64-bit formats. Attachments within PST archives are also indexed.
- Archive formats (.zip, .tar, .tar.bz2, .tar.gz) are extracted and their contents indexed based on the file types listed above.
- Audio and Video files are indexed for metadata (filename, title, tags) rather than full audio/video content transcription.
- Image-based files that contain extractable text (e.g., scanned documents saved as .tiff or .pdf) are processed with OCR where it is configured.
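To make the distinctions in these notes concrete, here is a minimal Python sketch that classifies a filename by extension and reports how it would be handled. It is not part of SearchBlox. The `FILE_TYPES` mapping is a hypothetical, partial copy of the table above, and the `classify` function and its handling labels (`full_text`, `metadata`, `extract`) are invented for illustration.

```python
# Hypothetical, partial mapping copied from the table above.
FILE_TYPES = {
    "Word": (".doc", ".docx", ".docm", ".dotx", ".dotm", ".xps"),
    "Text": (".txt", ".csv"),
    "Audio": (".aif", ".mp3", ".mid", ".wav"),
    "Video": (".mpg", ".flv", ".mp4"),
    "Archive": (".zip", ".tar", ".tar.bz2", ".tar.gz"),
}

# Per the notes, audio and video are indexed for metadata only.
METADATA_ONLY = {"Audio", "Video"}


def classify(filename):
    """Return (file_type, handling) for a filename, or None if unsupported."""
    name = filename.lower()
    for file_type, extensions in FILE_TYPES.items():
        # str.endswith accepts a tuple, so ".tar.gz" matches as a whole suffix.
        if name.endswith(extensions):
            if file_type == "Archive":
                return file_type, "extract"
            if file_type in METADATA_ONLY:
                return file_type, "metadata"
            return file_type, "full_text"
    return None
```

For example, `classify("backup.tar.gz")` returns `("Archive", "extract")`, while `classify("song.mp3")` returns `("Audio", "metadata")`. Matching on the full suffix rather than the last dot-separated segment is what keeps compound extensions such as `.tar.gz` from being misread as `.gz`.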