Lucene text search
Description
The Lucene text search module provides functionality to find documents and data fast. The implementation is optimized for Mendix. It provides an alternative when XPath and OQL are not fast enough. You can use multiple indexes and search over multiple entities.
Typical usage scenario
Typical usage scenarios are:
- Searching for text in documents.
- Finding data fast when there are more than 100,000 records.
- Finding data in multiple objects at once, like orders, order lines and items.
Features and limitations
- Searching in file documents: Excel (xls and xlsx), Word (doc and docx), XML, text and PDF.
- Automatic recognition of file formats.
- Define your own indexed texts.
- Supports multiple indexes
- Batch create and update indexes.
- Backup and restore for Mendix Cloud is included.
- Transfer indexes between environments.
- Searching with wildcards and logical operators like AND and OR.
Installation
- Import the module from the Marketplace.
- Upgrade the Marketplace Excel Importer and Excel Exporter modules and remove the old JARs (POI3.*).
- Optionally remove the sample implementation if you don’t need it.
Configuration
Look at the entities SampleCompany and SampleFileDocument for an example. You can remove them once your own implementation works, or right away if you have implemented this module before.
- Choose the number of indexes. Each index is identified by a unique number, which you pass when updating or searching. It is fine to use a single index and put all data in it. You can also use multiple indexes, for example:
- Customers, orders, order lines and items
- Contracts, customers and documents
- Choose the entities and attributes that must be indexed. Take the end user’s perspective when deciding which entities are grouped in one index. For example, to search orders, also add the related items, customers, etc. to the indexed text of the order.
- Optionally use an enumeration for the index numbers, like Enum_IndexType, and set it on those entities.
- Create an N:1 association from Lucene.SearchResult to every indexed main entity that can appear as a search result.
- Create a batch update microflow or OQL statement to create the index with existing data.
- Choose your index recovery strategy: either ‘Batch’ or ‘Restore’. With an on-premises implementation or in a local environment you can skip this step, because the temporary folder is not deleted.
- If you use file documents for the ‘Restore’ strategy: connect the after-startup (ASu_RestoreBackup) and before-shutdown (BSd_CreateBackup) microflows.
- For ‘Batch’: run a batch update after startup that creates the indexes. It is recommended to have this in place in case your application crashes. OQL is the fastest way to index all data.
- Update the indexes: create before-commit and before-delete microflows, using the ones in the module as an example. There are two variants: one for entities and one for file documents.
- In the before-commit microflow, construct your searchable text with the values separated by spaces. Don’t forget to exclude or replace the ‘null’ texts.
- Format the dates yourself.
- Implement search
- Create a search command
- Add search text like “Romans AND War”
- Call the Java action to find data with that search text
- For every result, a SearchResult object is created with one of the associations filled in.
- Open the correct screen.
- When a search result is opened, inspect which association is filled in and open the corresponding form (see MF_OpenResult).
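The before-commit steps in the list above (build one searchable string separated by spaces, leave out ‘null’ values, format the dates yourself) could be sketched in Java roughly as follows. This is only an illustration: SearchTextBuilder, addText and addDate are hypothetical names that are not part of the module, and the date pattern dd-MM-yyyy is an assumption; choose the format your end users actually type.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the module): collects attribute values
// into one space-separated string that can be handed to the index.
final class SearchTextBuilder {
    // Assumed date pattern; pick the notation your end users search with.
    private static final DateTimeFormatter DATE_FORMAT =
            DateTimeFormatter.ofPattern("dd-MM-yyyy");

    private final List<String> parts = new ArrayList<>();

    // Adds a text value, skipping empty values and the literal text "null".
    SearchTextBuilder addText(String value) {
        if (value != null) {
            String trimmed = value.trim();
            if (!trimmed.isEmpty() && !"null".equalsIgnoreCase(trimmed)) {
                parts.add(trimmed);
            }
        }
        return this;
    }

    // Adds a date in the assumed pattern, skipping empty dates.
    SearchTextBuilder addDate(LocalDate date) {
        if (date != null) {
            parts.add(DATE_FORMAT.format(date));
        }
        return this;
    }

    // Joins all collected values with single spaces.
    String build() {
        return String.join(" ", parts);
    }
}
```

A call such as new SearchTextBuilder().addText(name).addText(city).addDate(founded).build() would then give the text to index. Note that this sketch also drops a genuine word ‘null’; replace it with a placeholder instead if that matters for your data.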
Search syntax
You can use words like ‘romans’ and ‘war’.
You can use wildcards like te?t, test* or te*t.
Use double quotes for phrases, like “jakarta apache”.
You can use operators like AND, NOT and “-” (operators must be written in capitals).
Fields like title: are not helpful, because the index key is the Mendix identifier and all the words are in one search string.
The search command has special characters; you have to escape them if you provide end-user search functionality. A Java action is provided for that.
https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
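To illustrate what escaping the special characters means, here is a minimal sketch in plain Java. It puts a backslash in front of the characters that the Lucene classic query parser treats as syntax, which is the same character set that Lucene's own QueryParser.escape handles. QueryEscaper is a hypothetical name; the Java action provided with the module may behave differently.

```java
// Hypothetical sketch (not the module's Java action): escapes the characters
// that have a special meaning in the Lucene classic query syntax.
final class QueryEscaper {
    // Characters treated as query syntax: \ + - ! ( ) : ^ [ ] " { } ~ * ? | & /
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Returns the input with a backslash in front of every special character.
    static String escape(String input) {
        StringBuilder escaped = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }
}
```

Escaping everything also disables wildcards and quoted phrases, so apply it only to input that should be taken literally. The words AND, OR and NOT are not touched by this sketch.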
Technical details
The module uses the Apache Lucene full-text search library. Lucene stores an index as a set of files in a directory; the module uses subdirectories of the temp folder, like lucene_1. The index is used to find data.
Most of the indexing time is consumed by the index update itself: updating 1 object takes almost as long as updating 500 objects. The real work is therefore done in the background, where the module waits for 5 seconds or 500 objects before it updates the index. This is fast enough to run an Excel import with before-commit events without a significant delay in the foreground process.
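The idea of collecting changes and writing them in one go can be sketched as follows. This is not the module's code: IndexBatcher, offer and drainIfDue are hypothetical names, and the sketch assumes that a batch is written as soon as either limit (500 objects or 5 seconds since the first queued object) is reached. The caller passes the current time in, which keeps the logic easy to test.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "update the index after 500 objects or 5 seconds".
final class IndexBatcher {
    static final int MAX_BATCH_SIZE = 500;       // limit taken from the text above
    static final long MAX_WAIT_MILLIS = 5_000L;  // limit taken from the text above

    private final List<String> pending = new ArrayList<>();
    private long firstQueuedAt = -1L;

    // Queues one object id. Returns the batch to write to the index when a
    // limit is reached, otherwise an empty list.
    synchronized List<String> offer(String objectId, long nowMillis) {
        if (pending.isEmpty()) {
            firstQueuedAt = nowMillis;
        }
        pending.add(objectId);
        return drainIfDue(nowMillis);
    }

    // Meant to be called periodically by a background thread, so that a
    // small batch is not left waiting forever.
    synchronized List<String> drainIfDue(long nowMillis) {
        boolean full = pending.size() >= MAX_BATCH_SIZE;
        boolean waitedLongEnough =
                !pending.isEmpty() && nowMillis - firstQueuedAt >= MAX_WAIT_MILLIS;
        if (!full && !waitedLongEnough) {
            return new ArrayList<>();
        }
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        return batch;
    }
}
```

A before-commit event would only call offer, which is cheap; the expensive index update happens once per returned batch, which is why the foreground process hardly notices it.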
After a time-out of 10 minutes the index is closed.
Fields are not implemented. You can add them by changing the Java code.
Log node: Lucene
Lucene version is 8.8.2
Upgrade from previous version
The file format has completely changed, so delete the index and batch create the index again.
Remove the old Lucene*.jar and Tika*.jar files from the userlib folder. This is important; otherwise you will get Java compilation errors.
Dependencies
- The module uses the same POI JARs as the Excel Importer module.
Known issues
- Cannot be used together with older versions of the Excel Importer and Excel Exporter modules (POI3.X JARs).
Release notes
- Version 2.0.0
- Updated to Mendix 7.7.1, 8.18, 9.1
- Included Lucene 8.8.2 library files
- Use Tika jar for document text extracting
- Added batch update with OQL, which updates and indexes 100,000 rows/minute
- Improved error handling
- Included full sample implementation in module.
- Centralized and updated Atlas layouts.
- Exposed the Java actions as microflow actions
- Result count constant replaced with parameter.
- Added Dutch translations
Sample implementation
The included sample can be used to learn Lucene. The resource folder contains a generic Excel file (fortune1000-2012.xlsx) that can easily be imported with the Excel Importer (https://docs.mendix.com/appstore/modules/excel-importer). After importing, use the search button to search for data. Note that the before-commit actions hardly have any impact on the import. Two rebuild-index microflows, one with batches and one with OQL, are added to show you how to index existing data in your database.