Indexing (or re-indexing) is the process of sending entity metadata and ACLs to the search back-end (Apache Solr or OpenSearch), so that search features return correct, up-to-date results.
Below is a list of endpoints that depend on search.
-
GET /storage/file
-
GET /storage/{id}/file
-
PUT /item
-
GET /item
-
GET /item/saved/{hash}
-
GET /search/saved/{hash}
-
GET /search
-
PUT /search
-
PUT /search/autocomplete
-
GET /search/shape
-
PUT /search/shape
-
GET /search/file
-
PUT /search/file
-
PUT /document/search
-
PUT /API/metadata-field/field-group
Normally, indexing/re-indexing happens automatically whenever a relevant entity is updated, for example on item/collection metadata updates, ACL updates, or file size changes.
It can also be triggered manually using:
-
PUT /item/{id}/re-index
-
PUT /reindex/{type}
What does the indexing/re-indexing process look like internally?
There are two index processing routes internally:
Route A:
-
An API request, a storage scan, or a job processing thread makes changes to certain entities.
-
The same process that handles the request/change builds a new Solr/ES document and sends it to the IndexQueue in ActiveMQ.
-
IndexCruncher or OpenSearchIndexCruncher, depending on whether Solr or OpenSearch is used, picks up the messages from IndexQueue and sends them to the search back-end. After this, search results should reflect the latest changes.
Route B, which is the most common route:
-
An API request, a storage scan, or a job processing thread makes changes to certain entities.
-
The same process that handles the request/change marks the relevant entries in the t_indexlog table as pending for re-index.
-
ReindexCruncher picks up the changes in t_indexlog almost immediately, builds index documents for the related entities, and sends them to the IndexQueue in ActiveMQ.
-
IndexCruncher or OpenSearchIndexCruncher, depending on whether Solr or OpenSearch is used, picks up the messages from IndexQueue and sends them to the search back-end. After this, search results should reflect the latest changes.
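The four steps of Route B can be modelled with a toy, in-memory sketch. It is only an illustration of the data flow: the function names, the status codes, the item ID, and the use of plain dictionaries and a deque in place of the database, ActiveMQ, and the search back-end are all assumptions made for this example, not the actual implementation.

```python
from collections import deque

PENDING, DONE = 1, 0  # hypothetical status codes for this sketch


def mark_pending(indexlog, entity_id):
    """Step 2: the request handler only flags the entity for re-index."""
    indexlog[entity_id] = PENDING


def reindex_cruncher(indexlog, entities, index_queue):
    """Step 3: build index documents for pending entries and enqueue them."""
    for entity_id, status in list(indexlog.items()):
        if status == PENDING:
            index_queue.append({"id": entity_id, **entities[entity_id]})
            indexlog[entity_id] = DONE


def index_cruncher(index_queue, search_backend):
    """Step 4: drain the queue into the search back-end."""
    while index_queue:
        doc = index_queue.popleft()
        search_backend[doc["id"]] = doc


# Toy state: entity store, index log, queue, and search back-end.
entities = {"ITEM-1": {"title": "old title"}}
indexlog, index_queue, search_backend = {}, deque(), {}

entities["ITEM-1"]["title"] = "new title"          # step 1: an entity changes
mark_pending(indexlog, "ITEM-1")                   # step 2
reindex_cruncher(indexlog, entities, index_queue)  # step 3
index_cruncher(index_queue, search_backend)        # step 4
```

The point of the indirection is that the request handler does no document building itself; it only records that something changed, and the document is built later from the current state of the entity.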
When to perform a re-index?
As mentioned above, re-indexing of an entity happens automatically, and the result should be searchable after a short while. Typically, a manual re-index is only needed when:
-
The search index is not available anymore. For example: the Solr/OpenSearch index was not preserved after a system migration.
-
ActiveMQ suffered a fatal error, and the unprocessed messages were lost.
-
You want to be extra sure that an entity is re-indexed.
How to perform a re-index
Perform the following requests in sequence to re-index the whole system.
-
PUT /reindex/acl
-
PUT /reindex/item
-
PUT /reindex/collection
-
PUT /reindex/file
-
PUT /reindex/document
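The sequence above could be scripted along these lines. This is a minimal sketch using only the Python standard library; the base URL `http://localhost:8080/API` is an assumption, authentication is omitted and has to be added for a real system, and the helper names are made up for this example. The text does not say whether each call waits for the re-index to finish, so the sketch only guarantees the order in which the requests are sent.

```python
import urllib.request

# Order taken from the list above: ACLs first, then the entity types.
REINDEX_TYPES = ["acl", "item", "collection", "file", "document"]


def build_reindex_requests(base_url):
    """Return one PUT request per type, in the documented order."""
    base = base_url.rstrip("/")
    return [
        urllib.request.Request(f"{base}/reindex/{t}", method="PUT")
        for t in REINDEX_TYPES
    ]


def run_full_reindex(base_url, opener=urllib.request.urlopen):
    """Send the requests one at a time; an HTTP error stops the sequence."""
    for request in build_reindex_requests(base_url):
        with opener(request) as response:  # urlopen raises HTTPError on 4xx/5xx
            print(request.full_url, response.status)


# Example call (assumed base URL, not executed here):
# run_full_reindex("http://localhost:8080/API")
```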
To re-index only the entities that were changed within a certain time range, for example after the ActiveMQ failure mentioned above, use an SQL query such as:
UPDATE t_indexlog SET c_status = 1 WHERE c_processed > '2022-01-18';
Replace the date in the example with the start of the time range you need.
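If the update is run from a script, the date can be passed as a bound parameter instead of being edited into the SQL text. The sketch below demonstrates this against an in-memory SQLite table that stands in for the real database; the toy table contains only the two columns named in the query, its rows are invented for the demonstration, and the helper name is hypothetical. Other database drivers use a different placeholder style (for example `%s` instead of `?`).

```python
import sqlite3

MARK_PENDING_SQL = "UPDATE t_indexlog SET c_status = 1 WHERE c_processed > ?"


def mark_pending_since(conn, cutoff):
    """Flag every index-log row processed after `cutoff` for re-index."""
    cur = conn.execute(MARK_PENDING_SQL, (cutoff,))
    conn.commit()
    return cur.rowcount  # number of rows that were updated


# Toy stand-in table with only the two columns named in the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_indexlog (c_status INTEGER, c_processed TEXT)")
conn.executemany(
    "INSERT INTO t_indexlog VALUES (?, ?)",
    [
        (0, "2022-01-17 23:00:00"),
        (0, "2022-01-18 09:30:00"),
        (0, "2022-01-19 12:00:00"),
    ],
)

updated = mark_pending_since(conn, "2022-01-18")
```

Checking the returned row count before relying on the result is a cheap way to confirm that the cutoff matched the rows you expected.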
To re-index a single item:
PUT /item/{id}/re-index
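As a rough illustration, the request for one item could be built as follows. The base URL and the item ID `ITEM-123` are placeholders chosen for this example, the helper name is hypothetical, and authentication is left out.

```python
import urllib.parse
import urllib.request


def build_item_reindex_request(base_url, item_id):
    """Build PUT {base_url}/item/{id}/re-index for one item."""
    safe_id = urllib.parse.quote(item_id, safe="")  # escape the ID for use in a path
    url = f"{base_url.rstrip('/')}/item/{safe_id}/re-index"
    return urllib.request.Request(url, method="PUT")


request = build_item_reindex_request("http://localhost:8080/API", "ITEM-123")
# urllib.request.urlopen(request)  # uncomment to send; add authentication first
```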