UC Indexing
Indexing (or re-indexing) is the process of sending entity metadata and ACLs to the search back-end (Apache Solr or OpenSearch), so that search features work correctly.
Below is a list of endpoints that depend on search.
GET /storage/file
GET /storage/{id}/file
PUT /item
GET /item
GET /item/saved/{hash}
GET /search/saved/{hash}
GET /search
PUT /search
PUT /search/autocomplete
GET /search/shape
PUT /search/shape
GET /search/file
PUT /search/file
PUT /document/search
PUT /metadata-field/field-group
Normally, indexing/re-indexing happens automatically whenever a relevant entity is updated, for example on item/collection metadata updates, ACL updates, or file size changes.
It can also be triggered manually using:
PUT /item/{id}/re-index
PUT /reindex/{type}
What does the indexing/re-indexing process look like internally?
There are two index processing routes internally:
Route A:
Some API request, storage scanning, or job processing threads make changes to certain entities.
The same process handling the request/change builds a new Solr/OpenSearch document and sends it to the IndexQueue in ActiveMQ.
IndexCruncher or OpenSearchIndexCruncher, depending on whether Solr or OpenSearch is used, picks up the messages from IndexQueue and sends them to the search back-end. After this, search results should reflect the latest changes.
Route B, which is the most common route:
Some API request, storage scanning, or job processing threads make changes to certain entities.
The same process handling the request/change marks the relevant entries in the t_indexlog table as pending for re-index.
ReindexCruncher picks up the changes in t_indexlog almost immediately, builds index documents for the related entities, and sends them to the IndexQueue in ActiveMQ.
IndexCruncher or OpenSearchIndexCruncher, depending on whether Solr or OpenSearch is used, picks up the messages from IndexQueue and sends them to the search back-end. After this, search results should reflect the latest changes.
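The Route B hand-offs can be sketched as a small in-memory model. This is only an illustration of the flow described above, not the actual implementation: the class, its method names, the entity ID, and the "processed" status value 0 are assumptions. Only the names t_indexlog, IndexQueue, ReindexCruncher and IndexCruncher come from the text.

```python
from collections import deque
from dataclasses import dataclass, field

PENDING = 1    # matches the c_status value used for "re-index me"
PROCESSED = 0  # assumed value, for illustration only


@dataclass
class Pipeline:
    """In-memory stand-in for t_indexlog, IndexQueue and the search back-end."""
    index_log: dict = field(default_factory=dict)       # entity id -> status
    index_queue: deque = field(default_factory=deque)   # stands in for ActiveMQ
    search_index: dict = field(default_factory=dict)    # stands in for Solr/OpenSearch

    def mark_pending(self, entity_id):
        # The process handling the change only flags the entity.
        self.index_log[entity_id] = PENDING

    def reindex_cruncher(self, load_entity):
        # Build a document for each pending entry and put it on the queue.
        for entity_id, status in list(self.index_log.items()):
            if status == PENDING:
                self.index_queue.append({"id": entity_id, **load_entity(entity_id)})
                self.index_log[entity_id] = PROCESSED

    def index_cruncher(self):
        # Drain the queue into the search back-end.
        while self.index_queue:
            doc = self.index_queue.popleft()
            self.search_index[doc["id"]] = doc


# Hypothetical entity store and ID, used only to drive the sketch.
entities = {"VX-1": {"title": "Interview"}}
pipeline = Pipeline()
pipeline.mark_pending("VX-1")
pipeline.reindex_cruncher(lambda entity_id: entities[entity_id])
pipeline.index_cruncher()
```

The point of the extra step in Route B is that the thread making the change does no document building itself; it only writes a marker, and the heavier work happens later in ReindexCruncher.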
When to perform a re-index?
As mentioned above, re-indexing of an entity happens automatically, and the result should be searchable after a short while. Typically, a manual re-index is only needed when:
The search index is no longer available. For example: the Solr/OpenSearch index was not preserved after a system migration.
A fatal error occurred in ActiveMQ, and the unprocessed messages were lost.
You want to be extra sure that an entity is re-indexed.
How to perform a re-index
Perform the following requests in sequence to re-index the whole system.
PUT /reindex/acl
PUT /reindex/item
PUT /reindex/collection
PUT /reindex/file
PUT /reindex/document
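The sequence above could be scripted along these lines. This is a minimal sketch using only the Python standard library. The base URL is a placeholder, authentication is omitted, and the function names are hypothetical, so adapt all of these to your installation.

```python
import urllib.request

# Order taken from the list above: ACLs first, documents last.
REINDEX_ORDER = ["acl", "item", "collection", "file", "document"]


def build_reindex_requests(base_url):
    """Return one PUT /reindex/{type} request per entity type, in order."""
    base = base_url.rstrip("/")
    return [
        urllib.request.Request(f"{base}/reindex/{entity_type}", method="PUT")
        for entity_type in REINDEX_ORDER
    ]


def reindex_all(base_url, send=urllib.request.urlopen):
    """Send the requests one at a time.

    urlopen raises on HTTP error responses, so the loop stops at the
    first failed request instead of continuing with later types.
    """
    for request in build_reindex_requests(base_url):
        with send(request) as response:
            print(request.full_url, response.status)
```

Calling reindex_all("http://localhost:8080/API") would issue the five requests in order. The address shown is an assumption, not a documented default.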
To re-index only the entities that were changed within a certain time range, for example after the ActiveMQ failure mentioned above, use an SQL query like:
UPDATE t_indexlog SET c_status = 1 WHERE c_processed > '2022-01-18';
Replace the date in the example with the start of the time range you need.
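If the update is run from a script, passing the cutoff as a bound parameter avoids editing the SQL string by hand. The sketch below uses an in-memory SQLite table purely as a stand-in so that it can run anywhere. The real database, its driver and placeholder syntax, and the full t_indexlog schema will differ. The c_id column and the sample rows are made up, and only c_status and c_processed come from the query above.

```python
import sqlite3


def mark_for_reindex(conn, since):
    """Set c_status = 1 on every t_indexlog row processed after `since`."""
    cursor = conn.execute(
        "UPDATE t_indexlog SET c_status = 1 WHERE c_processed > ?", (since,)
    )
    conn.commit()
    return cursor.rowcount  # number of rows flagged for re-index


# Throwaway table with an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_indexlog (c_id TEXT, c_status INTEGER, c_processed TEXT)"
)
conn.executemany(
    "INSERT INTO t_indexlog VALUES (?, ?, ?)",
    [("a", 0, "2022-01-10"), ("b", 0, "2022-01-19"), ("c", 0, "2022-01-20")],
)
changed = mark_for_reindex(conn, "2022-01-18")
```

Checking the returned row count before relying on the result is a cheap way to confirm that the cutoff date matched the rows you expected.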
To only re-index an item:
PUT /item/{id}/re-index
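As a minimal sketch, the single-item request can be built as follows. The base URL and the item ID are placeholders, the helper name is hypothetical, and authentication is left out.

```python
import urllib.parse
import urllib.request


def build_item_reindex_request(base_url, item_id):
    """Build the PUT /item/{id}/re-index request for one item."""
    safe_id = urllib.parse.quote(item_id, safe="")  # escape unsafe characters
    url = f"{base_url.rstrip('/')}/item/{safe_id}/re-index"
    return urllib.request.Request(url, method="PUT")


# Placeholder values; send with urllib.request.urlopen(request) when ready.
request = build_item_reindex_request("http://localhost:8080/API", "VX-42")
```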