Trokam currently runs on a single-node database. This setup has the advantage of simplicity, but it also comes with limitations and risks.
Data distributed across several nodes can be queried in parallel, which improves performance compared to keeping the same data in one big database.
No matter how big a single node may be, several nodes will always offer more capacity.
Data distributed over several nodes reduces the amount of data lost if one node fails.
Therefore, there are good reasons to take on the complexities of a multi-node database.
How is it done?
There is no single recipe for distributing a system like this one. In this project I would like to try two approaches:
- Cluster of PostgreSQL databases. The objective is to modify the current database model so that several databases can operate in parallel. The road map for this implementation:
Modify the current database model.
Deploy the database on several nodes.
Let the web crunchers continuously feed the databases.
Execute the searches on all nodes at the same time (a sketch of this fan-out follows the list).
- Natively distributable database. There are several new and very interesting databases designed to operate on several nodes. I have not tried any of them yet, but judging only from the documentation, ArangoDB seems a promising alternative to try out. A minimal connection example also follows below.
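
To make the PostgreSQL-cluster idea more concrete, here is a minimal sketch of how web crunchers could pick a node when storing a page and how a search could be fanned out to all nodes and merged. The node addresses, table name and ranking column are hypothetical placeholders, not Trokam's actual schema.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

import psycopg2

# Hypothetical connection strings, one per database node.
NODES = [
    "host=node1.example.org dbname=trokam user=trokam",
    "host=node2.example.org dbname=trokam user=trokam",
    "host=node3.example.org dbname=trokam user=trokam",
]


def node_for_url(url):
    """Pick the node that stores a given page, by hashing its URL.
    Web crunchers could use this to spread pages over the nodes."""
    digest = int(hashlib.sha1(url.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]


def search_one_node(dsn, term, limit=20):
    """Run the same query on a single node and return its partial results."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT url, relevance FROM page_words "
                "WHERE word = %s ORDER BY relevance DESC LIMIT %s",
                (term, limit),
            )
            return cur.fetchall()


def search_all_nodes(term, limit=20):
    """Query every node in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        partial = pool.map(lambda dsn: search_one_node(dsn, term, limit), NODES)
    # Merge the partial results and keep the globally best-ranked pages.
    merged = [row for rows in partial for row in rows]
    merged.sort(key=lambda row: row[1], reverse=True)
    return merged[:limit]


if __name__ == "__main__":
    for url, relevance in search_all_nodes("sustainability"):
        print(relevance, url)
```

The sketch pushes the distribution logic into the application: the crunchers shard by URL hash, and the search front end merges the per-node rankings.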
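For the natively distributed option, a query against an ArangoDB cluster could look like the sketch below, using the python-arango driver. The coordinator address, database name, collection and attribute names are assumptions for illustration only; in a cluster, sharding and parallel execution are handled by ArangoDB itself rather than by the application.

```python
from arango import ArangoClient

# In a cluster deployment the client talks to a coordinator node, which
# distributes the query over the database servers transparently.
client = ArangoClient(hosts="http://coordinator.example.org:8529")
db = client.db("trokam", username="trokam", password="secret")

# AQL query over a hypothetical "pages" collection.
cursor = db.aql.execute(
    "FOR page IN pages "
    "FILTER @term IN page.words "
    "SORT page.relevance DESC LIMIT 20 "
    "RETURN {url: page.url, relevance: page.relevance}",
    bind_vars={"term": "sustainability"},
)

for result in cursor:
    print(result["relevance"], result["url"])
```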
Performance, reliability, maintainability and complexity will determine which one is chosen to run Trokam in the long term.
Why not Hadoop?
One popular approach is to build the whole system on top of Hadoop, a general-purpose tool for distributing all sorts of tasks.
An open-source search engine based on Hadoop already exists: Nutch.
In this project I would like to explore a different path to develop a search engine.