Optimized Search Solution for Storing and Retrieving Large Files of Legacy Systems
Keywords:
Searching, Distributed Systems, Legacy Systems, Large Files, Key-Value Mapping, Elastic Search, HBase, Search Indexes, NoSQL
Abstract
As organizations transition from legacy systems to modern architectures, managing and retrieving historical data efficiently becomes a significant challenge. Traditional relational databases are often unsuitable because of their high storage costs and limited scalability. In addition, when organizations acquire other companies, the storage engines used by the acquired companies are gradually migrated to the parent organization's processes, yet the legacy data must still be retained for auditing and analysis. The approach discussed in this paper converts structured and unstructured legacy system data into files, which are then stored in HDFS for cost-effective, distributed storage. To enable fast search and retrieval, we use Elastic Search to index metadata and key terms extracted from these files. Because Elastic Search is designed for real-time indexing and full-text search, it allows users to perform rapid lookups based on predefined attributes. However, storing complete file metadata in Elastic Search can be inefficient. To address this, we use HBase as a NoSQL mapping layer that links search index entries to the corresponding HDFS file paths. As a result, only essential metadata is indexed in Elastic Search rather than entire file details, and full records are retrieved by using HBase as a key-value lookup store. The proposed system optimizes both storage cost and query performance by distributing large data across HDFS while relying on the indexing capabilities of Elastic Search and the fast lookup capabilities of HBase. This architecture is particularly beneficial for enterprises dealing with regulatory compliance, audits, and historical data access, where retaining legacy data is essential but must be both cost-effective and easily searchable.
The study concludes that the combination of HDFS for distributed storage, HBase for index mapping, and Elastic Search for keyword-based searching provides an optimal balance between cost efficiency and performance for managing legacy system archives.
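As an illustration only, the lookup flow described in the abstract can be sketched in Python. The sketch uses in-memory dictionaries as stand-ins for HDFS, HBase, and Elastic Search and calls no real client library; the class name, method names, row keys, and file paths are all hypothetical and are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveSketch:
    """In-memory stand-ins for the three layers (every name here is hypothetical)."""

    # Stand-in for HDFS: file path -> file content.
    file_store: dict = field(default_factory=dict)
    # Stand-in for the HBase mapping table: row key -> HDFS file path.
    path_map: dict = field(default_factory=dict)
    # Stand-in for the Elastic Search index: lowercased key term -> set of row keys.
    term_index: dict = field(default_factory=dict)

    def archive(self, row_key, path, content, key_terms):
        """Store the file, record its path under the row key, and index only the key terms."""
        self.file_store[path] = content
        self.path_map[row_key] = path
        for term in key_terms:
            self.term_index.setdefault(term.lower(), set()).add(row_key)

    def search(self, term):
        """Return (row key, file path) pairs for a term; file contents are not read."""
        row_keys = sorted(self.term_index.get(term.lower(), set()))
        return [(key, self.path_map[key]) for key in row_keys]

    def fetch(self, row_key):
        """Resolve the row key to a path via the mapping, then read the full record."""
        return self.file_store[self.path_map[row_key]]


# Hypothetical sample data, used only to exercise the sketch.
store = ArchiveSketch()
store.archive(
    "inv-0001",
    "/archive/billing/inv-0001.json",
    '{"customer": "ACME", "total": 120}',
    ["ACME", "invoice"],
)
store.archive(
    "inv-0002",
    "/archive/billing/inv-0002.json",
    '{"customer": "Globex", "total": 75}',
    ["Globex", "invoice"],
)

hits = store.search("acme")         # index lookup only
record = store.fetch(hits[0][0])    # mapping lookup, then file read
```

The point of the sketch is the separation of concerns: the index holds only terms and row keys, the mapping holds only paths, and file contents are touched only when a specific record is fetched.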
