Folding Space is a records management and software company based in Birmingham. They deliver innovative, value-for-money software solutions using complex, sensitive data in a secure and safe way, from checking companies are GDPR compliant to providing a service for digitising hard copies of many thousands of documents.
The company also enables clients to search digitised databases for documents that contain keywords or phrases of interest.
As Folding Space grew and took on larger clients such as NHS trusts, the document databases they dealt with similarly expanded. This expansion was great for the business but caused a number of challenges that needed to be addressed.
The first challenge was the increased resources required to perform database searches. For example, a user may wish to obtain all the records in the NHS database that contain the word ‘cancer’. A more complicated query may consist of two or more words such as ‘ovarian cancer’ or two different terms such as ‘cancer’ and ‘diabetic’.
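The three kinds of query described above can be sketched in a few lines of Python. This is only an illustration: the three-document collection and the function names are hypothetical, and the sketch scans every document for every query, which is exactly the approach that becomes expensive as a collection grows.

```python
import re

# Hypothetical miniature collection standing in for a real records database.
DOCUMENTS = {
    1: "Patient referred for ovarian cancer screening.",
    2: "Diabetic patient with a family history of cancer.",
    3: "Routine check-up, no concerns noted.",
}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def search_keyword(docs, term):
    """Return the ids of documents containing a single word."""
    term = term.lower()
    return sorted(i for i, text in docs.items() if term in tokenize(text))

def search_phrase(docs, phrase):
    """Return the ids of documents containing the words of a phrase, adjacent and in order."""
    words = tokenize(phrase)
    if not words:
        return []
    n = len(words)
    hits = []
    for i, text in docs.items():
        tokens = tokenize(text)
        # Slide a window of n tokens across the document.
        if any(tokens[k:k + n] == words for k in range(len(tokens) - n + 1)):
            hits.append(i)
    return sorted(hits)

def search_all_terms(docs, terms):
    """Return the ids of documents containing every one of several separate words."""
    wanted = {t.lower() for t in terms}
    return sorted(i for i, text in docs.items() if wanted <= set(tokenize(text)))

if __name__ == "__main__":
    print(search_keyword(DOCUMENTS, "cancer"))
    print(search_phrase(DOCUMENTS, "ovarian cancer"))
    print(search_all_terms(DOCUMENTS, ["cancer", "diabetic"]))
```

Every call re-reads the whole collection, so the work per query grows in step with the number of documents.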
The second challenge was database storage management. The company needed a system that could handle terabytes of storage for the existing document collections and associated data structures, both to enable faster searches and to support the business in scaling up.
The aim of the project was to locate and evaluate software from the marketplace that could enable fast storage and searching of large numbers of documents. Folding Space consulted the Think Beyond Data team at Aston University to explore how they might achieve this.
In conjunction with Folding Space, the Think Beyond Data team drew up a programme of work in which available document searching tools would be identified and comprehensively evaluated against existing exemplar data and queries. The tools were to be tested on a range of computing configurations, from a single computer to a cluster of computers to a cloud-based service. The outcome of the project would be a report detailing the findings, together with recommendations as to the most effective software.
Think Beyond Data identified two primary database systems that could potentially perform the document database searches: MariaDB and ElasticSearch. Folding Space provided a dataset containing hundreds of thousands of documents and a set of 200 keyword searches with which to test them, and the team thoroughly evaluated both systems using this data and these example queries.
Three computing configurations were compared. The first was a single computer. The second was a small cluster of two computers, set up in a lab at the University to act as a private cluster. The third was the cloud service provided by Google, tested as a further option for Folding Space to consider.
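A comparison of this kind amounts to running the same query set against each system and recording the elapsed time. The sketch below shows one way that could be done in Python. It is not the team's actual benchmarking method, which the write-up does not describe: the function names, the `repeats` parameter and the stand-in search function are all assumptions made for illustration.

```python
import time

def time_query_set(search, queries, repeats=3):
    """Return the best total wall-clock time, in seconds, to run every query once.

    `search` is any callable that takes a query string. Taking the best of
    several repeats reduces noise from caching and background load.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for query in queries:
            search(query)
        best = min(best, time.perf_counter() - start)
    return best

def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

if __name__ == "__main__":
    # Hypothetical stand-in for a call into a real search system.
    def fake_search(query):
        return [word for word in ("cancer", "diabetic") if word in query]

    queries = ["cancer", "ovarian cancer", "diabetic"]
    elapsed = time_query_set(fake_search, queries)
    print(f"{len(queries)} queries took {elapsed:.6f} s")
```

In a real comparison, `search` would wrap a call to each system under test, and the same query list would be passed to every run so that the timings are comparable.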
The work programme resulted in two important outputs: a report on ‘Lucene and ElasticSearch: State of the Art and Potential Extensions’ and a comparative study regarding ‘MySQL and ElasticSearch’.
In all instances, ElasticSearch searched documents much faster than MariaDB, the existing software. This is because MariaDB is primarily a database that uses extensions to search documents, whereas ElasticSearch is a system constructed specifically for this purpose. Based on the findings from the report and study, Think Beyond Data recommended that Folding Space implement the ElasticSearch system to perform document searches.
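The data structure at the heart of Lucene, on which ElasticSearch is built, is the inverted index: a map from each word to the set of documents that contain it. The toy Python version below is a sketch for illustration only, with hypothetical names and a made-up three-document collection. It omits everything a production engine adds, such as relevance scoring, compression and distribution across machines.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def lookup(index, term):
    """Return the ids of documents containing the term, without scanning any text."""
    return sorted(index.get(term.lower(), set()))

def lookup_all(index, terms):
    """Return the ids of documents containing every term, by intersecting id sets."""
    postings = [index.get(t.lower(), set()) for t in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

if __name__ == "__main__":
    # Hypothetical miniature collection.
    docs = {
        1: "Patient referred for ovarian cancer screening.",
        2: "Diabetic patient with a family history of cancer.",
        3: "Routine check-up, no concerns noted.",
    }
    index = build_inverted_index(docs)
    print(lookup(index, "cancer"))
    print(lookup_all(index, ["cancer", "diabetic"]))
```

The cost of reading the documents is paid once, when the index is built. After that, a keyword query is a dictionary lookup and a multi-term query is a set intersection, neither of which depends on re-reading the document text.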
Fig. 1. The time in seconds taken by MariaDB to perform the 200 supplied queries on the supplied document collection using a single computer system, compared with the time taken by ElasticSearch using both single and two computer systems.
The ElasticSearch technology enabled much faster loading of the document collections than the current software. Moreover, adopting this new technology enabled Folding Space to speed up their searches 16-fold, as shown in Fig. 1, freeing up staff time by a similar margin.
The team also found that a local, single-computer ElasticSearch setup was superior to a cluster or the Google cloud service in terms of query times, although the differences were marginal. Streamlining these processes and relieving some of the burden on their team allowed Folding Space to continue taking on new clients whilst maintaining a high quality of service for their existing clients.
Staff at both Folding Space and Think Beyond Data are delighted with the outcome of this project. Further collaboration between Folding Space and Aston University is now taking place under a new ‘Knowledge Transfer Partnership’ (a partially Government-funded programme) which will allow Folding Space to work with experts at Aston University to further innovate within the Big Data field, focusing on ‘Entity Recognition’.
Entity Recognition will enable the quick retrieval of documents related to a particular entity (e.g. an object of interest or a person), and is beneficial when looking at areas such as GDPR compliance.
“The work from Think Beyond Data has led to us greatly improving our technical proposition and commercial endeavours. Indeed, it underpins much of what we now are successfully selling on our new website at www.foldingspace.co.uk. The quality of each report was outstanding and the consequent debate and discussions with us were very informed and involving; there was a real sense of innovative collaboration.”
“Our experience with the ERDF project was constructive, instructive and extremely enjoyable, which is not the norm for most applied research projects. The work undertaken by the excellent Aston University team of Darren Chitty, Peter Lewis, Alina Patelli and Antonio Garcia Dominguez was comprehensive and illuminating. I certainly commend you and your colleagues and the ERDF initiative. It was very worthwhile.”
“The ERDF work lived up to its label of ‘Think Beyond Data’. It enabled us to make the right decision regarding the adoption of an Indexing Engine (ELS: Elastic Search / Lucene) to drive forward our innovative software Discovery Platform.”
“We are now reaping the benefits with greatly increased data scalability, performance and customer support. The ERDF investigation, analysis and guidance was extremely well founded and insightful and this, in turn, has led to success in achieving an Innovate UK KTP award to move the ELS research forward to the next stage, as well as bringing significant commercial benefit.”
Geoffrey N. Smith, Managing Director at Folding Space