Abstract

Today, the world is increasingly awash in unstructured data, not only because of the Internet, but also because data that used to be collected on paper or on media such as film, DVDs, and compact discs has moved online [1]. Most of this data is unstructured and comes in diverse formats such as e-mail, documents, graphics, images, and videos. Object storage has a clear advantage in managing the complexity and scalability of unstructured data. Object-based data de-duplication is currently the most advanced method and an effective solution for detecting duplicate data. It can detect common embedded data across completely unrelated files on the first backup, even when the physical block layout changes. However, almost all current research on data de-duplication does not consider the content of different file types and has no knowledge of the backup data format; it has been shown that such methods cannot achieve optimal performance for compound files. In our proposed system, we first extract objects from files; Object_IDs are then obtained by applying a hash function to the objects. The resulting Object_IDs are used as indexing keys in a B+ tree-like index structure, which avoids the need for a full object index and reduces the search time for duplicate objects to O(log n). We also introduce a new concept, the duplicate object resolver. The resolver mediates access to all objects and serves as the central point for managing their metadata and indexes. Every object is addressable by its globally unique ID. The resolver stores metadata in a triple format. This improved metadata management strategy allows object properties to be set, added, and resolved with high flexibility, and allows the same metadata to be reused among duplicate objects.
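The following minimal sketch illustrates the flow described above, under assumptions of our own: SHA-256 stands in for the hash function, a sorted in-memory list (with binary search) stands in for the B+ tree-like index, and the class and property names (ObjectResolver, referenced_by) are purely illustrative rather than taken from the actual system.

```python
import bisect
import hashlib


class ObjectResolver:
    """Illustrative resolver: maps Object_IDs to metadata triples and
    answers duplicate queries via a sorted index (B+ tree stand-in)."""

    def __init__(self):
        self._index = []    # sorted list of Object_IDs, binary-searchable in O(log n)
        self._triples = []  # metadata stored as (object_id, property, value) triples

    @staticmethod
    def object_id(data: bytes) -> str:
        # Hash the extracted object's content to obtain its Object_ID.
        return hashlib.sha256(data).hexdigest()

    def is_duplicate(self, oid: str) -> bool:
        # O(log n) membership test on the sorted index.
        pos = bisect.bisect_left(self._index, oid)
        return pos < len(self._index) and self._index[pos] == oid

    def store(self, data: bytes, properties: dict) -> str:
        oid = self.object_id(data)
        if not self.is_duplicate(oid):
            bisect.insort(self._index, oid)  # keep the index ordered
            # New object: record its metadata as triples.
            for prop, value in properties.items():
                self._triples.append((oid, prop, value))
        else:
            # Duplicate object: reuse the existing metadata and only add
            # a back-reference (e.g. the new containing file).
            self._triples.append((oid, "referenced_by", properties.get("file", "")))
        return oid


# Usage example: the same embedded object in two different compound files
# resolves to the same Object_ID and shares its metadata.
resolver = ObjectResolver()
resolver.store(b"embedded image bytes", {"type": "image/jpeg", "file": "report.docx"})
print(resolver.store(b"embedded image bytes", {"file": "slides.pptx"}))  # same Object_ID
```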

Full text