February 16, 2018

File Archiving 101 – Corporate File Archiving Basics

The Basics on How to Save Space, Time and Money on Storage

How it works, how to avoid common mistakes, and how our innovative stub-based approach keeps your archived files accessible.

If your business uses computers to store work, records and other sensitive information, you probably have at least a passing familiarity with the various archiving software solutions on the market. But too often archiving is an afterthought for small- and medium-sized businesses, which leads to throwing good money after bad on ever-increasing storage costs.

Today we’re going back to the basics of what file archiving is, how it works, and how data archiving solutions are changing the way companies of all sizes approach their information management needs.

What is file archiving?

The first exposure most computer users have to file archiving is in the form of simple .zip or .rar files, which can compress relatively small amounts of data (~4GB[1]) to save hard drive space or ease file transfer. They’re also helpful for distributing software: programs are often composed of multiple files, and archives help keep everything together in a tidy package.

The ZIP format has been around nearly 30 years, an eon in computer time. In that time, various forms of archiving software have added more features, like encryption (for security), checksums (for error detection), and descriptive metadata (for discovery and file management); at the enterprise level, file archiving solutions (like ShareArchiver) include advanced analytics and policy management tools.
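As a rough illustration of the checksum feature mentioned above, the sketch below uses Python's standard `zipfile` and `zlib` modules to build a small archive in memory and confirm that the CRC-32 value stored for a file matches one computed from the original bytes. The file name and contents are made up for the example.

```python
import io
import zipfile
import zlib

# Hypothetical file contents, used only for this illustration.
payload = b"invoice 2018-02-16\n" * 50

# Write a compressed archive into an in-memory buffer.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("invoice.txt", payload)

# Re-open the archive and inspect the stored checksum.
with zipfile.ZipFile(buffer) as archive:
    info = archive.getinfo("invoice.txt")
    stored_crc = info.CRC                # CRC-32 recorded when the file was added
    first_bad_file = archive.testzip()   # None when every member passes its check

print("stored CRC matches:", stored_crc == zlib.crc32(payload))
print("compressed", info.file_size, "bytes down to", info.compress_size)
```

If a single byte of the archived data were corrupted, the recomputed checksum would almost certainly differ from the stored one, which is how extraction tools detect damage.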

These higher-end solutions are designed for businesses with terabyte- or petabyte-level data requirements, which entail high storage costs. By reducing the size of the data, file archiving solutions reduce server or cloud costs commensurately.

Regardless of the sophistication of the program, the foundation of file archiving is data compression.

What is data compression?

Impossible spaces are a common trope in science fiction and fantasy: think Doctor Who’s TARDIS or the wardrobe from C. S. Lewis’s Narnia series. Thinking through the implications (physically, scientifically, conceptually) of a space that is larger on the inside than on the outside is enough to make your eyes cross, but it’s a basic premise of computer file storage.

Data compression (also known as source coding or bit-rate reduction) is about finding ways to store information (or, more precisely, its digital representation) using fewer bits than the original. In archiving, this process can be reversed, restoring the file to its original state. Lossless compression, which is intended to maintain a perfect representation of the information despite a reduction in size, relies on the principle of statistical redundancy.
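To make the "reversible" part concrete, here is a minimal sketch using Python's built-in `zlib` module. The sample data is invented and deliberately repetitive, so the savings shown are far better than a typical office file would see.

```python
import zlib

# Hypothetical, highly redundant sample data.
original = b"quarterly report draft " * 1000

compressed = zlib.compress(original, 9)   # 9 = strongest compression level
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
print("identical after round trip:", restored == original)
```

The round trip returns exactly the bytes that went in, which is what "lossless" means in practice.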

A basic example of statistical redundancy in image compression

Image files instruct screens to display pixels of varying colors which, taken as a whole, produce a complete image. Each of these “instructions” adds to the size of the file. But if the image has areas where the colors don’t change, instead of encoding the data in a highly redundant fashion (i.e. “blue pixel, blue pixel, blue pixel…” ad nauseam), it may be encoded as “400 blue pixels” when the file is compressed. This is known as run-length encoding.
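Run-length encoding is simple enough to sketch in a few lines. The example below is a toy version that treats a row of pixels as a list of color names; real image formats work on raw bytes, and the function names here are our own.

```python
from itertools import groupby

def rle_encode(pixels):
    """Collapse each run of identical values into a (count, value) pair."""
    return [(len(list(run)), value) for value, run in groupby(pixels)]

def rle_decode(pairs):
    """Expand (count, value) pairs back into the original sequence."""
    out = []
    for count, value in pairs:
        out.extend([value] * count)
    return out

row = ["blue"] * 400 + ["white"] * 3 + ["blue"] * 2
encoded = rle_encode(row)
print(encoded)  # [(400, 'blue'), (3, 'white'), (2, 'blue')]
print("round trip ok:", rle_decode(encoded) == row)
```

A row of 405 pixels shrinks to three pairs. The flip side is that an image with no repeated neighbors would actually grow under this scheme, which is why archivers combine several strategies.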

There are numerous other compression strategies which file archiving software uses singly or in tandem:

Probabilistic Models: Prediction-by-Partial Matching (PPM) algorithms rank the symbols contained within a file according to their probability of appearing, based on the sequences of letters (or other symbols) that precede them. Paired with an arithmetic coder, these predictions allow all of the data to be compressed into a single fraction. This fraction can in turn be decoded using the same probabilities to restore the original representation.
Grammar-based Codes: This technique creates a Context-Free Grammar (CFG) for each string of characters which needs to be compressed. This is very useful for extremely repetitive data (like multiple iterations of the same document), as one code can be applied to represent many strings.
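The prediction step behind probabilistic models can be sketched as follows. This is only the context-modeling half of the idea, using a single preceding character as context; a real PPM compressor blends several context lengths and feeds the probabilities to an arithmetic coder, neither of which is shown. The function names are our own.

```python
from collections import Counter, defaultdict

def build_order1_model(text):
    """Count which symbol follows each symbol in the text."""
    counts = defaultdict(Counter)
    for previous, following in zip(text, text[1:]):
        counts[previous][following] += 1
    return counts

def ranked_predictions(model, context):
    """Return (symbol, probability) pairs for a context, most likely first."""
    total = sum(model[context].values())
    return [(symbol, n / total) for symbol, n in model[context].most_common()]

model = build_order1_model("abababac")
print(ranked_predictions(model, "a"))  # [('b', 0.75), ('c', 0.25)]
```

The more confidently the model can predict the next symbol, the fewer bits the coder needs to spend on it.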

In the context of file archiving, these compressed files take up less storage space, at the cost of being less-easily accessible; files must be decompressed or extracted from their archival format to be viewed or edited. At higher data volumes this can be a time-consuming process, especially with more primitive file archiving solutions.


Common File Archiving Mistakes

Traditional file archiving, such as creating .zip folders, may reduce the size of the files, but it doesn’t offer much help in terms of file management. Zipped files are relatively cumbersome to access, and their contents won’t show up in most file searches, meaning you must remember what each .zip folder contains. This challenge becomes untenable as the amount of data being stored ramps up. This leads to the three most common archiving issues:

Disrupted Workflow: With conventional programs, archiving files takes them out of circulation. This can seriously disrupt the workflow of team members when files are no longer in their accustomed locations.
Dead Files: These are files that no longer serve any purpose but still hang around your server or hard drive, taking up space. Because the files are either buried deep within some sub-directory or contained within an unsearchable compressed folder, they often evade cursory attempts to keep things tidy. Eventually, even the soundest file management best practices wear down thanks to exigencies and human error, and you’re forced to either burn significant man-hours removing this digital detritus or eat the cost of paying extra for storage of files you don’t even want!
Redundant Files: The cousin, and progenitor, of the dead file, is the redundant file. On shared servers, in particular, collaborators will each often save an identical copy of the same file in their own workspaces, rather than all accessing the file from a single location. When it comes to, say, Word documents this is a minor sin, but when there are multiple huge and redundant files, it can add up. Doing a mass back-up of all of the files on the server just pours concrete on the problem, preserving these extraneous files till the server slags, the stars grow cold or some intrepid explorer excavates and deletes them manually. (Guess which of these scenarios is most likely.)
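Finding redundant files does not have to be a manual dig. One common approach, sketched below, is to hash the contents of every file and group the files whose hashes match. The folder and file names are invented for the demonstration, and this is a generic technique rather than a description of any particular product.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group files under root whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Reads the whole file at once; very large files would be
            # better hashed in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]

# Demonstration on a throwaway directory with hypothetical workspaces.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "alice").mkdir()
    (root / "bob").mkdir()
    (root / "alice" / "budget.xlsx").write_bytes(b"same bytes")
    (root / "bob" / "budget-copy.xlsx").write_bytes(b"same bytes")
    (root / "bob" / "notes.txt").write_bytes(b"different bytes")

    duplicate_groups = find_duplicates(root)
    duplicate_names = [[p.name for p in group] for group in duplicate_groups]

print(duplicate_names)  # [['budget.xlsx', 'budget-copy.xlsx']]
```

Because the comparison is based on contents rather than names, renamed copies are caught as well.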

Tips for Better File Management

Short of hiring a file management expert to maintain the organization and integrity of your archives, an impossibility for most small- to medium-sized companies, the most effective method is to acquire better tools. A top-quality file archiving solution can save you hundreds or even thousands per month by wiping out dead and redundant files and compressing the size of archived files by 40-50%. You’ll also notice an improvement in performance, with faster file loading and much quicker backups.

ShareArchiver stands out from its peers, thanks to its robust analytics and policy management tools.

File Archive Analytics give you the low-down on…

…how your files are actually being used: how often they are accessed (and when, and by whom)
…how much of your data is redundant / duplicated
…how much space you’ll save by compressing (before you make the decision)

Archive Policy Management: After you’ve read over your analytics, you’ll have a good idea of which types of files you want to archive. Policy management tools spare you from the arduous process of going through everything file by file. You can designate which files to archive by setting filters that will capture little-used or outdated files and archive them automatically, and even delete all of those pesky dead or redundant files with a keystroke.
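To give a feel for what an archiving filter does, here is a minimal sketch of one possible rule: select every file that has not been modified within a given number of days. It is not ShareArchiver's policy engine or API; the function name, the threshold and the sample files are all assumptions made for illustration.

```python
import os
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86400

def select_for_archive(root, max_idle_days, now=None):
    """Return files under root not modified within the last max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )

# Demonstration with one stale file and one fresh file (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_file = root / "2015-report.docx"
    new_file = root / "current-plan.docx"
    old_file.write_text("stale")
    new_file.write_text("active")

    # Backdate the old file's access and modification times by two years.
    two_years_ago = time.time() - 2 * 365 * SECONDS_PER_DAY
    os.utime(old_file, (two_years_ago, two_years_ago))

    selected = [p.name for p in select_for_archive(root, max_idle_days=365)]

print(selected)  # ['2015-report.docx']
```

A real policy would typically combine several conditions, such as file type, size and owner, but the principle of "describe the rule once, apply it everywhere" is the same.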

As an optional feature, the program automatically places “stubs” in the directories where your archived files once lived, meaning you don’t have to go hunting for any files you wind up needing down the line. All you have to do is click on the stub and the file is automatically retrieved and decompressed for viewing and editing as usual.
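The general idea of a stub can be sketched in a few lines. To be clear, this is a toy model under our own assumptions, not how ShareArchiver implements stubs: the `.stub.json` suffix, the JSON layout and both function names are hypothetical, and it ignores details such as name collisions in the archive folder.

```python
import gzip
import json
import shutil
import tempfile
from pathlib import Path

STUB_SUFFIX = ".stub.json"  # hypothetical naming convention

def archive_with_stub(path, archive_dir):
    """Compress a file into archive_dir and leave a small stub in its place."""
    path, archive_dir = Path(path), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / (path.name + ".gz")
    with open(path, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    stub = path.with_name(path.name + STUB_SUFFIX)
    stub.write_text(json.dumps({"archived_at": str(target)}))
    path.unlink()  # the full-size original no longer occupies the share
    return stub

def restore_from_stub(stub):
    """Follow a stub back to the archive and restore the original file."""
    stub = Path(stub)
    target = Path(json.loads(stub.read_text())["archived_at"])
    original = stub.with_name(stub.name[: -len(STUB_SUFFIX)])
    with gzip.open(target, "rb") as src, open(original, "wb") as dst:
        shutil.copyfileobj(src, dst)
    stub.unlink()
    return original

with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "shared" / "contract.txt"
    doc.parent.mkdir()
    doc.write_text("signed 2016")

    stub = archive_with_stub(doc, Path(tmp) / "archive")
    stub_left_behind = stub.exists() and not doc.exists()

    restored = restore_from_stub(stub)
    restored_text = restored.read_text()

print(stub_left_behind, restored_text)  # True signed 2016
```

The point of the pattern is that the pointer stays where users expect the file to be, while the bulky data lives somewhere cheaper.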

ShareArchiver’s stubs mean archiving is no longer a confusing, disruptive process. It is the end of the archive as a place information goes to die. Instead, the information remains exactly where you need it to be.

Looking to archive your company’s data? Try ShareArchiver free today!

[1] The more-advanced ZIP64 format raises this limit to 16 EiB (exbibytes).