Generating File Lists

A comparison requires two listings: the Inventory Listing and the Remote Listing.

Inventory Listing

Dynamo Consistency interacts with Dynamo at only two points during a check. First, it gets a listing of what should be at a site. Second, at the end of the check, it reports the results.

The inventory is queried before the site is listed remotely because of possible race conditions. It is not uncommon for a site listing to take multiple days. In the meantime, two things can change in the inventory: a file can be deleted from a site, or it can be added to a site. Files added during the listing are ignored by setting IgnoreAge in the Configuration to a large enough value. Files that are deleted during the remote listing are filtered out by checking deletion requests.
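The two filters described above can be sketched as follows. This is only an illustration, not the package's implementation: the function name, its arguments, and the assumption that IgnoreAge is expressed in days are all hypothetical.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def filter_race_conditions(missing, orphans, deletion_requests,
                           ignore_age_days, now=None):
    """Drop comparison results that are explained by inventory changes.

    All names here are hypothetical and chosen for illustration only.

    missing           -- file names in the inventory but not seen at the site
    orphans           -- dict of {file name: mtime} seen at the site but not
                         in the inventory
    deletion_requests -- set of file names with a deletion request
    ignore_age_days   -- stand-in for IgnoreAge (assumed to be in days)
    """
    now = time.time() if now is None else now
    cutoff = now - ignore_age_days * SECONDS_PER_DAY

    # A file deleted while the site was being listed is not really missing.
    real_missing = [name for name in missing
                    if name not in deletion_requests]

    # A file newer than the cutoff may have been added after the
    # inventory was queried, so it is not reported as an orphan.
    real_orphans = [name for name, mtime in orphans.items()
                    if mtime < cutoff]

    return real_missing, real_orphans
```

IgnoreAge has to cover at least the duration of the remote listing, otherwise files added while the listing was running would still be reported.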

There are currently multiple ways to get the site contents from Dynamo. One is to directly access the MySQL database used for Dynamo storage. This will work as long as the schema does not change. A more reliable way to keep up with major changes in Dynamo is to use the Dynamo inventory object. This method is less optimized when working with the MySQL storage plugin, but it will work for different schemas and for any storage types that are added in the future.

The type of inventory lister is selected via command line options, or by setting dynamo_consistency.opt.V1 to True or False before importing any modules that rely on the backend. By implementing the three modules inventory, registry, and siteinfo, described in Back End Requirements, any other method of communicating with an inventory can be added.

After selecting the backend, the inventory can be listed transparently using the method shown in Introduction:

from dynamo_consistency import inventorylister

listing = inventorylister.listing(sitename)

Here, listing is a dynamo_consistency.datatypes.DirectoryInfo object. DirectoryInfo contains metadata about a directory, such as its modification timestamp and name. It also holds a list of subdirectories, in the form of DirectoryInfo objects, and a list of files. The files are represented as dictionaries containing the name, size, and modification time of the file. Each file and DirectoryInfo also stores a hash of its metadata. The DirectoryInfo hash also includes information from the object’s files and subdirectories, so two trees with equal hashes do not need to be compared any further. This speeds up the file tree comparison, described in Comparison Algorithm.
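The idea of a directory hash that folds in the hashes of its files and subdirectories can be sketched with a much simplified stand-in class. DirNode below is hypothetical and is not the real DirectoryInfo. Which metadata fields enter the hash is also an assumption made for this example.

```python
import hashlib


def file_hash(info):
    """Hash the metadata of one file dictionary (name and size here)."""
    text = '{0} {1}'.format(info['name'], info['size'])
    return hashlib.sha1(text.encode('utf-8')).hexdigest()


class DirNode(object):
    """Hypothetical, simplified stand-in for DirectoryInfo."""

    def __init__(self, name, mtime=0, files=None, directories=None):
        self.name = name
        self.mtime = mtime
        self.files = files or []              # dicts with name, size, mtime
        self.directories = directories or []  # DirNode objects

    def hash(self):
        """Return a hash covering this directory and everything below it."""
        hasher = hashlib.sha1(self.name.encode('utf-8'))
        # Sort the contents so that listing order does not change the hash.
        for info in sorted(self.files, key=lambda f: f['name']):
            hasher.update(file_hash(info).encode('utf-8'))
        for sub in sorted(self.directories, key=lambda d: d.name):
            hasher.update(sub.hash().encode('utf-8'))
        return hasher.hexdigest()
```

With such a scheme, a comparison can stop descending as soon as two nodes have the same hash, and only needs to walk into subtrees where the hashes differ.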

Remote Listing

The remote listing is equally flexible. The factory function dynamo_consistency.backend.get_listers() reads the Configuration file to determine the type of lister for a site. There are currently three different classes implemented, and more can be added by extending the dynamo_consistency.backend.listers.Lister class and implementing its ls_directory method. The three current listers are the following:

  • dynamo_consistency.backend.listers.XRootDLister - This listing object uses the XRootD Python module to connect to and query each site.
  • dynamo_consistency.backend.listers.GFalLister - This listing object uses the gfal-ls command line tool to list remote sites.
  • dynamo_consistency.backend.listers.XRootDLister - This listing object opens a subshell using the xrdfs command line tool and queries the remote site.
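As a rough illustration of adding another lister, the sketch below lists a local file system. The Lister base class defined here is only a stand-in for dynamo_consistency.backend.listers.Lister, and the signature and return value of ls_directory (a success flag, a list of directories, and a list of files) are assumptions; the real interface should be checked before writing a subclass.

```python
import os


class Lister(object):
    """Stand-in for the real base class; its actual interface may differ."""

    def ls_directory(self, path):
        raise NotImplementedError


class LocalLister(Lister):
    """Hypothetical lister that reads a locally mounted file system."""

    def ls_directory(self, path):
        """Return (okay, directories, files) for one directory.

        directories -- list of (name, mtime) tuples
        files       -- list of (name, size, mtime) tuples
        """
        directories, files = [], []
        try:
            for name in sorted(os.listdir(path)):
                full = os.path.join(path, name)
                stat = os.stat(full)
                if os.path.isdir(full):
                    directories.append((name, int(stat.st_mtime)))
                else:
                    files.append((name, stat.st_size, int(stat.st_mtime)))
        except OSError:
            # Report the failure so that the caller can try again.
            return False, [], []
        return True, directories, files
```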

Once the type of lister is set in the Configuration, the contents of the remote site can be listed transparently:

from dynamo_consistency import remotelister

listing = remotelister.listing(sitename)

This takes much longer than the Inventory Listing, since every directory of the site needs to be queried. The layer between the listing class and the final output creates multiple connections and works on two queues with multiple threads. The input queue holds the directories that still need to be listed, and the output queue holds the result of each directory listed so far. The workflow of each queue is shown below.

[Figure: flowchart of the listing workflow. The starting directory is added to the listing queue. For each entry in the queue, a directory is listed. If the listing was not successful, it is tried again. If it was successful, the name of the directory and its lists of subdirectories and files are placed in the output, and the flow returns to the queue. The master gets each output and adds the directories and files.]

Listing algorithm.
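The two-queue workflow can be sketched with the standard queue and threading modules. This is a minimal illustration rather than the package's implementation: list_tree, its parameters, and the (okay, directories, files) return value of the ls_directory callable are all assumptions, and the results are collected only after the workers finish instead of while they run.

```python
import queue
import threading


def list_tree(start, ls_directory, num_threads=4, max_retries=3):
    """List a directory tree using worker threads and two queues.

    ls_directory(path) is assumed to return (okay, directories, files),
    where directories and files are lists of names.
    """
    in_queue = queue.Queue()    # directories that still need to be listed
    out_queue = queue.Queue()   # results of directories listed so far

    def worker():
        while True:
            path = in_queue.get()
            try:
                if path is None:          # sentinel: no more work
                    return
                okay, dirs, files = False, [], []
                for _ in range(max_retries):
                    okay, dirs, files = ls_directory(path)
                    if okay:
                        break             # otherwise, try again
                if okay:
                    # Queue the subdirectories before marking this task
                    # done, so in_queue.join() cannot return too early.
                    for name in dirs:
                        in_queue.put(path + '/' + name)
                out_queue.put((path, okay, dirs, files))
            finally:
                in_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()

    in_queue.put(start)         # add the starting directory
    in_queue.join()             # wait until every directory is processed
    for _ in threads:
        in_queue.put(None)
    for thread in threads:
        thread.join()

    # The master collects the name, directories, and files of each result.
    tree = {}
    while not out_queue.empty():
        path, okay, dirs, files = out_queue.get()
        tree[path] = {'ok': okay, 'directories': dirs, 'files': files}
    return tree
```

Passing the listing function in as an argument keeps the threading logic independent of the protocol used to reach the site.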