How I created the Relation Definitions for my convoluted file naming scheme:

Started by GrantRobertson, June 14, 2020, 05:56:05 AM


GrantRobertson

First, I should warn you that this is mostly a case study about regular expressions. Second, I will only be explaining how I set up the Master Expression, Replacement Expression, and Link Expression on the Preferences ; File Relations ; Detection tab. I will document how I set up the "What to propagate:" field on the Versioning tab in a later post (after I figure all that mess out myself).

I asked Mario, and he told me that IMatch uses the Boost C++ libraries for regular expression processing. I dug through their documentation and found that the most pertinent information for us is here: https://www.boost.org/doc/libs/1_66_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html. This is a very terse summary of what regular expression metacharacters and features should be available in IMatch. My preferred source of information about regular expressions right now is a book. So, if you want more information about all the features of regular expressions, you will have to look elsewhere. I will only explain the features that I use in my Relation Definition. As I said, this is a case study.

In my research about regular expressions, I found a very good web-based regex tool at https://regexr.com. It allows you to enter a regular expression and some text to search. At the bottom, it then displays a complete explanation of every single bit of your regular expression. Plus, it allows you to share links to your saved regular expressions. That means I can include links and you can then explore my saved regular expressions on your own. You can even modify them and save them for your own future use.

So here we go.

Master Expression:
Here is the regular expression I used for my Master Expression:

   ^[\d~]{4}-[\d~]{2}-[\d~]{2}_[MFAO][DOQNSTPCVAG](?:\d{9}|r\d{5}[ns]\d\d)_[^_]{0,20}_00_!Origin\..{3,5}$

And here is the link: https://regexr.com/56dur

Here is the text. The lines that were selected (shown in red in the original post) are the eleven that contain _00_!Origin.

   2003-06-15_FD000000011__00_!Origin.JPG
   2003-06-15_FD000000012__00_!Origin.JPG
   2009-08-08_FD000000013_Sheri & Mason_00_!Origin.JPG
   2009-08-08_FD000000014_Noah & Mason_00_!Origin.JPG
   2012-05-19_MD000000017_New Motorcycle_00_!Origin.jpg
   2012-05-19_MD000000017_New Motorcycle_1A_P8x10.tiff
   2012-05-19_MD000000017_New Motorcycle_00_Screen.jpg
   2019-02-23_MD000000018__00_!Origin.dng
   2019-02-26_MD000000019__00_!Origin.dng
   2019-03-05_MD000000020__00_!Origin.ORF
   2019-03-15_MD000000021__00_!Origin.dng
   2020-04-18_FP000000015_Benjamin Sanders_1A_Web-M.GIF
   2020-04-18_FP000000016_Benjamin Sanders_00_!Origin.jpg
   1972-06-~~_MTr00017s12_Some Scanned Slide_00_!Origin.jpg
   AFile.JPG
   AFile_!Origin.JPG

Now, you really need to just go to the link above. Open it in a separate window and position it beside this window so you can refer back and forth.  I am not going to repeat all of the explanation that is already available at regexr.com. What I will do is point out some special things and why I chose to do them.

The first thing you will notice is that I use very detailed regular expressions. I do not want to take any chance that files in the future are mistaken for master files when they really aren't. So, I define every tiny part of my date, using the [\d~]{4}-[\d~]{2}-[\d~]{2} part of my expression. Notice the '~' character that is included with the digits metacharacter (\d) in the character sets/classes ([\d~]). This is because some of my file names, especially images scanned from old slides, prints, and negatives, will have the tilde character for parts of the date to indicate that I don't know that part of the date. This is explained in my File Name Scheme posted here: https://www.scribd.com/document/465143359/Grant-s-Photo-File-Naming-Scheme, and discussed in the post here: https://www.photools.com/community/index.php?topic=10384.0. Look at the third file name from the bottom (above). I know that slide was taken in June of 1972 but I don't know the exact day. So I used tildes for the day. The regular expression still matched the file name. Now try editing the regular expression and remove the tilde after the third \d, the one in the character class for the day. You will see that the 1972... image is no longer matched. (Removing the tilde after the first \d changes nothing for this file, because its year is all digits.)
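If you would rather try that experiment in a script than on regexr.com, here is a minimal sketch in Python. Python's re module is not the Boost engine that IMatch uses, but character classes and counted repeats should behave the same in both. The names DATE_WITH_TILDE and DATE_DIGITS_ONLY_DAY are made up for this example.

```python
import re

# Date part exactly as it appears in the Master Expression.
DATE_WITH_TILDE = re.compile(r"^[\d~]{4}-[\d~]{2}-[\d~]{2}_")
# Same thing, but with the tilde removed from the class that covers the day.
DATE_DIGITS_ONLY_DAY = re.compile(r"^[\d~]{4}-[\d~]{2}-[\d]{2}_")

known_day = "2003-06-15_FD000000011__00_!Origin.JPG"
unknown_day = "1972-06-~~_MTr00017s12_Some Scanned Slide_00_!Origin.jpg"

for name in (known_day, unknown_day):
    # Print the date, then whether each variant matches it.
    print(name[:10],
          bool(DATE_WITH_TILDE.match(name)),
          bool(DATE_DIGITS_ONLY_DAY.match(name)))
```

The 2003 file matches both variants. The 1972 file only matches the variant that still allows a tilde in the day.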

Now the next major chunk is the [MFAO][DOQNSTPCVAG] part. That is just two separate character classes (sometimes called character sets), one for each of my special code characters that come before my serial numbers. The first letter can be any one of M, F, A, or O. If that character is not one of those letters, then there will be no match. There are more options for the next letter, but it is still the same idea. Don't let the fact that there are lots of letters there fool you. These just stand in for two of the characters in the file name.
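A quick way to convince yourself that those two classes only stand in for two characters is a small Python sketch like this one (Python's re module, not IMatch's Boost engine, but character classes should work the same way). The test strings other than "FD" and "MT" are invented for the example.

```python
import re

# Two character classes in a row: exactly two characters, one from each set.
CODES = re.compile(r"[MFAO][DOQNSTPCVAG]")

for pair in ("FD", "MT", "XD", "Md"):
    # fullmatch() requires the whole string to be consumed by the pattern.
    print(pair, bool(CODES.fullmatch(pair)))
```

"FD" and "MT" match. "XD" fails on the first class and "Md" fails on the second, because the classes are case sensitive.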

The next chunk is a bit more complicated: (?:\d{9}|r\d{5}[ns]\d\d). It really just defines two separate options for how my serial numbers can be formatted. Notice how there is an opening and a closing parenthesis. That pair defines a group. Now, look for the vertical bar in there. That separates the two options available within that group. In regex lingo, that is called an "alternation" merely because it indicates more than one alternative way to make a match.

But, before I get to the options, I need to explain the ?:, right after the opening parenthesis. What a lot of people don't realize is that when a regex engine processes a regular expression, it also automatically creates "variables" (usually called capture groups) that hold whatever was matched within any set of parentheses. There are ways to make use of those "variables" within a regular expression, but I won't go into that here.  Just know that this takes extra processing time and memory. Putting the ?: there tells the regex engine to not bother to create those "variables." That way the expression is processed faster, which can make a difference if the expression has to be processed thousands of times over and over again, like when refreshing the relations for an entire database full of images. This is often called a "Non-Capturing Group."
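To see those "variables" appear and disappear, here is a small sketch in Python. It uses Python's re module rather than IMatch's Boost engine, but the (?: ... ) syntax means the same thing in both. The names capturing and non_capturing are just labels for this example.

```python
import re

# The serial number group, once with plain parentheses and once with ?: added.
capturing = re.compile(r"(\d{9}|r\d{5}[ns]\d\d)")
non_capturing = re.compile(r"(?:\d{9}|r\d{5}[ns]\d\d)")

serial = "r00017s12"

# .groups on the pattern is the number of capture groups it defines;
# .groups() on a match returns what those groups captured.
print(capturing.groups, capturing.fullmatch(serial).groups())
print(non_capturing.groups, non_capturing.fullmatch(serial).groups())
```

Both patterns match the same text. Only the first one stores a copy of what the group matched.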

OK, now, the first option is just that there could be nine digits, just as defined in my file format scheme document. This is what will be the case for the vast majority of files. So, I put that as the first alternative. When regular expressions are processed, the alternatives are tried from left to right and the first one that matches wins. The other alternatives are only looked at if the rest of the expression then fails to match and the engine has to back up. So, if there are nine digits, then the regex engine can just move on.

The second option is designed to match when I scanned a slide or negative. According to my file name scheme, that will be formatted as r#####s## or r#####n##. So, my regular expression looks for a lower case 'r' followed by five digits, followed by either an 's' or 'n', followed by two digits. I just used \d\d instead of \d{2} simply because the former is one character shorter. So that may speed up processing just a hair as well. (OK, not that you would likely ever notice.)
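Here is a minimal Python sketch of the two serial number formats (again Python's re module, not Boost, though alternation should behave the same). The last two test strings are deliberately wrong values invented for the example.

```python
import re

# Either nine digits, or r + five digits + s/n + two digits.
SERIAL = re.compile(r"(?:\d{9}|r\d{5}[ns]\d\d)")

for serial in ("000000017", "r00017s12", "r00017n03", "r00017x12", "00000017"):
    print(serial, bool(SERIAL.fullmatch(serial)))
```

The first three match. "r00017x12" fails because 'x' is neither 's' nor 'n', and "00000017" fails because it only has eight digits.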

The next chunk, the _[^_]{0,20}, matches from zero to 20 characters of anything except an underscore, for the description. The underscore is used as the separator character, so it cannot be used as part of the description. That is what that ^ does as the first character in the character class. It "negates" what is in the character class. Because I defined that character class as anything besides an underscore it will allow literally any character to be in the description. Technically, even tab characters, control characters, and newlines. But I know those will never be in file names, so I don't have to worry about those. When designing a regular expression, sometimes you have to keep in mind what could possibly actually show up as a possible match. In this case, those special characters can be ignored, thus making my expression simpler (so to speak).
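A small Python sketch of the description chunk, including the underscores on either side of it (Python's re module, not IMatch's engine; negated character classes should work the same way). The last two test strings are invented to show the failures.

```python
import re

# Underscore, then 0 to 20 characters that are not underscores, then underscore.
DESCRIPTION = re.compile(r"_[^_]{0,20}_")

samples = ("__", "_Sheri & Mason_", "_New_Motorcycle_", "_" + "x" * 21 + "_")
for chunk in samples:
    print(repr(chunk), bool(DESCRIPTION.fullmatch(chunk)))
```

An empty description and "Sheri & Mason" both match. A description containing an underscore, or one that is 21 characters long, does not.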

The next chunk, _00_!Origin, is just literal text. This is actually the main part that truly identifies my files as the "Master" file. Technically, I could have just looked for this and always assumed that any file with that in the name is treated as a master file. But I don't know if I will come up with some other scheme in the future where I want things to be treated differently. Plus, I'm a bit "extra" when it comes to stuff like this. So sue me. I'm not telling you you have to have such a detailed regular expression. I'm just showing you how to do so if you want to.

Finally, there is the file extension. I am allowing for 3 - 5 of any character after the period. Naturally the \. is an escaped '.', which would normally be a metacharacter but now just matches a regular period. But you knew that. Again, I don't really have to worry about special characters showing up in a file extension, so I can just use the '.' metacharacter to match anything.
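Putting all the chunks back together, the whole Master Expression can be run against the sample list with a short Python sketch. This is only an approximation of what IMatch does: it uses Python's re module instead of Boost, and the variable names are made up. The expression itself is copied unchanged, just split across two lines.

```python
import re

MASTER = re.compile(
    r"^[\d~]{4}-[\d~]{2}-[\d~]{2}_[MFAO][DOQNSTPCVAG]"
    r"(?:\d{9}|r\d{5}[ns]\d\d)_[^_]{0,20}_00_!Origin\..{3,5}$"
)

file_names = [
    "2003-06-15_FD000000011__00_!Origin.JPG",
    "2003-06-15_FD000000012__00_!Origin.JPG",
    "2009-08-08_FD000000013_Sheri & Mason_00_!Origin.JPG",
    "2009-08-08_FD000000014_Noah & Mason_00_!Origin.JPG",
    "2012-05-19_MD000000017_New Motorcycle_00_!Origin.jpg",
    "2012-05-19_MD000000017_New Motorcycle_1A_P8x10.tiff",
    "2012-05-19_MD000000017_New Motorcycle_00_Screen.jpg",
    "2019-02-23_MD000000018__00_!Origin.dng",
    "2019-02-26_MD000000019__00_!Origin.dng",
    "2019-03-05_MD000000020__00_!Origin.ORF",
    "2019-03-15_MD000000021__00_!Origin.dng",
    "2020-04-18_FP000000015_Benjamin Sanders_1A_Web-M.GIF",
    "2020-04-18_FP000000016_Benjamin Sanders_00_!Origin.jpg",
    "1972-06-~~_MTr00017s12_Some Scanned Slide_00_!Origin.jpg",
    "AFile.JPG",
    "AFile_!Origin.JPG",
]

# Keep only the names the Master Expression accepts.
masters = [name for name in file_names if MASTER.match(name)]
for name in masters:
    print(name)
```

Eleven of the sixteen names come out as masters. The P8x10, Screen, and Web-M versions and both AFile names are rejected.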

Notice how, using this file name scheme, I can define all of my Version Relations with just one Relation Definition. I don't need to create a separate Version relation for each and every different format that my cameras may produce. Let's say my camera produces a .RAW file and a .JPG file as a thumbnail. I can name the .RAW file "BaseName_00_!Origin.RAW" and my thumbnail "BaseName_00_Thumb.JPG" and it will work. If another camera uses .ORF and .TIF files as thumbnails, I just name the files "BaseName_00_!Origin.ORF" and "BaseName_00_Thumb.TIF" and away we go. Later I will write an explanation of how I create a separate relation definition for these thumbnail images. Most of the time I won't need separate thumbnail files, though, because the thumbnails that IMatch generates from my raw files look pretty darn good anyway. For two of my cameras, I simply delete their thumbnail .JPG files as part of my ingest process.

Continued...

GrantRobertson

... Continued from main post.

Replacement Expression:
Here is where things are a little dicey. There is actually a bug in the 2020.5.6 version of IMatch (and apparently many before it) where the parsing (which simply means figuring out what the characters you put there mean) of the replacement expression was wrong. It didn't work when there was a '/' as the first character, even though that is what the help file says to do. However, it does work if you leave off the first '/'.  But Mario has fixed that bug and the fix will be available in the next release. So, if you are using any version after 2020.5.6, then put in a '/' as the first character. Otherwise, leave it off. For now, I will describe it as it works in 2020.5.6.

All I am doing here is replacing the text "_00_!Origin" with nothing. So I put "_00_!Origin//" in the Replacement Expression. What that does is make everything up to that point be considered the "Base Name" of the master file and version files. Any files with the same name up to, but not including, the "_00_!Origin" part will be considered to be versions. If you read my File Name Scheme document, you know I use either that 00 or digit-letter pairs to indicate some position in a hierarchy of modified files. But all those files will be versions of the original file. So I want IMatch to treat them as such.
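The effect of that replacement can be imitated in a few lines of Python. This is only a sketch of the idea, not of IMatch's internals: the function name base_name is hypothetical, and dropping the extension before the replacement is an assumption made so the result lines up with the "Base Name" described above.

```python
import os
import re

def base_name(master_file_name: str) -> str:
    """Return the part of a master file name that its versions share."""
    # Assumption: the extension is dropped before the replacement is applied.
    stem, _extension = os.path.splitext(master_file_name)
    # Replace the literal text "_00_!Origin" with nothing.
    return re.sub(r"_00_!Origin", "", stem)

print(base_name("2012-05-19_MD000000017_New Motorcycle_00_!Origin.jpg"))
```

For the "New Motorcycle" master this leaves 2012-05-19_MD000000017_New Motorcycle, which is the start that the versions of that file have in common.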

Link Expression:
For the Link Expression, take a look at the saved regular expression at https://regexr.com/56duu. Notice that I had to manually insert the "Base Name" for one of my files at the first part of the expression because the regexr.com site has no idea what the {name} token is in IMatch.

Here is how it will appear in the Relation Definition dialog:

   ^{name}_(?!00_!Origin)(?:00|(?:[0-9][A-Z])+)_(?:[^_.]{1,7})\..{3,5}$

Now that I figured out how to get the Link Expression to work, I was able to use the {name} token as a stand in for the "Base Name" of my files. That base name is always terminated with an underscore, but I wanted that to be very apparent so I took it out using the Replacement Expression and am putting it back in here.

But what is that next part, the (?!00_!Origin) part? This is what is called a "Negative Lookahead" or "Negative Lookahead Group." It is the '?!', right after the opening parenthesis, that marks this as a Negative Lookahead. I've seen a lot of hard-to-understand explanations for how that works, so I'll try to explain it better here. The regex engine starts at whatever was the last matched character, the underscore. Then it "looks ahead" to see if it can match what is defined after the '?!', without using up any characters. If it does match whatever is there, then the regex engine decides that the whole thing is a NO MATCH and stops there. In IMatch, this means that file will not be considered to be a version file. If it does not match what is in the Negative Lookahead, then the engine starts back at that last previously matched character, the underscore, and continues to try to match using the rest of the expression. What this does is make darn sure that none of the master files are ever, EVER considered to be version files for something else. Just in case I screw up the naming of files in some idiotic way.
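To see what the lookahead protects against, here is a minimal Python sketch that applies the tail of the Link Expression (everything after {name}) with and without the lookahead. It uses Python's re module rather than Boost, though negative lookahead should behave the same in both. The names WITH_LOOKAHEAD and WITHOUT_LOOKAHEAD are made up for the example.

```python
import re

# Tail of the Link Expression, with the negative lookahead.
WITH_LOOKAHEAD = re.compile(
    r"^_(?!00_!Origin)(?:00|(?:[0-9][A-Z])+)_(?:[^_.]{1,7})\..{3,5}$"
)
# Same tail with the lookahead taken out.
WITHOUT_LOOKAHEAD = re.compile(
    r"^_(?:00|(?:[0-9][A-Z])+)_(?:[^_.]{1,7})\..{3,5}$"
)

for tail in ("_00_!Origin.jpg", "_00_Screen.jpg", "_1A_P8x10.tiff"):
    print(tail,
          bool(WITH_LOOKAHEAD.match(tail)),
          bool(WITHOUT_LOOKAHEAD.match(tail)))
```

Without the lookahead, "_00_!Origin.jpg" would pass as a version, because "!Origin" is a perfectly valid seven-character use code. With the lookahead it is rejected, while the two real version tails still match.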

The next part is (?:00|(?:[0-9][A-Z])+), which is a bit complicated as well. What you see is one of those "Non-Capturing Groups" with yet another of those "Non-Capturing Groups" inside it. Just ignore the ?: part for now and you will see it is nothing more than some nested sets of parentheses, just like you see all the time. You will also see that this is an "Alternation," meaning there are alternatives separated by a vertical bar.

So, the first alternative is just a "00" again. Wait, what? That is because it is possible that many times I will simply take the original, unmodified file and save out some different versions that are only different in that they were exported at a lower resolution for web or e-mail use. There have been no other modifications to the original, so there was no need to get into that complicated branching structure that I described in my File Name Scheme document. What this also does is allow you to see right away when some of the versions of a file can really just be deleted after they were used, because they can easily be recreated by running some export script again. Later, you can search for any versions that have _00_ in this part of the name and delete them without worrying about losing any real data. I'll try to write up an explanation for creating a filter for that later on.

Now, the next alternative looks for pairs of characters that are a digit followed by an upper case letter. There can be any number of those pairs as long as there is at least one pair. That is why I needed the nested set of parentheses. The pattern of digit-letter must be repeated, not just any digit or any letter in any order. So that '+' had to be outside of a group that included the digits character class followed by the letters character class, in that order.
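The difference between repeating the pair and repeating a single mixed class is easy to show with a small Python sketch (Python's re module, not IMatch's engine; grouping and '+' should work the same). The name LOOSE is a made-up label for the version without the nested group, and the test codes are invented.

```python
import re

# Repeat the whole digit-then-letter pair one or more times.
PAIRS = re.compile(r"(?:[0-9][A-Z])+")
# What you would get without the group: any mix of digits and letters.
LOOSE = re.compile(r"[0-9A-Z]+")

for code in ("1A", "1A2B", "A1", "12", "1AB"):
    print(code, bool(PAIRS.fullmatch(code)), bool(LOOSE.fullmatch(code)))
```

"1A" and "1A2B" match both patterns. "A1", "12", and "1AB" only match the loose one, which is exactly why the '+' has to sit outside the nested group.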

Next comes my use code. It can include any character except an underscore or a period. Notice how the period does not need to be escaped when it is inside a character class. This is part of what keeps confusing the heck out of people. Characters mean different things when they are inside a character class, and sometimes, even when they are at the front or end of a character class. But I won't get into all that here. My use code can be from one to seven characters.
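Here is a short Python sketch of the use code class, showing that the unescaped period inside the brackets is treated as a literal period (Python's re module, not Boost, but this rule should be the same in both). The test codes after "Web-M" are invented for the example.

```python
import re

# One to seven characters, none of which may be an underscore or a period.
USE_CODE = re.compile(r"[^_.]{1,7}")
# Inside a character class, a bare period only matches a period.
LITERAL_DOT = re.compile(r"[.]")

for code in ("Screen", "P8x10", "Web-M", "a.b", "a_b", "", "TooLong8"):
    print(repr(code), bool(USE_CODE.fullmatch(code)))

print(bool(LITERAL_DOT.fullmatch(".")), bool(LITERAL_DOT.fullmatch("x")))
```

The three real use codes match. A code with a period or underscore in it, an empty code, and an eight-character code are all rejected.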

And then the extension again. Because I may save files using different file formats, the file extension can be completely different from that of the master file.

Looking back at the regexr.com site, you can see that the expression matched two files as versions of the original "New Motorcycle" file.
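The same result can be reproduced with a minimal Python sketch. As on regexr.com, the base name has to be inserted by hand, because only IMatch knows what {name} means. Swapping in an escaped copy of the base name with re.escape is an assumption about how to imitate the token, not a description of what IMatch does internally, and Python's re module stands in for Boost.

```python
import re

LINK_TEMPLATE = r"^{name}_(?!00_!Origin)(?:00|(?:[0-9][A-Z])+)_(?:[^_.]{1,7})\..{3,5}$"

# Base name of the "New Motorcycle" master, inserted by hand.
base = "2012-05-19_MD000000017_New Motorcycle"
# Assumption: {name} is replaced by the base name taken literally.
link = re.compile(LINK_TEMPLATE.replace("{name}", re.escape(base)))

file_names = [
    "2012-05-19_MD000000017_New Motorcycle_00_!Origin.jpg",
    "2012-05-19_MD000000017_New Motorcycle_1A_P8x10.tiff",
    "2012-05-19_MD000000017_New Motorcycle_00_Screen.jpg",
    "2020-04-18_FP000000015_Benjamin Sanders_1A_Web-M.GIF",
    "2020-04-18_FP000000016_Benjamin Sanders_00_!Origin.jpg",
    "AFile.JPG",
]

# Keep only the names the Link Expression accepts as versions.
versions = [name for name in file_names if link.match(name)]
for name in versions:
    print(name)
```

Only the P8x10 and Screen files come out as versions. The master itself is kept out by the negative lookahead, and the other files do not start with the right base name.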


Alrighty then! Are you confused enough yet? I'm hoping that this case study, in the context of something you already know and care about, will help you understand regular expressions a bit better. I know, it took me weeks of reading almost a dozen books and websites to understand enough to pull all this together. And I was a network manager and a computer science major (with a 4.0 in my major). Of course, half the time, reading the book was so boring that I literally had to take a nap after reading for an hour or two. One day I actually had to take two naps. Naps are good.