First, I should warn you that this is mostly a case study about regular expressions. Second, I will only be explaining how I set up the Master Expression, Replacement Expression, and Link Expression on the Preferences > File Relations > Detection tab. I will document how I set up the "What to propagate:" field on the Versioning tab in a later post (after I figure all that mess out myself).
I asked Mario, and he told me that IMatch uses the Boost C++ libraries for regular expression processing. I dug through their documentation and found that the most pertinent information for us is here:
https://www.boost.org/doc/libs/1_66_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html. This is a very terse summary of what regular expression metacharacters and features should be available in IMatch. My preferred source of information about regular expressions right now is a book. So, if you want more information about all the features of regular expressions, you will have to look elsewhere. I will only explain the features that I use in my Relation Definition. As I said, this is a case study.
In my research about regular expressions, I found a very good web-based regex tool at
https://regexr.com. It allows you to enter a regular expression and a set of text to search. At the bottom, it then displays a complete explanation of every single bit of your regular expression. Plus it allows you to share links to your saved regular expressions. That means I can include links and you can then explore my saved regular expressions on your own. You can even modify them and save them for your own future use.
So here we go.
Master Expression:
Here is the regular expression I used for my Master Expression:
^[\d~]{4}-[\d~]{2}-[\d~]{2}_[MFAO][DOQNSTPCVAG](?:\d{9}|r\d{5}[ns]\d\d)_[^_]{0,20}_00_!Origin\..{3,5}$
And here is the link:
https://regexr.com/56dur
Here is the text. The lines that were selected are in red here.
2003-06-15_FD000000011__00_!Origin.JPG
2003-06-15_FD000000012__00_!Origin.JPG
2009-08-08_FD000000013_Sheri & Mason_00_!Origin.JPG
2009-08-08_FD000000014_Noah & Mason_00_!Origin.JPG
2012-05-19_MD000000017_New Motorcycle_00_!Origin.jpg
2012-05-19_MD000000017_New Motorcycle_1A_P8x10.tiff
2012-05-19_MD000000017_New Motorcycle_00_Screen.jpg
2019-02-23_MD000000018__00_!Origin.dng
2019-02-26_MD000000019__00_!Origin.dng
2019-03-05_MD000000020__00_!Origin.ORF
2019-03-15_MD000000021__00_!Origin.dng
2020-04-18_FP000000015_Benjamin Sanders_1A_Web-M.GIF
2020-04-18_FP000000016_Benjamin Sanders_00_!Origin.jpg
1972-06-~~_MTr00017s12_Some Scanned Slide_00_!Origin.jpg
AFile.JPG
AFile_!Origin.JPG
Now, you really need to just go to the link above. Open it in a separate window and position it beside this window so you can refer back and forth. I am not going to repeat all of the explanation that is already available at regexr.com. What I will do is point out some special things and why I chose to do them.
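If you would rather check the matches on your own machine, here is a minimal sketch in Python. It is only an approximation of what IMatch does: IMatch uses Boost, not Python, and I am assuming that Python's re module treats the constructs in this particular expression the same way. The names MASTER, SAMPLES, masters, and others are just mine for this example, and SAMPLES is a subset of the file names listed above.

```python
import re

# The Master Expression from this post, split over two string
# literals only to keep the line short.
MASTER = re.compile(
    r"^[\d~]{4}-[\d~]{2}-[\d~]{2}_[MFAO][DOQNSTPCVAG]"
    r"(?:\d{9}|r\d{5}[ns]\d\d)_[^_]{0,20}_00_!Origin\..{3,5}$"
)

# A subset of the sample file names from the list above.
SAMPLES = [
    "2003-06-15_FD000000011__00_!Origin.JPG",
    "2012-05-19_MD000000017_New Motorcycle_00_!Origin.jpg",
    "2012-05-19_MD000000017_New Motorcycle_1A_P8x10.tiff",
    "2012-05-19_MD000000017_New Motorcycle_00_Screen.jpg",
    "2019-03-05_MD000000020__00_!Origin.ORF",
    "1972-06-~~_MTr00017s12_Some Scanned Slide_00_!Origin.jpg",
    "AFile.JPG",
    "AFile_!Origin.JPG",
]

# Split the names into those the expression accepts as masters
# and those it rejects.
masters = [name for name in SAMPLES if MASTER.match(name)]
others = [name for name in SAMPLES if not MASTER.match(name)]

for name in masters:
    print("master:", name)
for name in others:
    print("other: ", name)
```

Only the names that follow the whole scheme and end in _00_!Origin plus an extension land in masters. AFile_!Origin.JPG does not, even though it contains "!Origin".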
The first thing you will notice is that I use very detailed regular expressions. I do not want to take any chance that future files might be mistaken for master files when they really aren't. So, I define every tiny part of my date, using the [\d~]{4}-[\d~]{2}-[\d~]{2} part of my expression. Notice the '~' character that is included with the digit metacharacter (\d) in the character sets/classes ([\d~]). This is because some of my file names, especially images scanned from old slides, prints, and negatives, will have the tilde character for parts of the date to indicate that I don't know that part of the date. This is explained in my File Name Scheme posted here:
https://www.scribd.com/document/465143359/Grant-s-Photo-File-Naming-Scheme, and discussed in the post here:
https://www.photools.com/community/index.php?topic=10384.0. Look at the third file name from the bottom (above). I know that slide was taken in June of 1972 but I don't know the exact day. So I used tildes for the day. The regular expression still matched the file name. Now try editing the regular expression and remove the tilde from the third character class (the one for the day, since that is where this file name has its tildes). You will see that the 1972… image is no longer matched.
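Here is a small Python sketch of just the date part, again assuming Python's re module behaves like Boost for these constructs. DATE is the date portion of my expression on its own, and DATE_NO_DAY_TILDE is a hypothetical variant with the tilde dropped from the day's character class. The test dates are made up for illustration.

```python
import re

# Date portion of the Master Expression, anchored so it must match
# the whole string.
DATE = re.compile(r"^[\d~]{4}-[\d~]{2}-[\d~]{2}$")

# Hypothetical variant: same thing, but the day may only be digits.
DATE_NO_DAY_TILDE = re.compile(r"^[\d~]{4}-[\d~]{2}-[\d]{2}$")

dates = ["2012-05-19", "1972-06-~~", "19~~-~~-~~", "1972-6-15"]

# Map each test date to True/False for each pattern.
with_tilde = {d: bool(DATE.match(d)) for d in dates}
without_day_tilde = {d: bool(DATE_NO_DAY_TILDE.match(d)) for d in dates}

for d in dates:
    print(d, with_tilde[d], without_day_tilde[d])
```

The date with an unknown day only matches while the day's character class still contains the tilde. The last date fails both patterns because the month has only one character.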
Now the next major chunk is the [MFAO][DOQNSTPCVAG] part. That is just two separate character classes (sometimes called character sets), one for each of my special code characters that come before my serial numbers. The first letter can be any one of M, F, A, or O. If that character is not one of those letters, then there will be no match. There are more options for the next letter, but it is still the same idea. Don't let the fact that there are lots of letters there fool you. These just stand in for two of the characters in the file name.
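To see that the two classes really do stand for exactly two characters, here is a tiny Python sketch (same assumption as before: Python's re stands in for Boost). The test codes are invented for the example.

```python
import re

# Two character classes, one per code letter, anchored to the
# whole string so nothing else is allowed.
CODE = re.compile(r"^[MFAO][DOQNSTPCVAG]$")

codes = ["FD", "MT", "OG", "XD", "MZ", "fd"]

# True where the two-letter code is allowed by both classes.
code_results = {c: bool(CODE.match(c)) for c in codes}

for c, ok in code_results.items():
    print(c, ok)
```

"XD" fails on the first class, "MZ" fails on the second, and "fd" fails because the classes only list upper case letters.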
The next chunk is a bit more complicated: (?:\d{9}|r\d{5}[ns]\d\d). It really just defines two separate options for how my serial numbers can be formatted. Notice how there is an opening and a closing parenthesis. That pair defines a group. Now, look for the vertical bar in there. That separates the two options available within that group. In regex lingo, that is called an "alternation" merely because it indicates more than one alternative way to make a match.
But, before I get to the options, I need to explain the ?: right after the opening parenthesis. What a lot of people don't realize is that when a regular expression engine processes a regular expression, it also automatically creates "variables" that hold whatever was matched within any set of parentheses. (The usual names for these are "capture groups" or "backreferences.") There are ways to make use of those "variables" within a regular expression, but I won't go into that here. Just know that this takes extra processing time and memory. Putting the ?: there tells the regex engine to not bother to create those "variables." That way the expression is processed faster, which can make a lot of difference if the expression has to be processed thousands of times over and over again, like when refreshing the relations for an entire database full of images. This is often called a "Non-Capturing Group."
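A minimal Python sketch of the difference, assuming Python's re handles (?:...) the same way Boost does. The two patterns are cut down to the code letters and serial number, and the names capturing and non_capturing are mine. The only difference between them is the ?: at the start of the group.

```python
import re

# Same fragment of the expression, once with a plain (capturing)
# group and once with a non-capturing group.
capturing = re.compile(r"_[MFAO][DOQNSTPCVAG](\d{9}|r\d{5}[ns]\d\d)_")
non_capturing = re.compile(r"_[MFAO][DOQNSTPCVAG](?:\d{9}|r\d{5}[ns]\d\d)_")

name = "2012-05-19_MD000000017_New Motorcycle_00_!Origin.jpg"

m1 = capturing.search(name)
m2 = non_capturing.search(name)

# Both patterns match the same text...
print(m1.group(0), m2.group(0))
# ...but only the capturing version stores the serial number.
print(m1.groups(), m2.groups())
```

Both versions match the same piece of the file name. The non-capturing one simply has nothing saved in groups().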
OK, now, the first option, is just that there could be nine digits, just as defined in my file format scheme document. This is what will be the case for the vast majority of files. So, I put that as the first alternative. When regular expressions are processed, the first alternative that matches wins, and none of the other alternatives are even looked at. So, if there are nine digits, then the regex engine can just move on.
The second option is designed to match when I scanned a slide or negative. According to my file name scheme, that will be formatted as r#####s## or r#####n##. So, my regular expression looks for a lower case 'r' followed by five digits, followed by either an 's' or 'n', followed by two digits. I just used \d\d instead of \d{2} simply because the former is one character shorter. So that speeds up processing just a hair as well. (OK, not that you would likely ever notice.)
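Here is a short Python sketch of the serial number group on its own, assuming Python's re behaves like Boost here. SERIAL is the group from my expression and SERIAL_BRACES is the same thing written with \d{2}. The candidate strings are invented for the example.

```python
import re

# The serial number alternation, anchored to the whole string.
SERIAL = re.compile(r"^(?:\d{9}|r\d{5}[ns]\d\d)$")

# Same alternation, with \d{2} in place of \d\d.
SERIAL_BRACES = re.compile(r"^(?:\d{9}|r\d{5}[ns]\d{2})$")

candidates = [
    "000000017",  # nine digits
    "r00017s12",  # scanned slide
    "r00017n03",  # scanned negative
    "r00017x12",  # 'x' is neither 's' nor 'n'
    "00000017",   # only eight digits
    "R00017s12",  # upper case 'R'
]

serial_results = [bool(SERIAL.match(c)) for c in candidates]
braces_results = [bool(SERIAL_BRACES.match(c)) for c in candidates]

for c, ok in zip(candidates, serial_results):
    print(c, ok)
```

Both spellings accept and reject exactly the same candidates, so choosing \d\d over \d{2} is purely a matter of taste.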
The next chunk, the _[^_]{0,20}, matches from zero to 20 characters of anything except an underscore, for the description. The underscore is used as the separator character, so it cannot be used as part of the description. That is what that ^ does as the first character in the character class. It "negates" what is in the character class. Because I defined that character class as anything besides an underscore, it will allow literally any other character to be in the description. Technically, even tab characters, control characters, and newlines. But I know those will never be in file names, so I don't have to worry about them. When designing a regular expression, you sometimes have to keep in mind what could realistically show up in the text you are matching. In this case, those special characters can be ignored, thus making my expression simpler (so to speak).
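A quick Python sketch of the description chunk, assuming Python's re treats a negated class the same way Boost does. I added the trailing underscore and the anchors so that the chunk can be tested by itself. The descriptions are made up.

```python
import re

# Separator, 0 to 20 non-underscore characters, separator.
DESCRIPTION = re.compile(r"^_[^_]{0,20}_$")

tests = {
    "empty": "__",
    "plain": "_New Motorcycle_",
    "ampersand": "_Sheri & Mason_",
    "tab": "_Tab\there_",            # a real tab character is allowed
    "underscore": "_New_Motorcycle_",  # separator inside the description
    "too_long": "_" + "x" * 21 + "_",  # 21 characters, one too many
}

description_results = {k: bool(DESCRIPTION.match(v)) for k, v in tests.items()}

for key, ok in description_results.items():
    print(key, ok)
```

The tab case shows what I mean by "literally any character": the class only rules out the underscore, nothing else.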
The next chunk, _00_!Origin, is just literal text. This is actually the main part that truly identifies my files as the "Master" file. Technically, I could have just looked for this and always assumed that any file with that in the name is treated as a master file. But I don't know if I will come up with some other scheme in the future where I want things to be treated differently. Plus, I'm a bit "extra" when it comes to stuff like this. So sue me. I'm not telling you you have to have such a detailed regular expression. I'm just showing you how to do so if you want to.
Finally, there is the file extension. I am allowing for 3 to 5 of any character after the period. Naturally the \. is an escaped '.', which would normally be a metacharacter but now just matches a regular period. But you knew that. Again, I don't really have to worry about special characters showing up in a file extension, so I can just use the '.' metacharacter to match anything.
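A small Python sketch of the tail end of the expression, assuming Python's re behaves like Boost for these constructs. ESCAPED is the end of my expression as written. UNESCAPED is a hypothetical variant with the backslash left off, to show what the escape buys you. The test strings are invented.

```python
import re

# Literal period, then 3 to 5 of any character, then end of name.
ESCAPED = re.compile(r"_!Origin\..{3,5}$")

# Hypothetical variant: the first '.' is now "any character".
UNESCAPED = re.compile(r"_!Origin..{3,5}$")

tails = [
    "_!Origin.jpg",
    "_!Origin.tiff",
    "_!Origin.jp",        # extension too short
    "_!Origin.jpeg2000",  # extension too long
    "_!OriginXjpg",       # no period at all
]

escaped_hits = [t for t in tails if ESCAPED.search(t)]
unescaped_hits = [t for t in tails if UNESCAPED.search(t)]

print(escaped_hits)
print(unescaped_hits)
```

Without the backslash, the name with no period at all slips through, because the first '.' happily matches the 'X'.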
Notice how, using this file name scheme, I can define all of my Version Relations with just one Relation Definition. I don't need to create a separate Version relation for each and every different format that my cameras may produce. Let's say my camera produces a .RAW file and a .JPG file as a thumbnail. I can name the .RAW file "BaseName_00_!Origin.RAW" and my thumbnail "BaseName_00_Thumb.JPG" and it will work. If another camera uses .ORF files and .TIF files as thumbnails, I just name the files "BaseName_00_!Origin.ORF" and "BaseName_00_Thumb.TIF" and away we go. Later I will write an explanation of how I create a separate relation definition for these thumbnail images. Most of the time, though, I won't need separate thumbnail files, because the thumbnails that IMatch generates from my raw files look pretty darn good anyway. For two of my cameras, I simply delete their thumbnail .JPG files as part of my ingest process.
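Here is a rough Python sketch of that idea, with the usual caveat that IMatch uses Boost and I am assuming Python's re agrees with it for this expression. BASE is a made-up BaseName that follows my scheme, and the four file names are hypothetical.

```python
import re

# The Master Expression from this post.
MASTER = re.compile(
    r"^[\d~]{4}-[\d~]{2}-[\d~]{2}_[MFAO][DOQNSTPCVAG]"
    r"(?:\d{9}|r\d{5}[ns]\d\d)_[^_]{0,20}_00_!Origin\..{3,5}$"
)

# Hypothetical BaseName: date, code letters, serial, description.
BASE = "2019-03-05_MD000000020_Test Shot"

files = [
    BASE + "_00_!Origin.RAW",
    BASE + "_00_Thumb.JPG",
    BASE + "_00_!Origin.ORF",
    BASE + "_00_Thumb.TIF",
]

# One expression picks out the masters, whatever the extension.
format_masters = [f for f in files if MASTER.match(f)]

for f in files:
    print("master" if f in format_masters else "other ", f)
```

The one expression picks out both master files regardless of extension, and leaves both thumbnails alone.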
Continued...