I just finished writing a utility to export a folder hierarchy of files from my existing custom extranet to a SharePoint document library. The custom extranet was database-driven and allowed the user to name a file or folder whatever he or she wished up to a maximum of 500 characters. When I wrote this extranet 6 years ago in classic ASP, I'd just HTML encode whatever name the user wished and store it in the database. Whenever a folder or file was retrieved, it was always by using the ever-so-not-user-friendly URL parameter "id=".
I already knew I would need to remove the characters SharePoint does not allow from my folder and file names. Furthermore, SharePoint's document libraries actually display the full folder path in the URL, which means I'll need to be concerned about the total path length.
My migration plan was to build a physical folder hierarchy for staging the files, then use WebDAV (SharePoint's explorer view for document libraries) for importing the hierarchy into SharePoint within Windows. This method will allow me to keep the utility focused on a simpler task than actually importing the files into SharePoint and make sure I don't have to worry about server timeouts.
Naming restrictions
SharePoint has naming restrictions for sites, groups, folders and files. Since I'm only interested in folders and files, only the following restrictions will be considered.
- Invalid characters:
\ / : * ? " ' < > | # { } % ~ &
- You cannot use the period character consecutively in the middle of the name
- You cannot use the period character as the first or the last character
Someone already familiar with this topic will notice that I added the apostrophe to the official restricted character list. During my own testing, SharePoint complained when I uploaded a file with an apostrophe, so I added it to the list.
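The rules above can be expressed as a small check. This is only a sketch: the class and method names are hypothetical, the character list is the one given above (including the apostrophe I added), and treating an empty name as invalid is my own assumption.

```csharp
using System;

public static class SharePointNameRules
{
    // Invalid characters from the list above, including the apostrophe.
    private const string InvalidChars = "\\/:*?\"'<>|#{}%~&";

    // Returns true when the name breaks none of the three naming rules.
    public static bool IsValidName(string name)
    {
        // Assumption: an empty name is treated as invalid.
        if (string.IsNullOrEmpty(name)) return false;

        // Rule 1: no invalid characters anywhere in the name.
        if (name.IndexOfAny(InvalidChars.ToCharArray()) >= 0) return false;

        // Rule 2: no consecutive periods.
        if (name.Contains("..")) return false;

        // Rule 3: no period as the first or last character.
        if (name[0] == '.' || name[name.Length - 1] == '.') return false;

        return true;
    }
}
```

A check like this is handy for reporting which names will change before anything is actually renamed.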
Length restrictions
Besides naming restrictions, SharePoint also has the following length restrictions (from KB 894630).
- A file or a folder name cannot be longer than 128 characters.
- The total URL length cannot be longer than 260 characters.
128 character limit for folders and files
Regarding the 128 character limit, you can't use SharePoint's UI to get to this limit. The text box's maxlength property is set to 123 for both folders and files. I don't have any inside sources, but my guess is that the SharePoint team did this to make sure the total file name would not exceed 128 characters if the extension was 4 characters (as is the case with Office 2007 file formats like docx and xlsx). The odd thing is that the folder text box is limited to 123 characters as well. However, if you put the document library into Explorer view, you can rename a folder to allow the full 128 characters. I bet there's some reuse going on between the data entry screens for the file and the folder in this case (also something a programmer on the SharePoint team might want to do).
260 character limit for URLs
I've done some WebDAV importing to this particular SharePoint farm in the past, and I'm pretty sure I ran into paths close to the 260 character limit, so I investigated this. I found several instances where the total URL exceeded 260 characters.
KB 894630 mentioned above also says:
To determine the length of a URL, .... convert the string in the URL to Universal Character Set (UCS) Transformation Format 16 (UTF-16) format, and then count the number of 16-bit characters in the string.
However, it should probably say something like "decode the URL first, then count the characters" to make it easier to understand. I created a folder hierarchy to test out the 260 character limit. Following is a URL (notice the %20 space codes) to a test file copied from the address bar of the browser. When the URL is encoded, it contains 346 characters.
http://intranet.xyzco.com/sites/Testing/Documents/A%20longer%20than%20usual%20folder%20name%20for%20testing/Subfolder%201%20also%20has%20a%20long%20name/3rd%20level%20subfolder%20about%20related%20documents/4th%20level%20subfolder%20about%20more%20specific%20documents/5th%20level%20subfolders%20are%20possible%20in%20this%20hierarchy/1234567.txt
The decoded URL is:
http://intranet.xyzco.com/sites/Testing/Documents/A longer than usual folder name for testing/Subfolder 1 also has a long name/3rd level subfolder about related documents/4th level subfolder about more specific documents/5th level subfolders are possible in this hierarchy/1234567.txt
Counting the characters in the URL gave me 284. To get closer to 260, I subtracted the 25 characters for the web application:
284 – 25 (Length of http://intranet.xyzco.com) = 259 characters
I didn't get a perfect 260, but it's close enough for me to believe that the web application host header name is not included in the limit. This is just a guess on my part, though.
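The counting above can be reproduced in a few lines. This is a sketch rather than anything SharePoint itself does: the class and method names are made up, and leaving the scheme and host out of the count reflects my guess above, not documented behavior. A .NET string is already UTF-16, so Length counts the 16-bit units the KB article refers to.

```csharp
using System;

public static class UrlLength
{
    // Decodes percent-escapes such as %20, then counts UTF-16 code units.
    public static int DecodedLength(string encodedUrl)
    {
        return Uri.UnescapeDataString(encodedUrl).Length;
    }

    // Same count minus the scheme and host (for example "http://intranet.xyzco.com").
    // Assumption: the host header name does not count toward the limit.
    public static int DecodedLengthWithoutHost(string encodedUrl)
    {
        var uri = new Uri(encodedUrl);
        string schemeAndHost = uri.GetLeftPart(UriPartial.Authority);
        return DecodedLength(encodedUrl) - schemeAndHost.Length;
    }
}
```

For the test URL above, the two methods should give the 284 and 259 I counted by hand.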
Why the 260 character limit?
A 260 character limit on the URL is interesting, considering both Windows and most internet browsers support paths much longer. It's not merely a coincidence that 260 also just so happens to be the value of the infamous MAX_PATH constant from the Windows API. .NET uses MAX_PATH because .NET relies on the Windows API behind the scenes. There are API workarounds, as discussed on the BCL team blog, but I think it's safe to assume that this limit is imposed on SharePoint by .NET in some way.
Removing invalid characters and patterns using a regular expression
The String object's Replace method doesn't contain an overload for replacing an array of strings, so I looked into using a regular expression to clean folder and file names.
Regular expressions have their own special characters that must be escaped if used for searching:
[ \ ^ $ . | ? * + ( )
Out of these, the following are also SharePoint's invalid characters: * ? | \
These are the characters that will need to be escaped in our regular expression.
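If you would rather not escape by hand, Regex.Escape can do it for each character. The following is a sketch with hypothetical names, not the code I ended up using. One caveat: Regex.Escape does not escape ']' or '-', which are special inside a character class, so this only works because neither is in the invalid list.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class InvalidCharPattern
{
    // The invalid characters listed earlier, including the apostrophe.
    private const string InvalidChars = "\\/:*?\"'<>|#{}%~&";

    // Builds a character class such as [\\/:\*\?...] from the list.
    public static string Build()
    {
        // Regex.Escape puts a backslash in front of regex metacharacters
        // and leaves ordinary characters unchanged.
        string body = string.Concat(
            InvalidChars.Select(c => Regex.Escape(c.ToString())));
        return "[" + body + "]";
    }

    // Removes every invalid character from the name.
    public static string RemoveInvalid(string name)
    {
        return Regex.Replace(name, Build(), string.Empty);
    }
}
```

Regex.Escape escapes a few more characters than strictly necessary (such as # and {), which is harmless inside the brackets.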
After a bit of fiddling, I came up with the following 4 expressions:
[\*\?\|\\/:"'<>#{}%~&]
for removing invalid characters
\.{2,}
for replacement of consecutive periods
^[\. ]|[\. ]$
for removing spaces and periods from the beginning and end of a folder or file name
" {2,}"
for replacement of consecutive spaces (enclosed by quotation marks so you can see the space)
I added a couple of rules to these expressions because of my migration strategy. Since I'm using WebDAV and building a physical folder hierarchy in Windows, I also need to be concerned about any additional restrictions imposed by the OS (a folder or file name can't end with a space). Also, I'm replacing consecutive spaces with a single space.
All expressions are applied with Regex.Replace(). Matches of expressions 1 and 3 are replaced with String.Empty; matches of expressions 2 and 4 are replaced with a single period and a single space, respectively. The order of the replacements matters: the invalid character replacement must be applied first. Combining the expressions and replacing everything in one pass can leave a rule violated once the invalid characters are gone. For example, the name %.afile.txt would become .afile.txt if done all at once, violating the rule that a period cannot be the first character.
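The %.afile.txt example can be checked with a short comparison. This is a sketch with hypothetical names; the combined pattern is simply expressions 1 and 3 joined with an alternation, which is one way someone might try to do it all at once.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplacementOrder
{
    private static readonly Regex InvalidChars =
        new Regex(@"[\*\?\|\\/:""'<>#{}%~&]");
    private static readonly Regex StartEnd =
        new Regex(@"^[\. ]|[\. ]$");
    // Expressions 1 and 3 combined into a single pattern.
    private static readonly Regex Combined =
        new Regex(@"[\*\?\|\\/:""'<>#{}%~&]|^[\. ]|[\. ]$");

    // One pass: the period is not at position 0 while the scan runs,
    // so it survives the removal of the leading %.
    public static string AllAtOnce(string name)
    {
        return Combined.Replace(name, string.Empty);
    }

    // Two passes: invalid characters first, then the start and end rule.
    public static string InSequence(string name)
    {
        string cleaned = InvalidChars.Replace(name, string.Empty);
        return StartEnd.Replace(cleaned, string.Empty);
    }
}
```

AllAtOnce("%.afile.txt") returns ".afile.txt", while InSequence("%.afile.txt") returns "afile.txt".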
After all replacements have been made, it's still possible to have one of the rules violated. For example, a folder named "Folder one . and . " (ends with space, period, space) would still be invalid after 1 pass of expression 3. It would still be invalid after a 2nd pass. Because of this, the beginning and end rule should be used in a loop until no matches are found. This doesn't help performance, but I was willing to compromise since my largest extranet (9000 files and hundreds of folders) was processed within a minute. Plus, I know the minute I post this someone's going to read it and say, "What was he thinking? It's so much faster to do it this way...".
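For anyone who does want the faster way, here is one possibility, offered as a sketch rather than as what my utility does. Adding a + quantifier lets the expression consume a whole run of periods and spaces at either end in one pass, and String.Trim with an explicit character list does the same without a regular expression. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EdgeTrim
{
    // Matches a run of periods and spaces at the start or at the end.
    private static readonly Regex StartEndRuns =
        new Regex(@"^[. ]+|[. ]+$", RegexOptions.Compiled);

    // Single regex pass, no loop required.
    public static string WithRegex(string name)
    {
        return StartEndRuns.Replace(name, string.Empty);
    }

    // Same result using String.Trim with the two characters to strip.
    public static string WithTrim(string name)
    {
        return name.Trim('.', ' ');
    }
}
```

Both methods turn "Folder one . and . " into "Folder one . and" in a single call.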
Fixing length restrictions
To make sure you include as many characters from the original folder or file name as possible, the naming restrictions should be enforced before the length restrictions.
To know how long a file name can be, it's important to know how close we are to the maximum allowed path length. Since I'm using a physical file hierarchy to stage the files, I can simply check the current folder's path length. Instead of going into too much detail about this, take a look at the maxLength integer in the following code listing. maxLength is what I used to determine how long a folder or file could be given the current path length.
An example method in C#
Following is the method I ended up with, along with some global variable initializations. You'll notice I added the tab character to the invalid characters list. During an export, I found a file name with embedded tab characters, so it was added to the list as well.
private const int MAXFOLDERLENGTH = 128, MAXFILELENGTH = 123;
private int MAXURLLENGTH = 259;
private Regex invalidCharsRegex =
new Regex(@"[\*\?\|\\\t/:""'<>#{}%~&]", RegexOptions.Compiled);
private Regex invalidRulesRegex =
new Regex(@"\.{2,}", RegexOptions.Compiled);
private Regex startEndRegex =
new Regex(@"^[\. ]|[\. ]$", RegexOptions.Compiled);
private Regex extraSpacesRegex =
new Regex(" {2,}", RegexOptions.Compiled);
/// <summary>
/// Returns a folder or file name that
/// conforms to SharePoint's naming restrictions
/// </summary>
/// <param name="original">
/// The original file or folder name.
/// For files, this should be the file name without the extension.
/// </param>
/// <param name="currentPathLength">
/// The current folder's path length
/// </param>
/// <param name="maxItemLength">
/// The maximum allowed number of characters for this file or folder.
/// For a file, it will be MAXFILELENGTH.
/// For a folder, it will be MAXFOLDERLENGTH.
/// </param>
private string GetSharePointFriendlyName(string original
, int currentPathLength, int maxItemLength)
{
// remove invalid characters and some initial replacements
string friendlyName = extraSpacesRegex.Replace(
invalidRulesRegex.Replace(
invalidCharsRegex.Replace(
original, String.Empty).Trim()
, ".")
, " ");
// assign maximum item length
int maxLength = (currentPathLength + maxItemLength > MAXURLLENGTH)
? MAXURLLENGTH - currentPathLength
: maxItemLength;
if (maxLength <= 0)
throw new ApplicationException(
"Current path is too long for importing into SharePoint");
// return truncated name if length exceeds maximum
// (keep exactly maxLength characters, then trim any trailing whitespace)
if (friendlyName.Length > maxLength)
friendlyName = friendlyName.Substring(0, maxLength).Trim();
// finally, check beginning and end for periods and spaces
while (startEndRegex.IsMatch(friendlyName))
friendlyName = startEndRegex.Replace(
friendlyName, String.Empty);
return friendlyName;
}
A typical call to this method would look similar to the following. In this listing, parent is a DirectoryInfo object pointing to the current folder.
fileName = GetSharePointFriendlyName(fileName
, parent.FullName.Length + 1, MAXFILELENGTH);
folderName = GetSharePointFriendlyName(folderName
, parent.FullName.Length + 1, MAXFOLDERLENGTH);
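Because the method expects a file name without its extension, the extension has to be split off before the call and added back afterward. The following is a sketch of that step with hypothetical names; the clean parameter stands in for a call to GetSharePointFriendlyName. It splits at the last period by hand and assumes the extension itself is short and contains no invalid characters.

```csharp
using System;

public static class FileNameParts
{
    // Splits at the last period. A name with no period, or with its only
    // period at position 0, is treated as having no extension.
    public static void Split(string fileName,
        out string baseName, out string extension)
    {
        int dot = fileName.LastIndexOf('.');
        if (dot <= 0)
        {
            baseName = fileName;
            extension = string.Empty;
        }
        else
        {
            baseName = fileName.Substring(0, dot);
            extension = fileName.Substring(dot); // includes the period
        }
    }

    // Cleans only the base name, then puts the extension back.
    public static string Rebuild(string fileName, Func<string, string> clean)
    {
        string baseName, extension;
        Split(fileName, out baseName, out extension);
        return clean(baseName) + extension;
    }
}
```

With MAXFILELENGTH at 123, a period plus a 4 character extension brings the full name to the 128 character limit.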
Testing the import to SharePoint using empty files
The best test would be to actually upload the files via WebDAV to a staging environment. However, if you receive an error message because of name restrictions or path length during the process, it's difficult to pick back up where the error occurred.
To quickly preview an upload, I modified my export utility to create empty files instead of building the folder hierarchy with the actual files. You can use these for a mock import in WebDAV even though SharePoint's UI will not allow you to upload an empty file. The following line was used to create the files.
using (StreamWriter sw = File.CreateText(fileName.ToString())) { };
The using statement makes sure the StreamWriter is closed after the file is created. I learned this the hard way when the OS threw an exception about a file being locked.
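A slightly fuller version of the same idea is sketched below. It is not the code from my utility: the helper name is hypothetical, and it uses File.Create instead of File.CreateText since nothing is written to the file. It also creates the parent folder when it is missing.

```csharp
using System;
using System.IO;

public static class PlaceholderFiles
{
    // Creates a zero-byte file at the given path, creating the folder if needed.
    public static void CreateEmpty(string path)
    {
        string folder = Path.GetDirectoryName(path);
        if (!string.IsNullOrEmpty(folder))
            Directory.CreateDirectory(folder); // no-op when it already exists

        // File.Create returns an open FileStream; disposing it immediately
        // releases the handle so the file is not left locked.
        using (File.Create(path)) { }
    }
}
```

Note that File.Create overwrites an existing file, so this should only be pointed at the staging hierarchy.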
Another benefit of using empty files is to preview the migration for your users. They can browse the document library and offer their approval. Since we've had to remove some characters and possibly truncate names, this could be very important to the success of the migration.
Export Utility
Just to offer some eye candy for this post, I ended up with something that looked like this: