Long before I even knew what SSIS was, I was doing what some might call ad-hoc programming when I was given a spreadsheet and told that they (the business, user) needed this data saved in a database. I looked at the spreadsheet and saw that there were 5,000 rows and I knew I didn't want to do this manually, so I did the next best thing. Regex. Here is how I did it.
Copied the data into a text editor
These days I like to use VS Code as my text editor as it has a lot of plugins and built-in support for source control (which is honestly the big reason why I use it). So, I took my data, which looks like this:
User data |
The data in VS Code |
Select the data with regex
In order to move this data into a SQL table, I need to INSERT the data in a SQL table. The easiest way to do this is to pull out each of the values in the data and insert the values into a string like so:
INSERT INTO SQLDBTable VALUES ('value1', 'value2', 'value3', .....)
In order to do this correctly, we need the correct regex to pull each of the values out of our data. Let's tackle each piece of data (first name, last name, birthday, etc.) one at a time and we'll wrap it up once we've matched everything.
First name
From the looks of the sample data we are given, the first names of each of our people are consistent - just letters. We can use simple regex to capture the first name from the data.
^(\w+)\t
But wait! How do we know the first name ends in a tab character? In VS Code, we can turn on visibility for whitespace characters (which includes tab characters) by holding down ctrl/cmd + shift and clicking p and typing "Preferences" and selecting Preferences: Open Settings (VS Code version 1.28.2). Turn Render Whitespace to All:
Turning on Render Whitespace |
Now, if you look at our data you will see a very small right arrow, which indicates the character is a tab:
The arrow means this is a tab character |
Note - the small dot means a space character.
Last name
A little bit more complicated than the first name, I see from our data sample that a few last names have spaces, as well as apostrophes and hyphens. Luckily, this isn't hard for us in regex.
^(\w+)\t([\w'\- ]+)\t
This regex is matching letters, apostrophes, hyphens and spaces within a character set (also called a character class). We use the + (which is a quantifier) to essentially say "match one or more of the characters as you can within the character set (all the tokens within the brackets [ ])." We follow this by another tab character.
Note - we are using the parentheses to capture our matches to save them for later. If we didn't capture the characters that match the regex, we would not be able to pull out the values from our data!
Birthday
Unfortunately, the data in the birthday column is not consistent - which isn't that uncommon in real-world scenarios at all. We will need to handle this somehow in regex, and I feel the best way in this case is to capture each of the data points (month, day and year) separately and we can combine them together later.
^(\w+)\t([\w'\- ]+)\t(\d+)[\-/](\d+)[\-/](\d+)\t
This regex is matching one or more digits, \d, followed by a hyphen or forward slash - repeated. You notice that we have to escape the hyphen because the hyphen can be used in ranges and we do not want that functionality here.
Income
Again, inconsistent data here too. We ended up speaking with the "business" and looking at our existing SQL table and only need to insert the numeric value into the SQL table. We will be dropping the commas, dollar sign and any decimal found in the data.
^(\w+)\t([\w'\- ]+)\t(\d+)[\-/](\d+)[\-/](\d+)\t\$?(\d+),?(\d+)\.?\d?\d?\t
This regex uses a little bit of the optional quantifier (the question mark ?) to make certain characters match 0 or 1 time (effectively what that means is the character or token before the question mark is optional). We mark the dollar sign optional, which also needs to be escaped because the dollar sign is an end-of-string anchor).
The decimal value is also marked as optional, by adding a question mark after each of the tokens.
Note - the period in regex stands for a wildcard, meaning any character. Since we just want to match against a literal period, we need to escape it with a preceding backslash. This is true for any character we want to escape, the character must be preceded by a backslash.
The email field is simpler from the last two fields, so we get a little bit of relief! The regex for the email field will be very similar to the last name field.
^(\w+)\t([\w'\- ]+)\t(\d+)[\-/](\d+)[\-/](\d+)\t\$?(\d+),?(\d+)\.?\d?\d?\t([\w@\-!]+)\.?c?o?m?\t
This regex matches one or more letters, @ signs, hyphens and exclamation marks followed by optional ".com".
Address 1
I spoke with the business, and they assured me there would be no tab characters in the address 1 field, so we are able to write very simple regex to capture the address 1 field.
^(\w+)\t([\w'\- ]+)\t(\d+)[\-/](\d+)[\-/](\d+)\t\$?(\d+),?(\d+)\.?\d?\d?\t([\w@\-!]+)\.?c?o?m?\t([^\t]+)\t
Our regex captures one or more characters that is not a tab character. It does this through the use of a negated character set. A negated character set matches any character that is not within the set.
I held off showing this regex to introduce us to other types of regex capturing for the first or last name, when in reality the above regex for address 1 would suffice for first and/or last name.
Zip
We do not want the optional 4 digits for zip code, so all we need to do to finish off our regex is just capture the 5 digits and ignore anything at the end of the string.
^(\w+)\t([\w'\- ]+)\t(\d+)[\-/](\d+)[\-/](\d+)\t\$?(\d+),?(\d+)\.?\d?\d?\t([\w@\-!]+)\.?c?o?m?\t([^\t]+)\t(\d{5}).*$
This regex matches 5 digits and then we have a wildcard match .* to the very end of the string $. The asterisk is a quantifier, just like the +, that matches 0 or more characters of the preceding token. Again, because the period is a wildcard token, the following tokens .* will match everything.
Reformatted the data
Now that we have our full regex, we can select everything in our text editor and do a regex find/replace. Before you do that, do delete the column headers! We do not want to cause issues with the presence of the column headers in our file. We can do a regex find/replace by opening up the find window by using ctrl/cmd + f and selecting the .* option in VS Code to enable regex matching.
Next, in our replace input box, we will use the following template to create our SQL INSERT statements. The numbers after the $ indicate capture group, that is $1 will be replaced by the first capture group, which is the first name. $2 will be replaced by the second capture group, which is the last name, and so on. You can see what value I used for the replace input box below
INSERT INTO Data VALUES ('$1', '$2', '$5-$3-$4', '$6$7', '$8.com', '$9', '$10')
About to get our SQL INSERT statement! |
Hit replace all or ctrl/cmd + alt + enter. This is our result:
Our new SQL INSERT statements |
Finishing touch
Here are a few tips to make your SQL a little more professional, add the following touches in order to wrap the SQL in a transaction and make it log output in case someone else has to run your SQL in a different environment / server.
BEGIN TRANSACTION
GO
INSERT INTO SQLDBTable VALUES ('Ray', 'Smith', '1987-5-2', '120000', 'rsm@aol.com', '123 Pleasant St.', '42511')
INSERT INTO SQLDBTable VALUES ('Richard', 'O' Charles', '2000-1-1', '10000', 'r_0charlz@ff.com', 'Benjamin road', '98292')
INSERT INTO SQLDBTable VALUES ('Rick', 'Xavier', '1994-10-22', '15000', 'rxavier@hotmail.com', 'Los Gatores', '77261')
INSERT INTO SQLDBTable VALUES ('Richardson', 'Chesterton', '1958-7-30', '80000', 'chestertonrichard@gmail.com', 'P.O. box 561', '23443')
INSERT INTO SQLDBTable VALUES ('Ricky', 'Lemonz', '1999-11-3', '22000', 'lemonz@sc.com', 'Mary ln.', '58273')
INSERT INTO SQLDBTable VALUES ('Rebecca', 'Fiddler', '2006-9-14', '500', 'fiddlerontheroof@ilovemusic.com', '2867 Bear Creek', '84271')
INSERT INTO SQLDBTable VALUES ('Rachel', 'Neber-Julius', '1977-4-21', '80000', 'neber-juliusthequeen!@me.com', 'Apt Complex', '13331')
GO
PRINT '******* Transaction rolled back - commit if there are no errors'
GO
ROLLBACK TRANSACTION
I was diagnosed as HEPATITIS B carrier in 2013 with fibrosis of the
ReplyDeleteliver already present. I started on antiviral medications which
reduced the viral load initially. After a couple of years the virus
became resistant. I started on HEPATITIS B Herbal treatment from
ULTIMATE LIFE CLINIC (www.ultimatelifeclinic.com) in March, 2020. Their
treatment totally reversed the virus. I did another blood test after
the 6 months long treatment and tested negative to the virus. Amazing
treatment! This treatment is a breakthrough for all HBV carriers.