Pragmatic Works

Enabling your business intelligence enterprise.
Welcome to Pragmatic Works Sign in | Join | Help
in Search

Brian Knight

February 2008 - Posts

  • Customer Labs: SQL Server Data Mining Bagging

    Recently I had a client who was receiving very good performance out of the decision tree algorithm on the top cases but the predictions in the middle of the cases left room for improvement. We tried each of the other algorithms like Logical Regression and Naive Baines but each had its own flaws. In all cases, the customer loved parts of each of the four primary SQL Server data mining algorithms but each left some key down at the bottom of the pile or promoted cases too aggressively.

    Enter a concept called data mining bagging. With data mining bagging, you can take all four algorithms that give you reasonable performance and combine their score into a more level prediction. In this customer’s case, we choose the top four algorithms for their data: Decision Tree, Clustering, Naïve Baines and Logical Regression. We were trying to predict the probability of a salesperson wanting to chase a given customer based on the odds of them getting the project. All the salespeople voted whether this was a good customer or not and these votes were all introduced as evidence.

    So the end game was this. We took all four algorithms and ran their respective queries in SSIS, loading a table for each algorithm of the customer and probability of it being a good customer (score). Then, we bagged the results in SSIS by taking the four tables and averaging their score. The result improved the already splendid results from data mining by 40%.

    By bagging your data, you ensure that one week model doesn’t inappropriately grade a case. It ensures that each algorithm essentially votes for each case and the more models that vote for that case, truly pushes it up. It also helps you get past the weakness of each model inherently will have. After seeing these results in SQL Server Analysis Services, I would highly recommend using bagging on any future projects.

    -- Brian Knight
  • SSIS Class Free Bonus Day

     In case you're looking for SSIS training in the immediately future, I have an upcoming class that we extended to 5 days for the same price. The fifth day includes training in SQL Server 2008 SSIS concepts and additional performance tuning on SSIS. It's our most complete SSIS class to date and there are only a few seats left if you can make it. 

    http://www.endtoendtraining.com/public/classes/coursedetails.aspx?courseid=19

    The upcoming class is in our Orlando training center. The training center is top notch with every seat a leather high-back seat and lunches served each night. On one night, we even take students to CheeseCake Factory and have a massage theropist come in one day to take care of class stress.

    Brian Knight 

     

    Posted Feb 23 2008, 08:13 PM by Brian Knight with no comments
    Filed under:
  • SQL Saturday Registration and Call for Speakers Open!

    We've now locked in on a venue and registration is now open for SQL Saturday #3 in Jacksonville. SQL Saturday is a free all day data related event for database professionals and developers who use data. The event is 5/3/8 at University of North FLorida. Space is limitted to the first 400 registrants and each of the other two event in Orlando and Tampa filled up to capacity. You can reserve your spot here:http://www.sqlsaturday.com/register.aspx

     We're also looking for sessions. You can see what sessions have been submitted, suggested or suggest a session here: http://www.sqlsaturday.com/possiblesessions.aspx 

    You can also submit a session if you're interested in speaking here: http://www.sqlsaturday.com/callforspeakers.aspx

    I hope to see you there and register soon to hold your space!

    Brian

     

  • Speaking at the Richmond SQL Server Users Group

     I'll be at the Richmond SQL Server Users Group next week if you're near the area.

    Brian Knight, SQL Server MVP
    presents Introduction to Data Mining

    Wednesday, February 27, 2008, 6:30 PM - 8:00 PM
    Social Time starts at 6:00 PM!
    Location: Markel Plaza
    4600 Cox Road
    Glen Allen, VA 23060 [map]

    Registration is not required for this event but if you plan to attend, please register so we'll know how many developers and database professionals to expect (feed). Thanks!

     

  • SSIS Connection Manager Data Types Mapping to SQL Server Data Types

    A Flat File Connection Manager will initially treat each column as a 50 character string by default. Leaving this default behavior will harm you when you go to an integer column into SQL Server or if your column contains more data than 50 characters. This Advanced page in the connection manager is the most important work you can do to ensure that all the data types for the columns are properly defined. You will also want to keep the data types as small as possible. If you have a zip code column for example that’s only 9 digits in length, define it as a 9 character string. Doing this will save an additional 41 bytes in memory multiplied by however many rows you have.

    A frustrating point with SSIS sometimes is how it deals with SQL Server data types. For example, a varchar maps in SSIS to a string column. It was made this way to translate well into the .Net development world and to make an agnostic product. The below table contains some of the common SQL Server data types and what they’re going to map to in a Flat File Connection Manager.

    SQL Server Data Type Flat File Connection Manager Data Type

    bigint

    eight-byte signed integer [DT_I8]
    binary byte stream [DT_BYTES]
    Bit Boolean [DT_BOOL]
    Date single-byte unsigned integer [DT_UI1]
    Datetime database timestamp [DT_DBTIMESTAMP]
    Decimal numeric [DT_NUMERIC]
    Float float [DT_R4]
    Int four-byte signed integer [DT_I4]
    Image image [DT_IMAGE]
    Nvarchar or nchar Unicode string [DT_WSTR]
    ntext

    Unicode text stream [DT_NTEXT]

    Numeric numeric [DT_NUMERIC]
    Smallint two-byte signed integer [DT_I2]
    Text text stream [DT_TEXT]
    Timestamp byte stream [DT_BYTES]
    Tinytint single-byte unsigned integer [DT_UI1]
    Uniqueidentifier unique identifier [DT_GUID]
    Varbinary byte stream [DT_BYTES]
    Varchar or char string [DT_STR]
    Xml Unicode string [DT_WSTR]
     
  • SQL Saturday Tampa

    I just got home from SQL Saturday Tampa. Bravo to the organizers (Wes and Pam specifically) and for all the fantastic volunteers. For those who came to my presentation on performance tuning SSIS, you can download the slide deck here:

    http://pragmaticworks.com/community/files/folders/bipresentations/entry37.aspx

    Thanks again for those who attending the sessions!

    Brian Knight

  • SQL Saturday Jacksonville Announced

    We're happy to finally have a date for SQL Saturday, Jacksonville. SQL Saturday is an all day free event dedicated to SQL Server technology for developers and DBAs. We'll be doing the event on 5/3/2008 at the downtown Jacksonville library. Like the Orlando event, we'll have room to support up to 400 people and lunch will be provided. The upcoming schedule for the event is as follows:

    • March 1 - Call for speakers opens
    • March 15 - Registration opens 

     You can see more information about the event soon at http://www.sqlsaturday.com.

     

     

     

  • Expression to Format Dates to Two Digits

    When you're creating a flat file, often times you'd like to place a date stamp at the end of the file name in order to show when the file was created simply when sharing the file with a partner. The problem comes in though where you wish to have a two digit date. If you were to create an expression that resembles the following code: 

    "C:\\Extract" +  SUBSTRING((DT_WSTR, 30)  GETDATE(),1,10) + ".csv"

    The leading zeros in one digit date (ie January with a 01) may yield you the results: C:\Extract2008-2-5.csv (results may vary based on your locale). Your partners may be a frustrated with you when sometimes you send them files with 6 digit dates and other times 8 digit dates since they're trying to automate their processes as well.

    The alternative is to first cast the getdate() function into a DT_DBDATE, which only holds the date (stripping the time) and then cast it into a string thereafter.

    "C:\\Extract" +  (DT_WSTR, 10) (DT_DBDATE) GETDATE() + ".csv"

    This will product a predictable format for your flat files of the following:

     C:\Extract2008-02-05.csv

    -- Brian Knight

More Posts
Powered by Community Server (Non-Commercial Edition), by Telligent Systems