In the fall of 2024, I had the opportunity to work with Matt Eland as one of the editors for his book “Data Science with .NET and Polyglot Notebooks: Programmer's guide to data science using ML.NET, OpenAI, and Semantic Kernel”. Matt is a very intelligent and knowledgeable data science developer, and it is definitely reflected in his work. He walks the reader through step-by-step directions that demonstrate key concepts in data science, machine learning, and polyglot notebooks. This was one of the rare books that I could hardly put down. I urge you to pick up a copy and upgrade your data science skills.
Sam Nasr
Friday, January 10, 2025
"Data Science with .NET and Polyglot Notebooks" By Matt Eland
Thursday, January 9, 2025
Recap of "How To Tune A Multi-Terabyte Database For Optimum Performance"
On October 29, 2024 at the GroupBy conference, I was the moderator for Jeff Taylor's session "How To Tune A Multi-Terabyte Database For Optimum Performance".
The video is available at https://www.youtube.com/watch?v=9j51bD0DPZE
Listed below are some takeaways and Q&As from his session:
Ideal Latency time:
20ms for IO
10ms for TempDB
Crystal Disk Mark is a simple disk benchmark software: https://crystalmark.info/en/software/crystaldiskmark/
What is the overhead of running these diagnostics (e.g. diskspd and Crystal Disk Mark)?
No adverse effects during mid-day testing, but don't run it during a busy time.
It's best to test during both busy and non-busy times.
Multipath: multiple network cards between host, switch and SAN appliance
For tempdb storage, what's preferable?
Shared space on a disk pool with a lot of drives, or a dedicated pool with just 2 drives (RAID 1)? All drives are of the same type (NVMe).
"Shared" means with other databases
Run tempdb in memory on newer versions
RAID 10 will be fastest
Keep tempdb separate from other DBs
By in memory tempdb, does that mean memory optimized tempdb metadata option? Yes
Jumbo Frames are 8192 when enabled; they should be used for the storage network to transfer more across the network and avoid issues
RAID 5 is the best economy/performance combination on both SSD and conventional drives
RAID 5 for data? What about the write penalty overhead? Why not RAID 10? RAID 10 is best, but RAID 5 will perform sufficiently. Make sure you have enough memory for operational needs.
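The write penalty raised in this question can be estimated with back-of-the-envelope arithmetic. The sketch below is not from the session: it assumes the commonly cited rule-of-thumb penalties of 4 back-end I/Os per write for RAID 5 and 2 for RAID 10, and the drive count, per-drive IOPS, and write fraction are made-up numbers for illustration only.

```python
# Rule-of-thumb write penalties (back-end I/Os per front-end write).
# These are common planning values, assumed here; they are not from the session.
WRITE_PENALTY = {"RAID 5": 4, "RAID 10": 2}


def effective_iops(raid_level, drives, iops_per_drive, write_fraction):
    """Estimate the front-end IOPS an array can sustain for a read/write mix."""
    raw_iops = drives * iops_per_drive
    read_fraction = 1.0 - write_fraction
    # Each read costs 1 back-end I/O; each write costs `penalty` back-end I/Os.
    cost_per_io = read_fraction + write_fraction * WRITE_PENALTY[raid_level]
    return raw_iops / cost_per_io


if __name__ == "__main__":
    # Hypothetical array: 8 drives, 5000 IOPS each, 30% writes.
    for level in ("RAID 5", "RAID 10"):
        print(level, round(effective_iops(level, 8, 5000, 0.3)))
```

With these assumed inputs, RAID 10 sustains noticeably more I/O than RAID 5, and the gap widens as the write fraction grows, which is consistent with the advice that RAID 5 is acceptable mainly when enough memory keeps reads off the disks.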
Use New Mbsv3 series VMs from Azure
Would you consider local RAID 0 for tempdb? Yes, you can, but RAID 1 is preferable because it is redundant, so tempdb stays live if a drive fails.
nvarchar: N for "National" characters for various foreign languages, 2+ bytes per character
Prefer INT (4 bytes) over BIGINT (8 bytes) when the value range allows
datetime2 (6 to 8 bytes, depending on precision) is preferred over datetime (8 bytes)
Unicode size on disk is one thing, size when in cache is a worse problem
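To get a feel for how these data type choices add up on a large table, here is a rough sizing sketch. The row count and average string length are hypothetical, and the per-character sizes assume a single-byte varchar collation and 2-byte nvarchar characters; row, page, and variable-length overhead are ignored.

```python
# Fixed-width integer sizes in bytes.
INT_BYTES = 4
BIGINT_BYTES = 8

# Assumed per-character sizes: varchar with a single-byte collation,
# nvarchar with 2-byte characters (some characters need more).
VARCHAR_BYTES_PER_CHAR = 1
NVARCHAR_BYTES_PER_CHAR = 2


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


rows = 1_000_000_000  # hypothetical table size
avg_chars = 50        # hypothetical average string length

# Bytes saved by using the narrower type in one column across all rows.
key_savings = rows * (BIGINT_BYTES - INT_BYTES)
text_savings = rows * avg_chars * (NVARCHAR_BYTES_PER_CHAR - VARCHAR_BYTES_PER_CHAR)

if __name__ == "__main__":
    print(f"INT instead of BIGINT saves about {gib(key_savings):.1f} GiB")
    print(f"varchar instead of nvarchar saves about {gib(text_savings):.1f} GiB")
```

The same bytes are carried into the buffer cache when the pages are read, which is why the wider types hurt memory as well as disk.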
What do you think about using IFI (Instant File Initialization) for log file in 2022? Recommended
Avoid heap tables. However, Markus Winand, author of SQL Performance Explained, shows some specialized cases where a heap is better.
See https://medium.com/@nhatcuong/sql-performance-explained-by-markus-winand-some-notes-after-the-first-read-1dde208f2fd7 for more info
Friday, December 13, 2024
Temporal Tables FAQ
I had the pleasure of presenting Temporal Tables to the Capital Area .NET User Group on December 10, 2024. Some interesting questions arose from that meeting, so I thought it would be good to share them on my blog for reference.
- Is the historical table logged to the Transaction log in the same way a conventional table is? Will we “see” the inserts in the Tran log the same way we see them for a normal table?
No
- Are Temporal Tables available in Azure?
Yes, in Azure SQL Database and Azure SQL Managed Instance
- Will Temporal Tables work with graph tables?
No, node and edge tables can't be created as system-versioned temporal tables
Ref: https://learn.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-architecture?view=sql-server-ver16
- What triggers the purge of SQL Server Temporal Tables?
A background task is created to perform aged data cleanup for all temporal tables with a finite retention period.
ref: https://learn.microsoft.com/en-us/azure/azure-sql/database/temporal-tables-retention-policy?view=azuresql
- Can I modify records in the history table with versioning OFF?
Yes
- Can I alter the current table with versioning ON?
Yes, the change is applied to both tables simultaneously
- Does EF Code First Support Temporal Tables (TT)?
Yes, EF Core 6.0 and later supports:
- The creation of temporal tables using Migrations
- Transformation of existing tables into temporal tables, again using Migrations
- Querying historical data
- Restoring data from some point in the past
- When are Temporal Tables NOT a good fit?
For static data, where a user can't change a field.
In the following industries and scenarios, users are rarely, if ever, allowed to delete any data:
- Financial services (e.g. stock purchases, banking)
- Purchases
- GPS Location Tracking
- Fraud Detection (Location of purchases, IP/location of login)
- Tracking appointments (Resolve Customer service issues)
- Car Rental company, tracking owner and mileage when vehicle was in possession
- Hotel room rental
- Airline seat charts, who sat where in which flight
- Are other approaches available for managing historical data in the temporal history table?
Yes, the following four methods are available:
- Stretch Database
- Table Partitioning
- Custom Cleanup Script
- Retention Policy
- Can Temporal tables be used with partitioned databases?
Yes, with some limitations. If the current table is partitioned, the history table is created on the default filegroup because the partitioning configuration is not replicated automatically from the current table to the history table.
- Can the temporal table be placed in another DB?
No, the history table must be created in the same database as the current table.
However, it can be placed in a different schema within the same database.
Also, temporal querying over a linked server is not supported.
- Can it be used with Elastic DBs?
Yes
- Can I alter table schema with sys versioning = on?
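Yes, for most changes, such as adding or altering a column; these are applied to both tables. Some changes, such as adding an IDENTITY or computed column, require setting SYSTEM_VERSIONING = OFF first.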
- Can I add additional fields to History table?
No, fields and field nullability must be identical
- Is there automatic truncation of history table?
Yes, using HISTORY_RETENTION_PERIOD = 6 MONTHS
Tuesday, July 23, 2024
Questions on Copilot Data Privacy
Q: I’m concerned about what Microsoft and/or the US government can do with the data for a custom copilot. I’ve looked at the Microsoft Copilot documentation, but I didn’t find anything that clearly states what Microsoft can and cannot do with data used in custom copilots. Do you have any resources that you can share?
A: Microsoft posted info about this topic specifically at https://learn.microsoft.com/en-us/legal/cognitive-services/openai/data-privacy?context=%2Fazure%2Fcognitive-services%2Fopenai%2Fcontext%2Fcontext#see-also
In a nutshell:
Your prompts (inputs) and completions (outputs), your embeddings, and your training data:
- are NOT available to other customers.
- are NOT available to OpenAI.
- are NOT used to improve OpenAI models.
- are NOT used to improve any Microsoft or 3rd party products or services.
- are NOT used for automatically improving Azure OpenAI models for your use in your resource (The models are stateless, unless you explicitly fine-tune models with your training data).
- Your fine-tuned Azure OpenAI models are available exclusively for your use.
The Azure OpenAI Service is fully controlled by Microsoft; Microsoft hosts the OpenAI models in Microsoft’s Azure environment and the Service does NOT interact with any services operated by OpenAI (e.g. ChatGPT, or the OpenAI API).
Q: Does Microsoft have the same Data Privacy policy for Copilot studio as Azure AI Studio? Is there similar documentation for custom copilots created in copilot studio?
A: It appears to be the same. After logging into Copilot Studio, browse to https://www.microsoft.com/licensing/terms/product/PrivacyandSecurityTerms/all . From there, you can download the Data Protection Addendum from https://aka.ms/DPA. On page 5 it states:
Nature of Data Processing; Ownership
Microsoft will use and otherwise process Customer Data, Professional Services Data, and Personal Data only as described and subject to the limitations provided below (a) to provide Customer the Products and Services in accordance with Customer’s documented instructions and (b) for business operations incident to providing the Products and Services to Customer. As between the parties, Customer retains all right, title and interest in and to Customer Data and Professional Services Data. Microsoft acquires no rights in Customer Data or Professional Services Data, other than the rights Customer grants to Microsoft in this section. This paragraph does not affect Microsoft’s rights in software or services Microsoft licenses to Customer.
Thursday, June 6, 2024
"Build Your Own Copilot" Resource Links
Interested in building your own Copilot using Azure AI Studio? Listed below are some useful links:
- Azure AI Studio Architecture
- Quickstart: Create a project and use the chat playground in Azure AI Studio
- Tutorial: Deploy an Enterprise Chat web app
- AI Studio FAQ
Thursday, April 25, 2024
May '24 Regional Tech Events
User Groups
- May 2: Twin Cities .NET User Group
- May 7: Ohio North Database Training
- May 8: Azure Cleveland
- May 16: GLUG.NET
- May 22: Cleveland C# User Group
Conferences
- May 3: Stir Trek
- Jun 27-28: Kansas City Developer Conference
- Jul 26: Cincy Deliver
Wednesday, March 27, 2024
Apr '24 Regional Tech Events
User Groups
- Apr 2: Ohio North Database Training
- Apr 4: Twin Cities .NET User Group
- Apr 10: Azure Cleveland
- Apr 18: GLUG.NET
- Apr 24: Cleveland C# User Group
Conferences
- May 3: Stir Trek
Trainers in ML.NET
Machine learning tasks like regression and classification contain various algorithm implementations.
Some tasks may utilize the same algorithm; for example, the SDCA algorithm is used in both Binary Classification and Regression tasks.
In some cases, the problem you are trying to solve and the way your data is structured do not fit well with the current algorithm.
If so, consider using a different algorithm for your task to see if it learns better from your data.
A trainer identifies a single algorithm used for a single task (i.e. Trainer = Algorithm + Task).
Listed below is a summary of trainers available in ML.NET. For more info, see guidance on which algorithm to choose.
| Trainer | Algorithm | Task | ONNX Exportable |
| | | Binary classification | Yes |
| | | Binary classification | Yes |
| | | Multiclass classification | Yes |
| | | Multiclass classification | Yes |
| | | Regression | Yes |
| | Averaged Perceptron | Binary classification | Yes |
| | | Binary classification | Yes |
| | | Multiclass classification | Yes |
| | | Regression | Yes |
| | Symbolic stochastic gradient descent | Binary classification | Yes |
| | Online gradient descent | Regression | Yes |
| | Light gradient boosted machine | Binary classification | Yes |
| | Light gradient boosted machine | Multiclass classification | Yes |
| | Light gradient boosted machine | Regression | Yes |
| | Light gradient boosted machine | Ranking | No |
| | Fast Tree | Binary classification | Yes |
| | Fast Tree | Regression | Yes |
| | Fast Tree | Regression | Yes |
| | Fast Tree | Ranking | No |
| | Fast Forest | Binary classification | Yes |
| | Fast Forest | Regression | Yes |
| | Generalized Additive Model | Binary classification | No |
| | Generalized Additive Model | Regression | No |
| | Matrix Factorization | Recommendation | No |
| | Field Aware Factorization Machine | Binary classification | No |
| | One Versus All | Multiclass classification | Yes |
| | Pairwise Coupling | Multiclass classification | No |
| | KMeans | Clustering | Yes |
| | Randomized Pca | Anomaly detection | No |
| | Naive Bayes Multiclass | Multiclass classification | Yes |
| | Prior | Binary classification | Yes |
| | Linear Svm | Binary classification | Yes |
| | Ld Svm | Binary classification | Yes |
| | Ols | Regression | Yes |
Tuesday, March 12, 2024
What is Auto-GPT?
Auto-GPT is an experimental project developed by Significant Gravitas. It’s an open-source Python application powered by GPT-4.
Unlike ChatGPT, Auto-GPT does not rely on human prompts to operate. It can self-prompt and tackle subsets of a problem without human intervention. It works by pairing GPT with AI agents that can make decisions and take actions based on a set of rules and predefined goals.
Auto-GPT is important and relevant because it showcases the potential of language models like GPT-4 to autonomously complete different types of tasks. It can write and execute its own code using GPT-4, allowing it to debug, develop, and enhance its own capabilities recursively.
Accessing Auto-GPT requires specific installed software, familiarity with Python, and an API key from OpenAI. It runs locally on a Mac or PC, or in a Docker container.
For a complete tutorial on how to use AutoGPT, visit https://youtu.be/v-5AWQlTFw8
For more info, see "What is Auto-GPT" and "What Is the Difference Between ChatGPT vs Auto-GPT?"
Thursday, March 7, 2024
ML.NET Task Metrics
ML.NET can perform 7 different Machine Learning Tasks via the MLContext object:
- Binary Classification
- Multi-class/text Classification
- Regression and Recommendation
- Clustering
- Ranking
- Anomaly Detection
- Sentence Similarity
Each task offers various performance metrics for evaluating the model after training is completed.
These metrics are properties of the object returned by the Evaluate() method of each task (e.g. mlContext.Regression.Evaluate()).
Sample Code Snippet
using System;
using Microsoft.ML;

// HouseData is not shown in the original snippet; this minimal definition is assumed.
public class HouseData
{
    public float Size { get; set; }
    public float Price { get; set; }
}

class Program
{
    static void Main(string[] args)
    {
        MLContext mlContext = new MLContext();

        // 1a. Create training data
        HouseData[] houseData = {
            new HouseData() { Size = 1.1F, Price = 1.2F },
            new HouseData() { Size = 1.9F, Price = 2.3F },
            new HouseData() { Size = 2.8F, Price = 3.0F },
            new HouseData() { Size = 3.4F, Price = 3.7F } };

        // 1b. Import training data
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);

        // 2. Specify data preparation and model training pipeline
        var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));

        // 3. Train model
        var model = pipeline.Fit(trainingData);

        // 4. Evaluate the model against separate test data
        HouseData[] testHouseData =
        {
            new HouseData() { Size = 1.1F, Price = 0.98F },
            new HouseData() { Size = 1.9F, Price = 2.1F },
            new HouseData() { Size = 2.8F, Price = 2.9F },
            new HouseData() { Size = 3.4F, Price = 3.6F }
        };
        var testHouseDataView = mlContext.Data.LoadFromEnumerable(testHouseData);

        // Transform() adds the predicted "Score" column that Evaluate() compares with "Price"
        var testPriceDataView = model.Transform(testHouseDataView);
        var metrics = mlContext.Regression.Evaluate(testPriceDataView, labelColumnName: "Price");

        // 5. Read the metrics exposed as properties of the result
        double rs = metrics.RSquared;
        double rmse = metrics.RootMeanSquaredError;
        Console.WriteLine($"R-squared: {rs:0.###}, RMSE: {rmse:0.###}");
    }
}
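The values read from the metrics object at the end of the snippet are standard regression statistics. As a rough illustration of what they measure (this is not ML.NET's implementation), the sketch below computes MAE, MSE, RMSE, and R-squared by hand. The actual prices are the test labels from the snippet; the predicted prices are hypothetical stand-ins, not real output from the trained model.

```python
import math


def regression_metrics(actual, predicted):
    """Compute common regression metrics from paired actual/predicted values."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(r) for r in residuals) / n   # Mean Absolute Error
    mse = sum(r * r for r in residuals) / n    # Mean Squared Error
    rmse = math.sqrt(mse)                      # Root Mean Squared Error

    # R-squared: 1 minus (residual sum of squares / total sum of squares).
    mean_actual = sum(actual) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r_squared = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r_squared": r_squared}


if __name__ == "__main__":
    actual_prices = [0.98, 2.1, 2.9, 3.6]     # test labels from the snippet
    predicted_prices = [1.1, 2.0, 3.0, 3.7]   # hypothetical predictions
    for name, value in regression_metrics(actual_prices, predicted_prices).items():
        print(f"{name}: {value:.4f}")
```

An R-squared close to 1 and an RMSE close to 0 both indicate predictions that track the labels closely; RMSE is expressed in the same units as the label (here, price).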
Metrics Summary
Listed below is a summary of six ML.NET tasks and their metrics:
| BinaryClassification | MulticlassClassification | Regression |
| | | MAE (Mean Absolute Error) |
| AreaUnderPrecisionRecallCurve | Log-Loss | MSE (Mean Squared Error) |
| | | RMSE (Root Mean Squared Error) |
| Clustering | Ranking | AnomalyDetection |
| Avg Distance | DCG | Area Under ROC Curve |
| Davies Bouldin Index | Normalized DCG | Detection Rate At False Positive Count |
| NMI | | |
Reference: https://learn.microsoft.com/en-us/dotnet/machine-learning/resources/metrics