Data Services code migration & maintenance: simplify your lifecycle & improve your code


For so many of my customers, SAP Data Services is a relatively new tool. A typical DS project is mainly focused on ensuring the solution
works for the business and is launched on time. Unfortunately, many of these projects fail to utilize some of the built-in features of Data
Services to help simplify how code is managed in the solution. This is an architecture gap that adds hidden costs to owning and
operating the solution.

In this article, I outline the framework for managing Data Services code that I have taught to dozens of the largest customers in the
Americas. Ideally, you should implement this during the blueprint phase so that you provide your developers with the tools and
processes to create better code the first time. However, you can still benefit from this framework even if you are approaching the testing
phase of a Go-Live.

The elements of this framework include:

1. Implement multiple central repositories


In Data Services, a "central repository" is a different type of repository used only for version control of objects. This is comparable to
version control systems like CTS+, Visual SourceSafe, Apache Subversion, BusinessObjects LCM, etc. You can check in new code,
check it out to work on it, check in new versions, and get copies of specific versions.

Many customers do not use central repositories. Instead, they create their code in a local repository, export the code to an ".atl" file, and
import it into the test or production local repository. You can save backups of the .atl file and keep track of them in a number of
ways...even Apple Time Machine and Dropbox can keep track of multiple versions of a file through time. However, this is likely not a
scalable or trustworthy solution for enterprise IT.

If you want to learn how to work with a central repository, the Data Services Tutorial Chapter 12 "Multi-user Development" does a
fantastic job at demonstrating all the techniques. The "Using a Central Repo" Wiki Page also captures some of the basic techniques. But
neither will tell you why, or discuss how you should set up your landscape or processes.

[Note: There are two different types of central repositories: non-secure and secure. Secure central repositories let you grant specific users permissions on specific objects and provide an audit trail of who changed which objects. Non-secure central repositories lack these features, so I never recommend using them. In this article, whenever I refer to a central repository, I mean a secure central repository. Chapters 23-24 in the Data Services Designer Guide discuss these differences.]

This is how I recommend you configure your secure central repositories:

- Development Central – a central repository that can be accessed by developers and testers. Developers create content in their local repositories and check this content into the development central repository. Each logical set of code should be checked in with the same label (e.g. "0.1.2.1a") so that the objects can be easily identified and grouped together.

During a test cycle, a tester logs into a local repository dedicated to testing and connects to the development central repository. The
tester gets all objects to be tested from the development central repository. The tester deactivates the connection to the development
central repository and then connects to the test central repository.

- Test Central – a central repository that can be accessed by testers and production administrators. During the test cycle, testers check in development objects before and after testing, labeling them appropriately (e.g. "0.1.2.1pretest" and "0.1.2.1passed"). Thus, the test central repository contains only objects that have been promoted from development to test and have passed testing.

- Production Central – a central repository that can be accessed only by production administrators. When testers certify that the code can be migrated to production, a production administrator logs into a production local repository. The administrator activates a connection to the test central repository and gets a copy of all objects to be promoted to production (e.g. "0.1.2.1passed"). The administrator deactivates the test central repository and then activates the production central repository. All objects that were promoted into production are then checked into the production central repository (e.g. "0.1.2.1prod"). Thus, the production central repository contains only objects that have been successfully put into production.

Remember, central repositories are only for version control, storing your code, and helping you migrate it. You never run batch jobs or
launch real-time services from a central repo -- only from a local repo.

This tiered approach looks like this:


The repositories themselves are just database schemas -- you can put them in any supported database. Check the Product Availability Matrix for Data Services to see which databases are supported. However, I recommend grouping them together within the same physical database for each tier. For instance:

- Dev database -- dev local repositories, dev central repository, and dev CMS database. Co-located with the dev Data Services hardware.
- Test database -- test local repository, test central repository, and test CMS database. Co-located with test Data Services hardware.
- Prod database -- prod local repository, prod central repository, and prod CMS database. Co-located with prod Data Services hardware.

1.1 Additional best practices for central repositories


- Security -- Set up group-based permissions for repository authentication and for individual objects. Refer to the Designer Guide section 24.1.1, Management Console Guide section 3.3.1, and Administrator's Guide section 4.1.

- Checking out datastores -- Using the security features of secure central repositories, make sure that only specific groups have read+write permissions on datastores. Everyone always has permissions to edit datastores in their local repository, but it would quickly become disorganized if everyone checked those datastore changes into the central repository. Thus, you should have administrators create your datastores and check them into the central repository. Anyone can get them from the central repo, but only administrators have permissions to check them out, modify them, and check in their changes. For more info on defining datastores, see "3. Define datastore configurations" below.

- Backup -- These repositories contain most of your investment in your DS solution! Make sure to back up these databases regularly as you would with any other database. Too often I see no backups taken on the development central repository because "it's not part of the productive tier." This is a terrible business decision! What if your development central repository database crashes and your developers lose everything?

- Designer performance -- DS Designer requires a database connection to the local and central repositories. I always meet people who complain about Designer being too slow. Okay, but you are using Designer on your laptop in the Toronto airport from a VPN connection to your Seattle network hub and the repo database is in your Chicago datacenter. Designer performs numerous small transactions that each require network round-trips -- if the connection is slow, Designer is going to be slow to save anything to your local repository or interact with a central repository.

Are you regularly using a thick-client Windows app like Designer from remote locations? Maybe you should think about putting Designer
on Citrix Presentation Server -- check the Installation Guide for Windows section 6.6. Additionally, Designer 4.1 introduces the ability to
use DS under multiple Windows Terminal Server users.

- Concurrent usage -- I often hear issues about developers connected to the same central repo whose Designer hangs whenever their colleagues do anything ("Get Latest Version", "Check-Out by Label", etc.). To protect the code from being corrupted by multiple people trying to do multiple things at the same time, Designer performs table locking on certain central repo tables. While one user has an exclusive table lock on a central repo table, any other users trying to interact with the same table are queued until the first user's exclusive table lock is released. How to work around this? Simple -- don't keep your connection to the central repo active all the time. There's a Designer option that activates a central repo connection automatically, and you should disable it. Only activate your central repo connection when you need to get code from or check code into the central repo.

2. Define substitution parameters & multiple substitution parameter configurations

Substitution Parameters are such a handy feature, but I seldom see them used to their full potential! If you know C++, they are similar to
compiler directives. They are static values that never change during code execution (so we don't call them variables). They are called
"substitution" parameters because their values get substituted into the code by the optimizer when you run the job. They can thus
change the run-time behavior of your code.

Often I see many programmers use a script block at the beginning of a job to set global variable values. These global variables are then
used to control the logic or mappings later in the job. However, in 90% of these cases the global variables NEVER CHANGE during
runtime. So now you have several problems in your code:

- You hid your global variable declarations in a script somewhere in your job. How do you expect other people to understand what you did in your code?
- A global variable is specific to one job only. Other jobs do not inherit global variable names, types, or values. So if you have 100 jobs that use a variable named $START_DATE, you have to declare $START_DATE in every one of those 100 jobs.
- Global variables have no way of being set quickly en masse. You can override them individually at run-time, but this introduces the risk of human error.

Substitution parameters fix all of these global variable shortcomings. They are defined for an entire repository, not per individual job. Their values are controlled at a repository level, so you don't have to include scripts to set them. They cannot change during run-time, so they don't run the risk of being modified erroneously. Lastly, they don't have just one default value -- you can set up multiple substitution parameter configurations for your repository so that you have multiple different sets of run-time values.
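
For example, here is a minimal sketch of the difference (the names $G_STAGING_DIR, $G_IS_DELTA, [$$STAGING_DIR], and [$$IS_DELTA] are made up for illustration):

    # Anti-pattern: a script object at the start of the job sets global variables
    # that never actually change during execution.
    $G_STAGING_DIR = '/data/staging/sales/';
    $G_IS_DELTA = 'Y';

    # With substitution parameters, these static values are defined once at the
    # repository level and objects simply reference [$$STAGING_DIR] and [$$IS_DELTA];
    # no initialization script is needed, and every job sees the same values.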

Here are some common uses for substitution parameters:

- File paths and file names -- tell jobs where to find files in a specific staging area or target location. If you always set flat file and XML file sources and targets to use substitution parameters instead of hard-coded paths, you can change all file locations at once globally instead of having to find every single object, drill into it, and change the path. This is also used to specify reference data locations.
- Control logic -- tell the same job how to run differently if a different substitution parameter value is found. You can use this to set up one job that does both initial loading and delta loading. You can have a conditional block evaluate a parameter named [$$IS_DELTA] and decide whether to process the "delta" workflow or the "initial" workflow (see the sketch after this list). This lets you have fewer jobs and simplifies your life!
- Transform options -- tell transforms to behave in a specific way. This is often used in Data Quality transform options to set country-specific options, engine options, performance parameters, or rules. However, you can use them in most of the transforms and mappings to override hard-coded values with your own substitution parameters.
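
As a sketch of the first two uses (the parameter names, file name, and workflow names are assumptions for illustration):

    # Flat file format -- file location fields in the format editor:
    #   Root directory:  [$$STAGING_DIR]
    #   File name(s):    sales_orders.dat
    #
    # Conditional object inside the job:
    #   If expression:   [$$IS_DELTA] = 'Y'
    #   Then:            WF_Sales_Delta_Load
    #   Else:            WF_Sales_Initial_Load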

Substitution Parameter Configurations are helpful because they let you define multiple sets of values for your substitution parameters. You can use this to set up multiple configurations for:

- Dev / Test / Prod
- Initial vs. Delta
- Enabling verbose debug code in your own script blocks or custom functions
- Specifying multiple file paths, e.g. fileshares in Chicago, L.A., Shanghai, Wrocław, and San Leopoldo.

Substitution Parameters are not objects that can be checked into a central repository, since they aren't actually code objects. As such, there is a specific way to move them between local repositories: you export them to an ATL file and import them into the other local repository.

This is an additional step to include in your migration plans from Dev -> Test -> Production. However, it is a relatively quick procedure for an administrator.

3. Define datastore configurations


Datastore mistake 1: In many customer environments, I log into a local repository and see several datastores named similarly
("HANA_TARGET_DEV", "HANA_TARGET_QA", and "HANA_TARGET_PROD"). Or maybe I see many SAP datastores named after
their SIDs ("BWD", "BWQ", "BWP). If you make this mistake, you need to go through the following unnecessary steps:

- If you move a job from development to test, you have to edit every single dataflow and delete every single table object, replacing the table objects from datastore "HANA_TARGET_DEV" with the ones from "HANA_TARGET_QA".
- This increases the risk of human error -- what if you pick the wrong table by mistake?
- This increases the number of table objects to maintain -- you have to import the same table object 3 different times, one from each different datastore.
- You risk having differences in the table metadata from the different development/test/production datastores. Don't you want to ensure that the code is always the same?

Datastore mistake 2: Since this gets to be so time-consuming, many developers realize that they can just reuse one datastore from dev
to test to production. So you see a datastore named "HANA_TARGET_DEV" or "BWD" in a production local repository. In this case, the
administrators just explain how they change the hostname, username, and password of the datastore when they move it to test or
production. Though this sounds simple, you still run the risk that you must change more than just username/password. In the case of an
SAP ECC source datastore, are the transport file paths the same between your different ECC sources?

The solution to both of these mistakes? Datastore configurations.

Datastore configurations are very powerful. They allow you to have a single datastore that can connect to multiple different sources. They work very similarly to substitution parameter configurations: at run-time, the optimizer selects a single configuration, and this connection information is used for the entire execution of the job and cannot be modified. You set them up in the datastore editor...the Data Services Wiki shows a good example.

I would strongly urge you to avoid the two mistakes above by starting your project with the following principles:

1. Give datastores meaningful names that describe their data domain. Do NOT name them after a specific tier (dev/test/prod), a specific region (AMER/EMEA/APJ), a specific database ("DB2", "HANA", "SYBASEIQ"), or a specific SAP SID (ECD/ECQ/ECP). Just name them after their data: "SALES", "VENDOR", "MATERIALS", "VERTEX", "BANKING". This is important because you cannot rename a datastore once it is defined.

2. Set up multiple datastore configurations inside of each datastore. Multiple datastore configurations should be used when the same metadata exists in multiple systems. If the metadata is different between two systems, they belong in separate datastores.

3. If you have Dev/Test/Prod tiers, make sure to set up separate datastore configurations for Dev/Test/Prod in your development local repositories. No, you don't have to know the correct usernames/passwords for the test or production systems (and in fact, this would be a serious risk!). Set the configurations up anyway! When testers and production administrators promote the code to test or production, the only thing they will need to change is the username and password. This helps avoid the risk of human error during a critical Go-Live, as the sketch after this list shows.
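
Here is an illustrative sketch of what this can look like for a single datastore (all server, database, and user names are made up):

    Datastore: SALES
      Configuration "Dev"  -> server dev-db01,  database SALESDEV, user sales_etl_dev
      Configuration "Test" -> server test-db01, database SALESTST, user sales_etl_tst  (password filled in by testers)
      Configuration "Prod" -> server prod-db01, database SALESPRD, user sales_etl_prd  (password filled in by production administrators)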

For advanced users, you can even use datastore configurations to move from one database platform to another without having to re-
develop all your code.
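
For a sketch of how this can work (the datastore name SALES, the global variable, and the exact platform strings returned by db_type() are assumptions -- check the Reference Guide for the values on your platforms), a script can branch on the database type behind the active datastore configuration:

    # Pick database-specific SQL depending on which platform the active
    # datastore configuration points to.
    if (db_type('SALES') = 'Oracle')
    begin
        $G_LOAD_TS = sql('SALES', 'SELECT SYSDATE FROM DUAL');
    end
    else
    begin
        $G_LOAD_TS = sql('SALES', 'SELECT GETDATE()');
    end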

3.1 Use aliases to map table owners (optional)


If you are using database sources or targets, these tables always have an owner name or schema name (e.g. "SCOTT"."TIGER"). In the
Data Services Designer interface, these owner names exist but are not usually very obvious to the user.

This is usually a problem that manifests itself when you migrate from Dev to Test or Test to Production. Let's say you developed your dataflow and used a source table named "SQLDEVUSR1"."EMPLOYEES". The username "SQLDEVUSR1" is the table owner. You also set up a second datastore configuration for the Test environment, where the username is "SQLTESTUSR5". When you run the job with the Test datastore configuration set as the default, the job crashes at this dataflow with a "TABLE NOT FOUND" error. Why? It connected to the database specified in the Test datastore configuration as username "SQLTESTUSR5" and tried to find a table named "SQLDEVUSR1"."EMPLOYEES". This is a design problem, not a Data Services error.

Instead, you need to tell Data Services how to interpret the name "SQLDEVUSR1" differently depending on which datastore
configuration is active. There is a feature called "Aliases" in each database datastore that lets you control this!

You can create one or more aliases in each database datastore to automatically replace the table owner name defined in the dataflow with the table owner name of your choice. At runtime, the optimizer does a search and replace through the code for any objects from that datastore and maps the alias "SQLDEVUSR1" to "SQLTESTUSR5".

This is a little-known feature, but it saves you a ton of time if you have many developers who connected to various sandbox databases
when developing the code. You can simply set up multiple aliases to search for various (and possibly incorrect) owner names and map
them to what their "real" owner names should be within your official Dev/Test/Production datastore configurations.

4. Define system configurations to map together combinations of substitution parameters & datastore configurations

At this point, you have done the following:

- Created substitution parameters
- Created multiple substitution parameter configurations to control various aspects of run-time behavior
- Created datastores
- Created multiple datastore configurations to connect to different sources of data that have identical metadata

Your setup might look like this:


The final step is to create system configurations. These are combinations of datastore configurations and substitution parameter configurations that let you set up job execution profiles that can be quickly and easily selected at run-time. The optimizer then uses only that combination of configurations for the execution of the entire job. If you have never defined a system configuration in a specific local repository, you will never see it as a drop-down option when you try to run a job. However, after you define system configurations, you will see a convenient drop-down box that shows the names of your various system configurations:

If we use the example above, with 3 datastores that each have 3 configurations and with 6 different substitution parameter configurations, you can now create system configurations as combinations of these. Here is how you might set up your system configurations:
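
For instance (an illustrative mapping; it assumes each datastore's three configurations are named Dev, Test, and Prod):

    System configuration   Datastore configuration (all 3 datastores)   Substitution parameter configuration
    DevInit                Dev                                          DevInit
    DevDelta               Dev                                          DevDelta
    TestInit               Test                                         TestInit
    TestDelta              Test                                         TestDelta
    ProdInit               Prod                                         ProdInit
    ProdDelta              Prod                                         ProdDelta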

After this, when you run or schedule a job, you would see a drop-down with your 6 different system configuration names:

- DevInit
- DevDelta
- TestInit
- TestDelta
- ProdInit
- ProdDelta

To be honest, this isn't a very good example. Why would you want your production local repository to have the ability to easily execute
jobs in your Dev landscape? Thus, you would probably want to set up system configurations that specifically correspond to the
substitution parameter configurations and datastore configurations that you really want to use when you actually run the jobs. So in this
example you would probably want to set up your production local repository system configurations to only include "ProdInit" and
"ProdDelta" so you never make the mistake of selecting one of the Dev or Test configs.

What if you don't select a system configuration at run-time? Each datastore has a "default" datastore configuration. Likewise, there is also a default substitution parameter configuration. If no system configuration is selected, the optimizer uses the default datastore configuration for each datastore and the default substitution parameter configuration for that repository.
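
If you want to verify which configurations a job actually ran with, a quick sketch (it assumes a datastore named SALES and the built-in functions current_system_configuration(), current_configuration(), and nvl() -- see the Reference Guide) is to print them from a script at the start of the job:

    # Log the configurations picked for this execution.
    print('System configuration: ' || nvl(current_system_configuration(), 'none (defaults used)'));
    print('SALES datastore configuration: ' || current_configuration('SALES'));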

Similar to substitution parameter configurations, system configurations cannot be checked into a central repository. They can be
migrated in the same way you saw above with exporting substitution parameters to an ATL file. However, this is probably not necessary -
- system configurations are very quick to define, and you probably only want to create them for the environment that you want to run in
(e.g. "ProdInit" and "ProdDelta").
