
In computer programming, a string is traditionally a sequence of characters,
either as a literal constant or as some kind of variable. The latter may allow
its elements to be mutated and the length changed, or it may be fixed (after
creation). A string is generally understood as a data type and is often
implemented as an array of bytes (or words) that stores a sequence of
elements, typically characters, using some character encoding. A string may
also denote more general arrays or other sequence (or list) data types and
structures.

Depending on programming language and precise data type used, a variable
declared to be a string may either cause storage in memory to be statically
allocated for a predetermined maximum length or employ dynamic allocation to
allow it to hold a variable number of elements.

Let Σ be a non-empty finite set of symbols (alternatively called characters),
called the alphabet. No assumption is made about the nature of the symbols. A
string (or word) over Σ is any finite sequence of symbols from Σ.[1] For
example, if Σ = {0, 1}, then 01011 is a string over Σ.
The length of a string s is the number of symbols in s (the length of the
sequence) and can be any non-negative integer; it is often denoted as |s|.
The empty string is the unique string over Σ of length 0, and is denoted ε or
λ.[1][2]

The set of all strings over Σ of length n is denoted Σ^n. For example, if
Σ = {0, 1}, then Σ^2 = {00, 01, 10, 11}. Note that Σ^0 = {ε} for any
alphabet Σ.

The set of all strings over Σ of any length is the Kleene closure of Σ and is
denoted Σ*. In terms of Σ^n,

$$\Sigma^{*} = \bigcup_{n \in \mathbb{N} \cup \{0\}} \Sigma^{n}$$

For example, if Σ = {0, 1}, then Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010,
011, ...}. Although the set Σ* itself is countably infinite, each element of
Σ* is a string of finite length.

A set of strings over Σ (i.e. any subset of Σ*) is called a formal language
over Σ. For example, if Σ = {0, 1}, the set of strings with an even number of
zeros, {ε, 1, 00, 11, 001, 010, 100, 111, 0000, 0011, 0101, 0110, 1001, 1010,
1100, 1111, ...}, is a formal language over Σ.
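
The definitions of Σ^n and of a formal language can be illustrated with a
short Python sketch. The helper names `strings_of_length` and `has_even_zeros`
are illustrative only and do not come from the text; the alphabet is assumed
to be given as a sequence of single-character symbols.

```python
from itertools import product

def strings_of_length(alphabet, n):
    """Return all strings over `alphabet` of length n (the set written Σ^n)."""
    return ["".join(symbols) for symbols in product(alphabet, repeat=n)]

def has_even_zeros(s):
    """Membership test for the example language: an even number of zeros."""
    return s.count("0") % 2 == 0

sigma = "01"

# Σ^0 contains only the empty string; Σ^2 has four elements.
print(strings_of_length(sigma, 0))
print(strings_of_length(sigma, 2))

# Members of the example language among all strings of length at most 3.
language_prefix = [s
                   for n in range(4)
                   for s in strings_of_length(sigma, n)
                   if has_even_zeros(s)]
print(language_prefix)
```

Because Σ* is infinite, the sketch can only enumerate it up to a chosen length
bound; the membership test itself works for any finite string.
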
Concatenation and substrings

Concatenation is an important binary operation on Σ*. For any two strings s
and t in Σ*, their concatenation is defined as the sequence of symbols in s
followed by the sequence of symbols in t, and is denoted st. For example, if
Σ = {a, b, ..., z}, s = bear, and t = hug, then st = bearhug and ts = hugbear.

String concatenation is an associative, but non-commutative operation. The
empty string serves as the identity element; for any string s, εs = sε = s.
Therefore, the set Σ* and the concatenation operation form a monoid, the
free monoid generated by Σ. In addition, the length function defines a monoid
homomorphism from Σ* to the non-negative integers (that is, a function
$L: \Sigma^{*} \to \mathbb{N} \cup \{0\}$ such that $L(st) = L(s) + L(t)$ for
all $s, t \in \Sigma^{*}$).
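
The monoid laws stated above can be checked on concrete values. The following
is a minimal sketch using Python's built-in `str` type, where `+` plays the
role of concatenation and `len` the role of the length function; the variable
names are chosen for illustration.

```python
s, t, u = "bear", "hug", "s"
empty = ""  # the empty string ε

# Concatenation is written st in the text and `s + t` in Python.
assert s + t == "bearhug" and t + s == "hugbear"  # not commutative
assert (s + t) + u == s + (t + u)                 # associative
assert empty + s == s + empty == s                # ε is the identity element
assert len(s + t) == len(s) + len(t)              # length is a homomorphism
assert len(empty) == 0                            # identity maps to identity
```

Checking a few values does not prove the laws, but it shows what each
statement means for ordinary strings.
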
A string s is said to be a substring or factor of t if there exist (possibly
empty) strings u and v such that t = usv. The relation "is a substring of"
defines a partial order on Σ*, the least element of which is the empty string.

Prefixes and suffixes

A string s is said to be a prefix of t if there exists a string u such that
t = su. If u is nonempty, s is said to be a proper prefix of t. Symmetrically,
a string s is said to be a suffix of t if there exists a string u such that
t = us. If u is nonempty, s is said to be a proper suffix of t. Suffixes and
prefixes are substrings of t. Both the relations "is a prefix of" and "is a
suffix of" are prefix orders.
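
These relations map directly onto operations that Python's `str` type already
provides. The wrapper functions below are hypothetical names introduced only
to line the code up with the definitions.

```python
def is_substring(s, t):
    """True if t = u + s + v for some (possibly empty) strings u and v."""
    return s in t

def is_prefix(s, t):
    """True if t = s + u for some string u."""
    return t.startswith(s)

def is_suffix(s, t):
    """True if t = u + s for some string u."""
    return t.endswith(s)

def is_proper_prefix(s, t):
    """A prefix is proper when the remaining part u is nonempty."""
    return is_prefix(s, t) and len(s) < len(t)

def is_proper_suffix(s, t):
    """A suffix is proper when the leading part u is nonempty."""
    return is_suffix(s, t) and len(s) < len(t)
```

For example, `is_prefix("bear", "bearhug")` and `is_suffix("hug", "bearhug")`
both hold, and the empty string is a prefix, suffix, and substring of every
string.
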
Rotations

A string s = uv is said to be a rotation of t if t = vu. For example, if
Σ = {0, 1} the string 0011001 is a rotation of 0100110, where u = 00110 and
v = 01.
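
One common way to test for a rotation, sketched here in Python, relies on the
observation that every rotation of s appears as a substring of s followed by
itself. The function name `is_rotation` is an illustrative choice.

```python
def is_rotation(s, t):
    """True if s = u + v and t = v + u for some strings u and v."""
    # Every rotation of s occurs as a substring of length len(s) in s + s.
    return len(s) == len(t) and t in s + s

# The example from the text: u = 00110 and v = 01.
u, v = "00110", "01"
assert u + v == "0011001" and v + u == "0100110"
assert is_rotation("0011001", "0100110")
```
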
Reversal

The reverse of a string is a string with the same symbols but in reverse
order. For example, if s = abc (where a, b, and c are symbols of the
alphabet), then the reverse of s is cba. A string that is the reverse of
itself (e.g., s = madam) is called a palindrome, which also includes the empty
string and all strings of length 1.
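
A minimal Python sketch of reversal and the palindrome test follows. It treats
each element of a Python `str` as one symbol, which is an assumption about the
alphabet rather than something the definition requires.

```python
def reverse(s):
    """Return the symbols of s in reverse order."""
    return s[::-1]

def is_palindrome(s):
    """A palindrome is a string equal to its own reverse."""
    return s == reverse(s)

assert reverse("abc") == "cba"
assert is_palindrome("madam")
assert is_palindrome("") and is_palindrome("a")  # empty and length-1 strings
```
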
Lexicographical ordering

It is often useful to define an ordering on a set of strings. If the alphabet
Σ has a total order (cf. alphabetical order) one can define a total order on
Σ* called lexicographical order. For example, if Σ = {0, 1} and 0 < 1, then
the lexicographical order on Σ* includes the relationships ε < 0 < 00 < 000 <
... < 0001 < 001 < 01 < 010 < 011 < 0110 < 01111 < 1 < 10 < 100 < 101 < 111 <
1111 < 11111 ... The lexicographical order is total if the alphabetical order
is, but isn't well-founded for any nontrivial alphabet, even if the
alphabetical order is.
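
Python compares `str` values lexicographically by code point, and the
character "0" sorts before "1", so the example chain can be checked directly.
The second part of the sketch builds the start of an infinite strictly
descending chain, which is why the order is not well-founded.

```python
chain = ["", "0", "00", "000", "0001", "001", "01", "010", "011", "0110",
         "01111", "1", "10", "100", "101", "111", "1111", "11111"]

# The chain from the text is already in lexicographical order.
assert chain == sorted(chain)

# 1 > 01 > 001 > 0001 > ... never reaches a least element.
descending = ["0" * k + "1" for k in range(6)]
assert all(a > b for a, b in zip(descending, descending[1:]))
```
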
String operations

A number of additional operations on strings commonly occur in the formal
theory. These are given in the article on string operations.

Topology

[Figure: hypercube of binary strings of length 3]

Strings admit the following interpretation as nodes on a graph:

- Fixed-length strings can be viewed as nodes on a hypercube.
- Variable-length strings (of finite length) can be viewed as nodes on the
  k-ary tree, where k is the number of symbols in Σ.
- Infinite strings (otherwise not considered here) can be viewed as infinite
  paths on the k-ary tree.

The natural topology on the set of fixed-length strings or variable-length
strings is the discrete topology, but the natural topology on the set of
infinite strings is the limit topology, viewing the set of infinite strings as
the inverse limit of the sets of finite strings. This is the construction used
for the p-adic numbers and some constructions of the Cantor set, and yields
the same topology.

Isomorphisms between string representations of topologies can be found by
normalizing according to the lexicographically minimal string rotation.
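
Normalizing by the lexicographically minimal rotation can be sketched naively
in Python by generating every rotation and taking the smallest. The function
name `least_rotation` is hypothetical, and this quadratic version is only an
illustration; faster approaches such as Booth's algorithm exist.

```python
def least_rotation(s):
    """Return the lexicographically smallest rotation of s (naive sketch)."""
    if not s:
        return s
    return min(s[i:] + s[:i] for i in range(len(s)))

# Two rotations of the same string normalize to the same representative.
assert least_rotation("0011001") == least_rotation("0100110") == "0010011"
```

Two strings are rotations of one another exactly when their normalized forms
are equal, which turns the comparison into a plain equality test.
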

String datatypes

See also: Comparison of programming languages (string functions)

A string datatype is a datatype modeled on the idea of a formal string.
Strings are such an important and useful datatype that they are implemented
in nearly every programming language. In some languages they are available
as primitive types and in others as composite types. The syntax of most
high-level programming languages allows for a string, usually quoted in some
way, to represent an instance of a string datatype; such a meta-string is
called a literal or string literal.

String length

Although formal strings can have an arbitrary (but finite) length, the length
of strings in real languages is often constrained to an artificial maximum. In
general, there are two types of string datatypes: fixed-length strings, which
have a fixed maximum length and which use the same amount of memory
whether this maximum is reached or not, and variable-length strings, whose
length is not arbitrarily fixed and which use varying amounts of memory
depending on their actual size. Most strings in modern programming
languages are variable-length strings. Despite the name, even variable-length
strings are limited in length, although, in general, the limit depends
only on the amount of memory available. The string length can be stored as a
separate integer (which puts a theoretical limit on the length) or implicitly
through a termination character, usually a character value with all bits zero.
See also "Null-terminated" below.

Character encoding

String datatypes have historically allocated one byte per character, and,
although the exact character set varied by region, character encodings were
similar enough that programmers could often get away with ignoring this,
since characters a program treated specially (such as period and space and
comma) were in the same place in all the encodings a program would
encounter. These character sets were typically based on ASCII or EBCDIC.

Logographic languages such as Chinese, Japanese, and Korean (known
collectively as CJK) need far more than 256 characters (the limit of a one
8-bit byte per-character encoding) for reasonable representation. The normal
solutions involved keeping single-byte representations for ASCII and using
two-byte representations for CJK ideographs. Use of these with existing code
led to problems with matching and cutting of strings, the severity of which
depended on how the character encoding was designed. Some encodings
such as the EUC family guarantee that a byte value in the ASCII range will
represent only that ASCII character, making the encoding safe for systems
that use those characters as field separators. Other encodings such as
ISO-2022 and Shift-JIS do not make such guarantees, making matching on byte
codes unsafe. These encodings also were not "self-synchronizing", so that
locating character boundaries required backing up to the start of a string,
and pasting two strings together could result in corruption of the second
string (these problems were much less with EUC as any ASCII character did
synchronize the encoding).
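
The difference between the two families can be observed with Python's
built-in `shift_jis` and `euc_jp` codecs. The sketch below uses a single
katakana character, chosen as an example because its Shift-JIS encoding
contains a byte in the ASCII range; the choice of character is an assumption
made for illustration.

```python
text = "\u30bd"  # KATAKANA LETTER SO

sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")

# In Shift-JIS the second byte of this character equals ASCII backslash
# (0x5C), so a byte-wise search for a backslash reports a false match.
false_match = b"\\" in sjis
print(sjis.hex(), false_match)

# In EUC-JP every byte of this multi-byte character has its high bit set,
# so no byte can be mistaken for an ASCII character.
ascii_free = all(byte >= 0x80 for byte in euc)
print(euc.hex(), ascii_free)
```

A program that splits Shift-JIS data on a backslash or similar separator byte
would cut this character in half, which is the kind of matching problem the
paragraph above describes.
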

Unicode has simplified the picture somewhat. Most programming languages
now have a datatype for Unicode strings. Unicode's preferred byte stream
format UTF-8 is designed not to have the problems described above for older
multibyte encodings. UTF-8, UTF-16 and UTF-32 all require the programmer
to know that the fixed-size code units are different from the "characters";
the main difficulty currently is incorrectly designed APIs that attempt to
hide this difference.

Implementations

Some languages like C++ implement strings as templates that can be used
with any datatype, but this is the exception, not the rule.

Some languages, such as C++ and Ruby, normally allow the contents of a
string to be changed after it has been created; these are termed mutable
strings. In other languages, such as Java and Python, the value is fixed and a
new string must be created if any alteration is to be made; these are termed
immutable strings.
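
The distinction can be seen in Python itself, where `str` is immutable. In the
sketch below, `bytearray` stands in for a mutable string; that substitution is
an illustrative assumption, since Python has no mutable text type of its own.

```python
s = "string"
try:
    s[0] = "S"           # str does not support item assignment
except TypeError:
    s = "S" + s[1:]      # a new string object must be built instead

buf = bytearray(b"string")
buf[0] = ord("S")        # bytearray is changed in place

print(s, buf.decode("ascii"))
```
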

Strings are typically implemented as arrays of bytes, characters, or code
units, in order to allow fast access to individual units or substrings,
including characters when they have a fixed length. A few languages such as
Haskell implement them as linked lists instead.

Some languages, such as Prolog and Erlang, avoid implementing a dedicated
string datatype at all, instead adopting the convention of representing
strings as lists of character codes.

Representations

Representations of strings depend heavily on the choice of character
repertoire and the method of character encoding. Older string
implementations were designed to work with repertoire and encoding defined
by ASCII, or more recent extensions like the ISO 8859 series. Modern
implementations often use the extensive repertoire defined by Unicode along
with a variety of complex encodings such as UTF-8 and UTF-16.

The term bytestring usually indicates a general-purpose string of bytes,
rather than strings of only (readable) characters, strings of bits, or such.
Byte strings often imply that bytes can take any value and any data can be
stored as-is, meaning that there should be no value interpreted as a
termination value.

Most string implementations are very similar to variable-length arrays with
the entries storing the character codes of corresponding characters. The
principal difference is that, with certain encodings, a single logical
character may take up more than one entry in the array. This happens for
example with UTF-8, where single codes (UCS code points) can take anywhere
from one to four bytes, and single characters can take an arbitrary number of
codes. In these cases, the logical length of the string (number of characters)
differs from the logical length of the array (number of bytes in use). UTF-32
avoids the first part of the problem.
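
The different notions of length can be compared on a small example. The sketch
below uses a string made of an "e" followed by a combining accent and then one
emoji, so a reader would perceive two characters; the sample string is an
assumption chosen to exercise each case.

```python
s = "e\u0301\U0001F600"  # 'e' + COMBINING ACUTE ACCENT, then one emoji

code_points = len(s)                            # Python counts code points
utf8_bytes = len(s.encode("utf-8"))             # 1 + 2 + 4 bytes
utf16_units = len(s.encode("utf-16-le")) // 2   # 2 bytes per code unit
utf32_units = len(s.encode("utf-32-le")) // 4   # 4 bytes per code unit

print(code_points, utf8_bytes, utf16_units, utf32_units)
```

UTF-32 makes the number of code units equal to the number of code points, but
the accented letter still spans two code points, so none of these counts
equals the two characters a reader sees.
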

Null-terminated

Main article: Null-terminated string

The length of a string can be stored implicitly by using a special terminating
character; often this is the null character (NUL), which has all bits zero, a
convention used and perpetuated by the popular C programming language.[3]
Hence, this representation is commonly referred to as C string.

In terminated strings, the terminating code is not an allowable character in
any string. Strings with length field do not have this limitation and can also
store arbitrary binary data. In C two things are needed to handle binary data,
a character pointer and the length of the data.

An example of a null-terminated string stored in a 10-byte buffer, along with
its ASCII (or more modern UTF-8) representation as 8-bit hexadecimal
numbers is:

F   R   A   N   K   NUL k   e   f   w
46  52  41  4E  4B  00  6B  65  66  77

The length of the string in the above example, "FRANK", is 5 characters, but
it occupies 6 bytes. Characters after the terminator do not form part of the
representation; they may be either part of another string or just garbage.
(Strings of this form are sometimes called ASCIZ strings, after the original
assembly language directive used to declare them.)
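
Reading such a buffer amounts to scanning for the first NUL byte. The
following Python sketch models the 10-byte buffer from the example as a
`bytes` object; the function name `read_c_string` is hypothetical.

```python
buffer = bytes([0x46, 0x52, 0x41, 0x4E, 0x4B, 0x00, 0x6B, 0x65, 0x66, 0x77])

def read_c_string(buf):
    """Return the bytes before the first NUL; raise if no terminator exists."""
    end = buf.find(b"\x00")
    if end == -1:
        raise ValueError("no NUL terminator found in buffer")
    return buf[:end]

name = read_c_string(buffer)
occupied = len(name) + 1  # the terminator itself takes one byte

print(name.decode("ascii"), len(name), occupied)
```

The bytes after the terminator (k, e, f, w) are never examined, which matches
the statement that they are not part of the representation.
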

Length-prefixed

The length of a string can also be stored explicitly, for example by prefixing
the string with the length as a byte value (a convention used in many Pascal
dialects); as a consequence, some people call it a P-string. Storing the
string length as a byte limits the maximum string length to 255. To avoid such
limitations, improved implementations of P-strings use 16-, 32-, or 64-bit
words to store the string length. When the length field covers the address
space, strings are limited only by the available memory.

Here is the equivalent Pascal string stored in a 10-byte buffer, along with
its ASCII / UTF-8 representation:

length  F   R   A   N   K   k   e   f   w
05      46  52  41  4E  4B  6B  65  66  77
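
A one-byte length prefix can be sketched in Python as follows. The function
names are hypothetical, and the decoder performs no validation of the length
byte, which keeps the sketch short but would not be acceptable for untrusted
input.

```python
def to_pascal_string(data):
    """Prefix `data` with a one-byte length; at most 255 bytes fit."""
    if len(data) > 255:
        raise ValueError("a one-byte length prefix cannot exceed 255")
    return bytes([len(data)]) + data

def from_pascal_string(buf):
    """Read the length byte, then that many payload bytes (no validation)."""
    length = buf[0]
    return buf[1:1 + length]

buffer = bytes([0x05, 0x46, 0x52, 0x41, 0x4E, 0x4B, 0x6B, 0x65, 0x66, 0x77])
print(from_pascal_string(buffer))

# Unlike a terminated string, the payload may contain NUL bytes.
assert from_pascal_string(to_pascal_string(b"a\x00b")) == b"a\x00b"
```
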

Strings as records

Many languages, including object-oriented ones, implement strings as
records in a structure like:

class string {
    int length;   // number of characters currently stored
    char *text;   // pointer to the dynamically allocated character data
};

This implementation is usually hidden, however, and accessed through member
functions. The "text" field points to a dynamically allocated memory area
that may be expanded if needed. See also string (C++).

Linked-list

Both character termination and length codes limit strings. For example, C
character arrays that contain null (NUL) characters cannot be handled directly
by C string library functions, and strings using a length code are limited to
the maximum value of the length code.

Both of these limitations can be overcome by clever programming, of course,
but such workarounds are by definition not standard.

Rough equivalents of the C termination method have historically appeared in
both hardware and software. For example, "data processing" machines like
the IBM 1401 used a special word mark bit to delimit strings at the left,
where the operation would start at the right. This meant that, while the IBM
1401 had a seven-bit word in "reality", almost no one ever thought to use this
as a feature, and override the assignment of the seventh bit to (for example)
handle ASCII codes.

It is possible to create data structures and functions that manipulate them
that do not have the problems associated with character termination and can
in principle overcome length code bounds. It is also possible to optimize the
string represented using techniques from run length encoding (replacing
repeated characters by the character value and a length) and Hamming
encoding.
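
Run length encoding, as described in the parenthesis above, can be sketched in
a few lines of Python using `itertools.groupby`. The function names and the
list-of-pairs output format are illustrative assumptions; real formats pack
the pairs into bytes in various ways.

```python
from itertools import groupby

def rle_encode(s):
    """Replace each run of a repeated character by a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def rle_decode(pairs):
    """Expand (character, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)

encoded = rle_encode("aaabccdddd")
print(encoded)
assert rle_decode(encoded) == "aaabccdddd"
```

The encoding only saves space when the string contains long runs; a string
with no repeated neighbours produces one pair per character.
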

While these representations are common, others are possible. Using ropes
makes certain string operations, such as insertions, deletions, and
concatenations, more efficient.
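
The idea behind a rope can be sketched as a binary tree whose leaves hold
pieces of text. The class below is a deliberately minimal illustration with
hypothetical names: it supports only concatenation, indexing, and conversion
back to a flat string, and it does no rebalancing, which practical rope
implementations need.

```python
class Rope:
    """A minimal rope: a leaf holds text, a branch joins two ropes."""

    def __init__(self, text="", left=None, right=None):
        self.text = text
        self.left = left
        self.right = right
        if left is not None:
            self.length = left.length + right.length
        else:
            self.length = len(text)

    def concat(self, other):
        # No characters are copied: the new node only references both halves.
        return Rope(left=self, right=other)

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError("rope index out of range")
        node = self
        while node.left is not None:
            if index < node.left.length:
                node = node.left
            else:
                index -= node.left.length
                node = node.right
        return node.text[index]

    def __str__(self):
        if self.left is None:
            return self.text
        return str(self.left) + str(self.right)

rope = Rope("bear").concat(Rope("hug"))
print(str(rope), rope.length, rope[4])
```

Concatenation here costs the same regardless of how long the two strings are,
whereas joining two flat arrays requires copying every character.
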

Security concerns

The differing memory layout and storage requirements of strings can affect
the security of the program accessing the string data. String representations
requiring a terminating character are commonly susceptible to buffer
overflow problems if the terminating character is not present, caused by a
coding error or an attacker deliberately altering the data. String
representations adopting a separate length field are also susceptible if the
length can be manipulated. In such cases, program code accessing the string
data requires bounds checking to ensure that it does not inadvertently access
or change data outside of the string memory limits.
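
The bounds check for a length field can be sketched as follows. Python itself
does not permit reads outside a `bytes` object, so the explicit comparison
below only models the check that code in a lower-level language would have to
perform; the function name and the one-byte length format are assumptions.

```python
def read_length_prefixed(buf):
    """Read a one-byte length prefix and validate it against the buffer size."""
    if len(buf) < 1:
        raise ValueError("buffer too short to hold a length field")
    length = buf[0]
    if length > len(buf) - 1:
        # A manipulated length field would otherwise describe bytes
        # that lie outside the buffer.
        raise ValueError("declared length exceeds available data")
    return buf[1:1 + length]

print(read_length_prefixed(b"\x05FRANK"))

try:
    read_length_prefixed(b"\xffFRANK")  # claims 255 bytes, only 5 present
except ValueError as error:
    print("rejected:", error)
```
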

String data is frequently obtained from user input to a program. As such, it
is the responsibility of the program to validate the string to ensure that it
represents the expected format. Performing limited or no validation of user
input can cause a program to be vulnerable to code injection attacks.

Text file strings

In computer-readable text files, for example programming language source
files or configuration files, strings can be represented. The NUL byte is
normally not used as terminator since that does not correspond to the ASCII
text standard, and the length is usually not stored, since the file should be
human editable without bugs.

Two common representations are:

- Surrounded by quotation marks (ASCII 0x22), used by most programming
  languages. To be able to include quotation marks, newline characters etc.,
  escape sequences are often available, usually using the backslash character
  (ASCII 0x5C).
- Terminated by a newline sequence, for example in Windows INI files.
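
The quoted representation with backslash escapes can be demonstrated with
Python's standard `json` module. JSON is used here only as one concrete
example of such a format, which is an assumption made for illustration; other
languages and file formats define their own escape rules.

```python
import json

raw = 'He said "hi"\nand left a \\ behind'

# json.dumps surrounds the text with quotation marks and escapes the
# quotation marks, the newline, and the backslash inside it.
quoted = json.dumps(raw)
print(quoted)

# json.loads reverses the process and recovers the original string.
assert json.loads(quoted) == raw
```

After escaping, the quoted form fits on a single line of a text file even
though the original string contains a newline.
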

When a string appears literally in source code, it is known as a string
literal and has a representation that denotes it as such.

In formal languages, which are used in mathematical logic and theoretical
computer science, a string is a finite sequence of symbols that are chosen
from a set called an alphabet.
