Q&A: Does string::data() return a pointer valid for size() elements, or capacity() elements?

A reader asked:

In C++17, for std::string::data(), is the returned buffer valid for the range [data(), data() + size()), or is it valid for [data(), data() + capacity())?

The latter seems more intuitive and what I think most people would expect reserve() to create given the non-const version of data() since C++17.

… and then helpfully included the answer, but in fairness clearly they were wondering whether cppreference.com was correct:

Relevant quote from cppreference.com: … “Returns a pointer to the underlying array serving as character storage. The pointer is such that the range [data(); data() + size()) is valid and the values in it correspond to the values stored in the string.”

Yes, cppreference.com is correct. Here’s the quote from the current draft standard:

  • 2 A specialization of basic_string is a contiguous container (22.2.1).
  • 3 In all cases, [data(), data() + size()] is a valid range, data() + size() points at an object with value charT() (a “null terminator”), and size() <= capacity() is true.
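To make the quoted wording concrete, here is a minimal sketch that checks the three guarantees for a given string without reading anything past data() + size(). The helper names are made up for illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: checks the three guarantees quoted above for one string.
// It never reads past data() + size().
bool satisfies_quoted_guarantees(const std::string& s) {
    const char* p = s.data();
    // Every element of [data(), data() + size()) is a stored character.
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (p[i] != s[i]) return false;
    }
    // data() + size() is dereferenceable and holds charT(), here '\0'.
    if (p[s.size()] != '\0') return false;
    // The third guarantee in the quoted paragraph.
    return s.size() <= s.capacity();
}

// Example use: both checks should hold on a conforming implementation.
void demo_quoted_guarantees() {
    assert(satisfies_quoted_guarantees("hello"));
    assert(satisfies_quoted_guarantees(""));
}
```

Note that the sketch never touches data() + size() + 1 or anything beyond it, even when capacity() is larger than size().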

Regarding this potential alternative:

or is it valid for [data(), data() + capacity())?

No, that would be strange, because it would mean intentionally supporting reading uninitialized characters in any extra raw memory at the end of the string’s current memory block.

Note that the first part of the above quote from the standard hints at the consistency issue: A string is a container, and we want containers to be consistent. We certainly wouldn’t want vector<widget>::data() to behave that way and let callers see raw memory with unconstructed objects.
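The vector<widget> point can be sketched as follows. The widget type and the counter are hypothetical, invented only to observe how many objects get constructed: reserve() should construct none, while resize() constructs one per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical counter, for illustration only.
int g_widgets_constructed = 0;

// Hypothetical type that records every construction.
struct widget {
    widget() { ++g_widgets_constructed; }
    widget(const widget&) { ++g_widgets_constructed; }
};

// Returns how many widgets were constructed by reserve(n) alone.
int widgets_constructed_by_reserve(std::size_t n) {
    g_widgets_constructed = 0;
    std::vector<widget> v;
    v.reserve(n);   // obtains raw storage; no widget objects exist in it yet
    assert(v.size() == 0 && v.capacity() >= n);
    return g_widgets_constructed;
}

// Returns how many widgets were constructed by resize(n) on an empty vector.
int widgets_constructed_by_resize(std::size_t n) {
    g_widgets_constructed = 0;
    std::vector<widget> v;
    v.resize(n);    // default-constructs n elements
    assert(v.size() == n);
    return g_widgets_constructed;
}
```

After reserve(10) the count stays at zero, so a data() that exposed capacity() elements would be handing out pointers to objects that do not exist.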

The latter [… is …] what I think most people would expect reserve() to create

s/reserve/resize/ and I’ll agree :)

Any container’s size()/resize() is about the data you stored in it and that it’s holding for you. Any container’s capacity()/reserve() is about the underlying raw memory buffer just to let you help the container optimize its raw memory management, but it isn’t intended to give you access to the allocated-but-unused memory.
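As a rough illustration of that division of labor, the sketch below fills a string from a C-style source. The function name is hypothetical. reserve() only prepares storage, resize() is what makes the characters exist, and the write through the non-const data() (C++17) stays inside [data(), data() + size()).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copies a NUL-terminated C string into a std::string
// by sizing the string first and then writing through data().
std::string copy_via_resize(const char* src) {
    const std::size_t len = std::strlen(src);
    std::string s;
    s.reserve(len);                   // capacity may grow, size() is still 0
    assert(s.size() == 0);            // only the terminator at data()[0] is readable
    s.resize(len);                    // now len characters exist, each charT()
    std::memcpy(s.data(), src, len);  // writes only inside the valid range
    return s;
}
```

The cost of this approach is the one raised in the comments: resize() initializes the characters before they are overwritten.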

7 thoughts on “Q&A: Does string::data() return a pointer valid for size() elements, or capacity() elements?”

  1. I suspect that the reserve/capacity mechanism is purposefully NOT a hint. It is important in some applications to be able to control when and if memory allocation occurs. Without this mechanism, one would have to roll their own string or write some code using raw char buffers.

  2. The capacity() method being public seems like a mistake. If it was hidden, reserve() could be defined to just be a hint.

  3. I agree with the reader: “The latter [… is …] what I think most people would expect reserve() to create”. There is a non-const method std::basic_string::data(). It has the caveat of “Modifying the past-the-end null terminator stored at data()+size() to any value other than CharT() has undefined behavior.”, but I disagree with that, too.

    If I’m using a 3rd party function (library, c code, etc) to populate a string, it shouldn’t be necessary to initialize the memory before the external function populates it. Ex:

    const char msg[] = "Hello, World.";
    const std::size_t len = std::strlen(msg); // needs <cstring>
    
    std::string s;
    s.resize(len); // or s.reserve(len);
    std::strcpy( s.data(), msg ); // with resize(), also rewrites the '\0' at data() + size()
    

    Calling s.resize() results in the buffer being written twice. Calling s.reserve() (UB!) doesn’t work as I don’t think there is a mechanism to indicate the length without overwriting it.

    Though – for such a problem, it may be better to use a char array and use string_view.

  4. I think there’s a disconnect here between what the reader intended to ask and the question Herb answers, and it boils down to how you interpret the word “valid.”

    Everything Herb said is correct from the point of view of the C++ standard.

    But the standard also places constraints on the implementation (and I think this is what commenter Arne Mertz was referencing) that effectively guarantee that the memory beyond the terminator up to the capacity _exists_.

    These requirements in turn constrain that memory in some interesting ways:

    1. data() returns a pointer to the underlying data storage.
    2. The characters are stored contiguously.
    3. An insertion invalidates iterators only if it exceeds capacity.

    I cannot imagine an implementation that meets these constraints where `data()` doesn’t return a pointer to the beginning of a buffer large enough to hold (at least) capacity() characters. Consider:

    std::string s = "abc";
    s.reserve(100);
    const char *ptr = s.data() + 10;
    

    Here, `ptr` is a valid pointer. I cannot dereference it, but I can compare it to any other pointer into that same buffer (until that buffer is freed or reallocated).

    Compare that to:

    char a[] = "abc";
    const char *ptr = a + 10;
    

    Here, `ptr` is not valid, and I may have already triggered UB just for doing that pointer math. Not only can’t I dereference it, I can’t even compare it to any other pointer in my program.

  5. Additionally, an insertion that does not require a reallocation will not invalidate iterators to elements before the inserted element. At least that’s true for std::vector. std::string may still have additional invalidation semantics, but cases not exceeding capacity will surely play a role.

  6. It sounds like the reader was thinking too much about the likely implementation or, even worse, thinking about filling the string to capacity using the pointer returned by data() with const cast away. However, this does make me wonder what capacity() is really guaranteeing. If you can’t get access to the memory between size() and capacity(), what good does it do to know how big it is?

    I think I know the answer: capacity(), along with reserve(), lets one add to the string (keeping it under capacity) while being guaranteed no memory allocation will occur. Is that it?
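The reading of capacity() and reserve() in the last comment can be observed with a small sketch like the one below. The function name is hypothetical. It reports whether data() kept the same address while the string grew up to the reserved size. Treat it as an observation of typical implementations rather than a proof of a guarantee: the standard's wording on this point is more explicit for vector than for basic_string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: reserves room for n characters, appends n characters,
// and reports whether data() stayed at the same address throughout,
// i.e. whether no reallocation was observed.
bool appends_without_reallocating(std::size_t n) {
    std::string s;
    s.reserve(n);
    const char* before = s.data();
    for (std::size_t i = 0; i < n; ++i) {
        assert(s.size() < s.capacity());  // still room: capacity() >= n > size()
        s.push_back('x');                 // grows size(), not capacity()
    }
    return s.data() == before && s.size() == n;
}
```

The appended characters only become reachable through data() once push_back() has added them; the unused tail between size() and capacity() is never read or written directly.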
