/dev/null/onishy

どーも。

Hacking the VLC - How the VLC media player works

This is a translation of my previous article from Japanese.

So, let's begin with analyzing how the vlc media player is working.

The version I used is 2.2.1, which can be downloaded from here. https://www.videolan.org/vlc/download-sources.html

Multi-Threading of VLC

Since VLC is highly multi-threaded program, I must first explain the role of each thread. The important threads to note are the following:

  • Playlist: src/playlist/thread.c
    • does managing of the playlist
  • Video Output: src/video_output/video_output.c
    • does output of the video
  • Decoder: src/input/decoder.c
    • does decoding of the video. Filters are also applied here.
  • The drawing thread (i.e. Qt)

In fact there seems to be more than 10 additional threads running, but I couldn't follow all of them.

If you are interested in those threads, you might want to watch the function "vlc_clone", which is called every time the new thread is created. But be sure that, since VLC is designed to work on multi-platform, it depends on the device which GUI thread is actually called.

Video Processing of VLC

Whoo! Now we are ready to understand how the video is processed inside the VLC!

The video is processed in the following steps:

  1. The video is loaded by Playlist thread.
    • The signals (Play/Stop, Pause, etc.) from GUI thread is received and processed here.
  2. The Decoder thread decodes the video with ffmpeg.
    • The decoded frames are then pushed to picture_fifo.
  3. VideoOutput thread outputs the frame as decoding goes on.
    • the frame is popped from the fifo.
    • the process decides whether the frame should be displayed or not based on the timestamp.

This is basically all what VLC does.

The picture_fifo mentioned above is the fifo of decoded pictures. The frames are pushed in order decoded and popped when played. See picture_fifo.c for further implementations.

There is another struct similar to picture_fifo - picture_pool. This is written in picture_pool.c. It seems like it does bind some pictures or mutex-and-gc-like things. I didn't need to touch these components, so I don't know what it actually does.

Structs

There are many VLC-unique structs defined. Fortunately, today there are BUNCH of libraries floating around which make me not write these stuffs. But it wasn't that easy in 2001. These unique structures make the VLC sources very very complicated.

The astronomical numbers of structs are defined in VLC, but just let me list the ones I think are important (and interesting).

vlc_object_t: Object

Defined in vlc_common.h. The main function can be found in vlc_variables.h.

Since the VLC was developed in 2001, and maybe C++ wan't popular then, the core of VLC is written in C. But I can't imagine how hard to write such a huge program without object-oriented language.

Therefore, VLC has written all those object-oriented structures only with C. Maybe developers were too mad that they couldn't have private variables and member functions in C structs.

I guess that the developers were too happy to use those structures then, but they are slightly outdated today. Hmm...

vlc_object_t creates an object as it is given the type and the name string. The content of an object is described in vlc_value_t union as following:

/**
 * VLC value structure
 */
typedef union
{
    int64_t         i_int;
    bool            b_bool;
    float           f_float;
    char *          psz_string;
    void *          p_address;
    vlc_object_t *  p_object;
    vlc_list_t *    p_list;
    mtime_t         i_time;
    struct { int32_t x; int32_t y; } coords;

} vlc_value_t;

vlc_variables.h has many functions defined for each of the supported types. The functions are given the prefix of var_ as a namespace. All basic variables used in VLC are using this structure.

block_t, picture_t: Pictures

VLC reads the video by block, and writes by frame. block_t and picture_t are defined for these purposes.

block_t is a multi-purpose data block and defined in vlc_block.h:

struct block_t
{
    block_t    *p_next;

    uint8_t    *p_buffer; /**< Payload start */
    size_t      i_buffer; /**< Payload length */
    uint8_t    *p_start; /**< Buffer start */
    size_t      i_size; /**< Buffer total size */

    uint32_t    i_flags;
    unsigned    i_nb_samples; /* Used for audio */

    mtime_t     i_pts;
    mtime_t     i_dts;
    mtime_t     i_length;

    /* Rudimentary support for overloading block (de)allocation. */
    block_free_t pf_release;
};

As you can see, block_t is a lineare list, which can store encoding/decoding information in pts/dts. Thus, block_t is used for managing data by or transfer data in block.

It is obvious that picture_t stores a picture. The purpose of picture_t is limited to storing decoded frames, and single object itself contains enough information for displaying the picture.

/**
 * Video picture
 */
struct picture_t
{
    /**
     * The properties of the picture
     */
    video_frame_format_t format;

    plane_t         p[PICTURE_PLANE_MAX];     /**< description of the planes */
    int             i_planes;                /**< number of allocated planes */

    /** \name Picture management properties
     * These properties can be modified using the video output thread API,
     * but should never be written directly */
    /**@{*/
    mtime_t         date;                                  /**< display date */
    bool            b_force;
    /**@}*/

    /** \name Picture dynamic properties
     * Those properties can be changed by the decoder
     * @{
     */
    bool            b_progressive;          /**< is it a progressive frame ? */
    bool            b_top_field_first;             /**< which field is first */
    unsigned int    i_nb_fields;                  /**< # of displayed fields */
    void          * context;          /**< video format-specific data pointer,
             * must point to a (void (*)(void*)) pointer to free the context */
    /**@}*/

    /** Private data - the video output plugin might want to put stuff here to
     * keep track of the picture */
    picture_sys_t * p_sys;

    /** This way the picture_Release can be overloaded */
    struct
    {
        atomic_uintptr_t refcount;
        void (*pf_destroy)( picture_t * );
        picture_gc_sys_t *p_sys;
    } gc;

    /** Next picture in a FIFO a pictures */
    struct picture_t *p_next;
};

You can think of IplImage in OpenCV. Yet, picture_t has no such capability that it can be directly casted to IplImage. I won't explain the reason today, so wait for a while.

Terms

  • PTS/DTS
    • PTS is an abbreviation for Presentation Time Stamp, and presents the time when the frame SHOULD be played.
    • DTS stands for Decode Time Stamp, which is the time the frame was decoded.
  • I/P/B frame
    • The idea of I/P/B frames is that the frames can be classified to 3 types, I/P/B. This method is used widely, including ffmpeg.
    • Briefly explaining, there are I frames which contain all pixel information. Between I frames, there 2 types of frames, P and B, which differ in compressing level. The P frames are softly compressed and has DIFFERENCES from the previous I or P frame. B frames are hardly compressed frames between P frames, which contains complementary infomation between them. B frame can only be decoded by referencing to previous and next P frames.
    • By using this method, the process can decode the frames under both slow and fast environments with a good quality. (from Wikipedia)
    • decoder_synchro.c has further descriptions.
/*
 * DISCUSSION : How to Write an efficient Frame-Dropping Algorithm
 * ==========
 *
 * This implementation is based on mathematical and statistical
 * developments. Older implementations used an enslavement, considering
 * that if we're late when reading an I picture, we will decode one frame
 * less. It had a tendancy to derive, and wasn't responsive enough, which
 * would have caused trouble with the stream control stuff.
 *
 * 1. Structure of a picture stream
 *    =============================
 * Between 2 I's, we have for instance :
 *    I   B   P   B   P   B   P   B   P   B   P   B   I
 *    t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
 * Please bear in mind that B's and IP's will be inverted when displaying
 * (decoding order != presentation order). Thus, t1 < t0.
 *
 * 2. Definitions
 *    ===========
 * t[0..12]     : Presentation timestamps of pictures 0..12.
 * t            : Current timestamp, at the moment of the decoding.
 * T            : Picture period, T = 1/frame_rate.
 * tau[I,P,B]   : Mean time to decode an [I,P,B] picture.
 * tauYUV       : Mean time to render a picture (given by the video_output).
 * tau´[I,P,B] = 2 * tau[I,P,B] + tauYUV
 *              : Mean time + typical difference (estimated to tau/2, that
 *                needs to be confirmed) + render time.
 * DELTA        : A given error margin.
 *
 * 3. General considerations
 *    ======================
 * We define three types of machines :
 *      14T > tauI : machines capable of decoding all I pictures
 *      2T > tauP  : machines capable of decoding all P pictures
 *      T > tauB   : machines capable of decoding all B pictures
 *
 * 4. Decoding of an I picture
 *    ========================
 * On fast machines, we decode all I's.
 * Otherwise :
 * We can decode an I picture if we simply have enough time to decode it
 * before displaying :
 *      t0 - t > tau´I + DELTA
 *
 * 5. Decoding of a P picture
 *    =======================
 * On fast machines, we decode all P's.
 * Otherwise :
 * First criterion : have time to decode it.
 *      t2 - t > tau´P + DELTA
 *
 * Second criterion : it shouldn't prevent us from displaying the forthcoming
 * I picture, which is more important.
 *      t12 - t > tau´P + tau´I + DELTA
 *
 * 6. Decoding of a B picture
 *    =======================
 * On fast machines, we decode all B's. Otherwise :
 *      t1 - t > tau´B + DELTA
 * Since the next displayed I or P is already decoded, we don't have to
 * worry about it.
 *
 * I hope you will have a pleasant flight and do not forget your life
 * jacket.
 *                                                  --Meuuh (2000-12-29)
 */

This is how VLC can play the video on both fast and slow machines.

I will add more info if I could recall more...

VLC弄ってみた#1 - VLCの仕組み

初回の今回はVLCの内部構造について分析したことを紹介します。大学のレポートを兼ねた記事なので少し詳しすぎめに書きます。

tarballはこちらから入手でき、解析にはversion 2.2.1を用いました。

役に立つ資料

逆にこれくらいしか役に立つ資料がない…

VLCのスレッド管理

VLCは高度にマルチスレッド化されており、例えば、重要なスレッドとしては以下が挙げられます。

  • Playlist: src/playlist/thread.c
    • 再生リストを管理するスレッド。
  • Video Output: src/video_output/video_output.c
    • 動画の出力を行うスレッド。
  • Decoder: src/input/decoder.c
    • 動画のデコードを行うスレッド。フィルタ管理も行う。
  • Qtなどの描画処理を行うスレッド。

当然、実際には音声関連やネットワーク関連など、もっと多くのスレッドが動いていますが、さすがに追いきれません。

vlcでは、スレッドの生成はvlc_cloneという関数を通じて行われます。詳しく調べたい場合はこの関数を追うと良いでしょう。

なお、VLCマルチプラットフォームに対応しているため、スレッド管理やGUI回りでは、OSごとに違う関数が呼ばれているので注意しましょう。

VLCの動画処理

上のスレッドでも触れた4つのスレッドが、動画処理の肝となるスレッドです。

VLCが動画を処理する流れは次のようになっています。

  1. Playlistから動画が読み込まれる。
    • GUIからの再生/停止の信号処理、再生管理はこいつが行う。
  2. Decoderスレッドがffmpegを利用して動画をどんどんデコードする。
    • デコードされたフレームはpicture_fifoにpushする。
  3. VideoOutputのスレッドがDecodeと並行して動画を出力する。
    • フレームはfifoからpopされる。
    • この際、フレームにタグ付けられた時間情報から、出力されるべきフレームかどうかが決定される。

基本的にやっていることはこれだけです。

ここで、picture_fifoは何かというと、名前の通りデコードされた画像のFIFOです。デコードされた順にpushされ、再生する際にpopされます。実装はpicture_fifo.cに書かれています。

fifoと似た構造体に、picture_poolというものもあります。こちらはpicture_pool.cに実装があります。画像をまとめて管理したり、mutexやgc的な処理を行ったりしているような雰囲気も出していますが、残念ながらあまり直接触る機会はなく、深入りはしませんでした。

構造体

VLCでは独自の構造体が数多く定義されています。現代であれば、恐らくライブラリを使って補うような部分も、全て自前で書いてあります。これがVLCを複雑なソフトウェアにしている一因となっています。

構造体は死ぬほど定義されていますが、ここでは特によく見かける(興味深い)構造体を挙げておきます。

vlc_object_t: オブジェクト

vlc_common.hで定義されていて、主な役割はvlc_variables.hを見るとわかります。

VLCは2001年に開発されたソフトウェアで、当時はC++が主流でなかったのか分かりませんが、コアはC言語で書かれています。しかしこれだけ大規模なソフトウェアとなると、オブジェクト指向なしに記述するのは至難の業です。

そこで、VLCでは何を血迷ったか、独自でオブジェクト指向的なクラスを書き上げてしまいました。構造体ではprivateなメンバやメンバ関数が持てないので、苦肉の策として書いたのでしょう。

恐らく当時はワッショイワッショイ言いながら使われていたのでしょうが、今となっては少々使いにくいと言わざるを得ません。

vlc_object_tは型と名前の文字列などを与えられていて、実際の中身は次のようなunionで定義されるvlc_value_tです。

/**
 * VLC value structure
 */
typedef union
{
    int64_t         i_int;
    bool            b_bool;
    float           f_float;
    char *          psz_string;
    void *          p_address;
    vlc_object_t *  p_object;
    vlc_list_t *    p_list;
    mtime_t         i_time;
    struct { int32_t x; int32_t y; } coords;

} vlc_value_t;

vlc_variables.hには、それぞれの型に対応する関数がずらりと並べられていて壮観です。変数関連の関数には、名前空間としてvar_のプレフィクスが与えられています。基本的に、VLCの変数はこのオブジェクトによって管理されています。

block_t, picture_t: 画像関連

VLCでは、動画ファイルをブロック単位で読み込み、フレーム単位で出力します。そのために用意された構造体がblock_tやpicture_tです。

block_tは汎用的なブロックを表す構造体で、vlc_block.hで次のように定義されています。

struct block_t
{
    block_t    *p_next;

    uint8_t    *p_buffer; /**< Payload start */
    size_t      i_buffer; /**< Payload length */
    uint8_t    *p_start; /**< Buffer start */
    size_t      i_size; /**< Buffer total size */

    uint32_t    i_flags;
    unsigned    i_nb_samples; /* Used for audio */

    mtime_t     i_pts;
    mtime_t     i_dts;
    mtime_t     i_length;

    /* Rudimentary support for overloading block (de)allocation. */
    block_free_t pf_release;
};

見ての通り、線形リストで、pts/dts[用語解説]を含んでいるため、エンコード/デコード情報を格納することができます。これらのデータをブロック単位で管理したり、送受する場合に使われる構造体です。

picture_tは、その名の通り画像を格納します。こちらはデコード済みの画像限定で、このオブジェクト1つで画像の表示に必要な情報は全て揃っています。

/**
 * Video picture
 */
struct picture_t
{
    /**
     * The properties of the picture
     */
    video_frame_format_t format;

    plane_t         p[PICTURE_PLANE_MAX];     /**< description of the planes */
    int             i_planes;                /**< number of allocated planes */

    /** \name Picture management properties
     * These properties can be modified using the video output thread API,
     * but should never be written directly */
    /**@{*/
    mtime_t         date;                                  /**< display date */
    bool            b_force;
    /**@}*/

    /** \name Picture dynamic properties
     * Those properties can be changed by the decoder
     * @{
     */
    bool            b_progressive;          /**< is it a progressive frame ? */
    bool            b_top_field_first;             /**< which field is first */
    unsigned int    i_nb_fields;                  /**< # of displayed fields */
    void          * context;          /**< video format-specific data pointer,
             * must point to a (void (*)(void*)) pointer to free the context */
    /**@}*/

    /** Private data - the video output plugin might want to put stuff here to
     * keep track of the picture */
    picture_sys_t * p_sys;

    /** This way the picture_Release can be overloaded */
    struct
    {
        atomic_uintptr_t refcount;
        void (*pf_destroy)( picture_t * );
        picture_gc_sys_t *p_sys;
    } gc;

    /** Next picture in a FIFO a pictures */
    struct picture_t *p_next;
};

基本的にOpenCVのIplImageのようなものを想像すると良いです。 ただ、IplImageにそのままキャストできるほどの互換性はありません。これについては次回以降に詳しく説明する予定です。

用語解説

  • PTS/DTS
    • PTSはPresentation Time Stampの略で、動画が実際に再生されるべきタイミングを表す。
    • DTSはDecode Time Stampの略で、動画がデコードされたタイミングを表す。
  • I/P/B frame
    • 動画ファイルのフレームを3種類に分ける。
      • ffmpegなどで一般に用いられている手法。
    • ざっくり言うと、全ての画像情報が含まれたIフレームの間に、圧縮率の異なる2つのフレームが存在して、PフレームBフレームと呼ばれている。圧縮率はB > Pで、Pフレームはひとつ前のPまたはIフレームからの差分を持っている。BフレームPフレームの間を補完するための情報を持っていて、ひとつ前と後のどちらのフレームの情報も参照する。
    • こうすることで、例えば極端に処理速度の遅いコンピュータではIフレームのみ、余裕があればPフレームBフレームをデコードするようにすれば、スペックに応じたクオリティのデコードができる仕組みになっているらしい。(from Wikipedia)
    • decoder_synchro.cに詳細が書いてある。
/*
 * DISCUSSION : How to Write an efficient Frame-Dropping Algorithm
 * ==========
 *
 * This implementation is based on mathematical and statistical
 * developments. Older implementations used an enslavement, considering
 * that if we're late when reading an I picture, we will decode one frame
 * less. It had a tendancy to derive, and wasn't responsive enough, which
 * would have caused trouble with the stream control stuff.
 *
 * 1. Structure of a picture stream
 *    =============================
 * Between 2 I's, we have for instance :
 *    I   B   P   B   P   B   P   B   P   B   P   B   I
 *    t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
 * Please bear in mind that B's and IP's will be inverted when displaying
 * (decoding order != presentation order). Thus, t1 < t0.
 *
 * 2. Definitions
 *    ===========
 * t[0..12]     : Presentation timestamps of pictures 0..12.
 * t            : Current timestamp, at the moment of the decoding.
 * T            : Picture period, T = 1/frame_rate.
 * tau[I,P,B]   : Mean time to decode an [I,P,B] picture.
 * tauYUV       : Mean time to render a picture (given by the video_output).
 * tau´[I,P,B] = 2 * tau[I,P,B] + tauYUV
 *              : Mean time + typical difference (estimated to tau/2, that
 *                needs to be confirmed) + render time.
 * DELTA        : A given error margin.
 *
 * 3. General considerations
 *    ======================
 * We define three types of machines :
 *      14T > tauI : machines capable of decoding all I pictures
 *      2T > tauP  : machines capable of decoding all P pictures
 *      T > tauB   : machines capable of decoding all B pictures
 *
 * 4. Decoding of an I picture
 *    ========================
 * On fast machines, we decode all I's.
 * Otherwise :
 * We can decode an I picture if we simply have enough time to decode it
 * before displaying :
 *      t0 - t > tau´I + DELTA
 *
 * 5. Decoding of a P picture
 *    =======================
 * On fast machines, we decode all P's.
 * Otherwise :
 * First criterion : have time to decode it.
 *      t2 - t > tau´P + DELTA
 *
 * Second criterion : it shouldn't prevent us from displaying the forthcoming
 * I picture, which is more important.
 *      t12 - t > tau´P + tau´I + DELTA
 *
 * 6. Decoding of a B picture
 *    =======================
 * On fast machines, we decode all B's. Otherwise :
 *      t1 - t > tau´B + DELTA
 * Since the next displayed I or P is already decoded, we don't have to
 * worry about it.
 *
 * I hope you will have a pleasant flight and do not forget your life
 * jacket.
 *                                                  --Meuuh (2000-12-29)
 */

要するに、処理速度に応じてどのフレームをデコードするかを決めているという話。

何か思い出したら書き足します。