A Basic Guide to sqlparse, Python's SQL Parsing Library

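Preface

sqlparse is a non-validating SQL parser for Python. It parses SQL statements and exposes a simple API for accessing the parsed structure, which helps when you need to break down complex SQL queries, extract information from them, or perform basic analysis and manipulation of SQL statements.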

I. Basic functions:

The __init__.py module of sqlparse provides four basic functions.

1.parse(sql)

Parses a string containing one or more SQL statements into Python objects, which together form an abstract syntax tree (AST).
Source:

def parse(sql, encoding=None):
    """Parse sql and return a list of statements.

    :param sql: A string containing one or more SQL statements.
    :param encoding: The encoding of the statement (optional).
    :returns: A tuple of :class:`~sqlparse.sql.Statement` instances.
    """
    return tuple(parsestream(sql, encoding))

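parse splits the input at statement separators and returns a tuple of statements; the tokens of each statement can then be walked recursively.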

import sqlparse

sql = """create table foo (
                 id integer primary key comment 'id_comm',
                 title varchar(200) not null comment 'id_comm',
                 description text comment 'id_comm');"""

parsed = sqlparse.parse(sql)[0]

print(parsed)

2.format(sql)

Formats SQL and returns the formatted statement as a string. Source:

def format(sql, encoding=None, **options):
    """Format *sql* according to *options*.

    Available options are documented in :ref:`formatting`.

    In addition to the formatting options this function accepts the
    keyword "encoding" which determines the encoding of the statement.

    :returns: The formatted SQL statement as string.
    """

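Parameters:

sql: the SQL statement string to format.
reindent=True: re-indents the SQL statement so that its clauses line up.
keyword_case='upper': converts SQL keywords to upper case. Accepted values are 'lower', 'upper', and 'capitalize'.
Other optional parameters include indent_width (the number of spaces per indentation level, default 2) and wrap_after (the column limit, in characters, after which comma-separated lists are wrapped), which customize the output further.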

import sqlparse

sql = """select * from tbl where id > 10;"""

# Named "formatted" so the built-in format() is not shadowed
formatted = sqlparse.format(sql, reindent=True, keyword_case='upper')

print(formatted)

# SELECT *
# FROM tbl
# WHERE id > 10;

3.split()

Splits SQL text into individual statements at statement separators and returns them as a list of strings. Source:

def split(sql, encoding=None):
    """Split *sql* into single statements.

    :param sql: A string containing one or more SQL statements.
    :param encoding: The encoding of the statement (optional).
    :returns: A list of strings.
    """

import sqlparse

sql = """select * from tbl where id > 10;select * from tbl where id > 20;"""

split = sqlparse.split(sql)

print(split)
# ['select * from tbl where id > 10;', 'select * from tbl where id > 20;']

4.parsestream()

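Similar to parse, but parses SQL from a stream. It is designed for SQL that is read from streaming input (such as a file, a network connection, or any iterable object) instead of being loaded into memory all at once as a single string, so memory can be managed more efficiently when processing large SQL files or continuous data streams.
Source: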

def parsestream(stream, encoding=None):
    """Parses sql statements from file-like object.

    :param stream: A file-like object.
    :param encoding: The encoding of the stream contents (optional).
    :returns: A generator of :class:`~sqlparse.sql.Statement` instances.
    """

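import sqlparse

# Pass the open file object to parsestream and iterate over the statements it yields
with open('../static/pre_sql.sql', 'r', encoding='utf-8') as file:
    for statement in sqlparse.parsestream(file):
        print(statement)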

II. Token

Source:

class Token:
    """Base class for all other classes in this module.

    It represents a single token and has two instance attributes:
    ``value`` is the unchanged value of the token and ``ttype`` is
    the type of the token.
    """

    def __init__(self, ttype, value):
        value = str(value)
        self.value = value
        self.ttype = ttype
        self.parent = None
        self.is_group = False
        self.is_keyword = ttype in T.Keyword
        self.is_whitespace = self.ttype in T.Whitespace
        self.normalized = value.upper() if self.is_keyword else value

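sqlparse.sql.Token: the most basic token class. It represents one atomic part of a SQL statement, such as a word or a symbol, and has the following attributes:

value: the actual text of the token, for example a keyword such as select or an identifier such as a table name.
ttype: the type of the token, one of the values defined in sqlparse.tokens, such as Keyword, Name, or Punctuation.
parent: the group that contains the token. Note that the constructor above stores no position or start_pos attribute, so a token does not record its offset in the original SQL text.
is_keyword, is_whitespace, is_group, normalized: convenience flags plus the upper-cased form of keywords, all set in the constructor above.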
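Related Token subclasses and concepts
sqlparse.sql.Identifier: represents SQL identifiers such as table names and column names. It offers extra methods such as get_real_name() and get_alias().
Keyword: SQL keywords such as select, from, and where. This is a token type in sqlparse.tokens rather than a class in sqlparse.sql; keywords are plain Token instances whose ttype is Keyword.
Punctuation: SQL punctuation such as the comma and the semicolon. Like Keyword, it is a token type in sqlparse.tokens, not a class.
sqlparse.sql.Comment: represents SQL comments, either line comments (-- …) or block comments (/* … */).
sqlparse.sql.Comparison: contains a comparison operator (such as =, !=, <, >) together with the operands on both sides, which supports finer-grained expression analysis.
sqlparse.sql.Statement: represents an entire SQL statement. It is a tree made up of tokens and nested groups, which makes it convenient to traverse the structure of the statement recursively.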
This is where the SQL parsing process comes in:

SQL -> lexical analyzer (lexer) -> token stream -> syntax analyzer (parser) -> abstract syntax tree (AST) -> tree structure (tree parse)

Every parse result has a tokens attribute, a list that holds the parsed token sequence. Each token carries type information, and the available types are:

# Special token types
Text = Token.Text
Whitespace = Text.Whitespace
Newline = Whitespace.Newline
Error = Token.Error
# Text that doesn't belong to this lexer (e.g. HTML in PHP)
Other = Token.Other

# Common token types for source code
Keyword = Token.Keyword
Name = Token.Name
Literal = Token.Literal
String = Literal.String
Number = Literal.Number
Punctuation = Token.Punctuation
Operator = Token.Operator
Comparison = Operator.Comparison
Wildcard = Token.Wildcard
Comment = Token.Comment
Assignment = Token.Assignment

# Generic types for non-source code
Generic = Token.Generic
Command = Generic.Command

# String and some others are not direct children of Token.
# alias them:
Token.Token = Token
Token.String = String
Token.Number = Number

# SQL specific tokens
DML = Keyword.DML
DDL = Keyword.DDL
CTE = Keyword.CTE

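Text: the basic text type, generally used for ordinary text within a SQL statement.
Whitespace: whitespace characters, including spaces and tabs, which separate the parts of a SQL statement.
Newline: specifically a line break, marking the start of a new line.
Error: text that could not be recognized or was invalid during parsing.
Other: text that does not belong to the current lexer (here, the SQL lexer), for example another language encountered in embedded SQL (as with HTML in PHP).
Keyword: SQL keywords such as select, from, and where.
DML: data manipulation language keywords such as insert, update, delete, and select.
DDL: data definition language keywords such as create, alter, and drop.
CTE: common table expression keywords such as with.
Name: names of database objects, such as table names and column names.
Literal: literal values, that is, data values written directly in the SQL.
String: string literals such as 'example string'.
Number: numeric literals such as 42 and 3.14.
Punctuation: punctuation such as commas and parentheses, used to separate or enclose parts of the SQL.
Operator: operators such as +, - and /.
Comparison: comparison operators such as =, !=, <, and >; a subtype of Operator.
Wildcard: the wildcard *, as in select *.
Comment: single-line or multi-line SQL comments.
Assignment: the assignment operator, such as := in some SQL dialects.
Generic: generic types for content that is not source code.
Command: commands; this may refer to SQL commands or interactive shell commands.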

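Finer-grained subtypes and related group classes:

Whitespace: whitespace characters (spaces, tabs, newlines, and so on)
Keyword: SQL keywords (select, from, where, and so on)
Name: identifiers (table names, column names, and so on)
String.Single: single-quoted string literals
String.Symbol: double-quoted strings (used for identifiers in some SQL dialects)
Backtick-quoted names (such as MySQL table and column names) are lexed as Name
Identifier: SQL identifiers, including but not limited to table names, column names, and database names; this is a group class in sqlparse.sql, not a token type
Case: compound structures such as case expressions with their when conditions contain multiple child tokens and are represented by group classes such as sqlparse.sql.Case
Number.Integer: integers
Number.Float: floating-point numbers
Number.Hexadecimal: hexadecimal numbers
Operator: operators (such as +, -)
Punctuation: punctuation (commas, semicolons, parentheses, and so on)
Comment.Single: single-line comments
Comment.Multiline: multi-line comments
Wildcard: the wildcard (*)
Function: function calls (such as count(), max()); a group class in sqlparse.sql, not a token type
DML, DDL, and so on: higher-level keyword categories for data manipulation language and data definition language (the aliases quoted above cover DML, DDL, and CTE; no DCL alias appears there)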

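III. Other types

Some elements of a parsed statement are plain tokens with a ttype.

Others are not plain tokens but groups, for example Where, IdentifierList, Identifier, Parenthesis, and Comment.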

import sqlparse

sql = 'select 1 as id, name, case when name = "" then 3 else 4 end as score from tbl where id > 10 limit 100'

stmts = sqlparse.parse(sql)[0].tokens

# Print the class, the token type and the text of every top-level token
for stmt in stmts:
    print(f"{type(stmt)}::{stmt.ttype}::",stmt)
# <class 'sqlparse.sql.Token'>::Token.Keyword.DML:: select
# <class 'sqlparse.sql.Token'>::Token.Text.Whitespace::  
# <class 'sqlparse.sql.IdentifierList'>::None:: 1 as id, name, case when name = "" then 3 else 4 end as score
# <class 'sqlparse.sql.Token'>::Token.Text.Whitespace::  
# <class 'sqlparse.sql.Token'>::Token.Keyword:: from
# <class 'sqlparse.sql.Token'>::Token.Text.Whitespace::  
# <class 'sqlparse.sql.Identifier'>::None:: tbl
# <class 'sqlparse.sql.Token'>::Token.Text.Whitespace::  
# <class 'sqlparse.sql.Where'>::None:: where id > 10 
# <class 'sqlparse.sql.Token'>::Token.Keyword:: limit
# <class 'sqlparse.sql.Token'>::Token.Text.Whitespace::  
# <class 'sqlparse.sql.Token'>::Token.Literal.Number.Integer:: 100

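When a query selects multiple columns or multiple tables, they are wrapped in an IdentifierList; a single table is wrapped in an Identifier; filter conditions are wrapped in a Where; parentheses are wrapped in a Parenthesis; and comments are wrapped in a Comment.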

IV. Example: extracting all queried columns and table names

import re

import sqlparse

sql = 'insert into table inser_tbl partition (dt = dt) select 1 as id, name, case when (name = "" or name = "") then 3 else 4 end as score from tbl where id > 10 limit 100'

stmts = sqlparse.parse(sql)[0].tokens

cols = []
tbls = []
froms = []
wheres = []
last_key = ''
for stmt in stmts:
    # Remember the most recent insert / select / from keyword
    if stmt.value == 'insert' or stmt.value == 'select' or stmt.value == 'from':
        last_key = stmt.value
    # Skip spaces and newlines
    if stmt.ttype is sqlparse.tokens.Text.Whitespace:
        continue
    # DML keyword
    elif stmt.ttype is sqlparse.tokens.Keyword.DML:
        dml = stmt.value
        last_key = dml
    # Several columns or tables
    elif isinstance(stmt, sqlparse.sql.IdentifierList):
        # Decide by the keyword that came before
        if last_key == 'select':
            for identifier in stmt.get_identifiers():
                col_name = identifier.value
                # Keep only the alias when the column is written as "... as alias"
                match = re.search(r'\bas\s+(.*)', col_name, re.I)
                if match:
                    col_name = match.group(1).strip()
                cols.append(col_name)
        elif last_key == 'from':
            for identifier in stmt.get_identifiers():
                froms.append(identifier.value)
        else:
            for identifier in stmt.get_identifiers():
                tbls.append(identifier.value)
    # A single column or table name
    elif isinstance(stmt, sqlparse.sql.Identifier):
        if last_key == 'select':
            cols.append(stmt.value)
        elif last_key == 'from':
            froms.append(stmt.value)
        else:
            tbls.append(stmt.value)
    # Filter conditions
    elif isinstance(stmt, sqlparse.sql.Where):
        wheres.append(stmt.value)
print("cols:", cols)
print("tbls:", tbls)
print("froms:", froms)
print("wheres:", wheres)

# cols: ['id', 'name', 'score']
# tbls: ['inser_tbl']
# froms: ['tbl']
# wheres: ['where id > 10 ']